You have built a massively successful system. The users just can't get enough and request new features. Your developers crank out new services on a regular basis. Your DevOps/SRE team configures and scales your Kubernetes cluster (or clusters). As the system becomes more complicated and sophisticated, you realize that there are common themes that repeat across all your services:
- Advanced load balancing
- Service discovery
- Canary deployments
- Caching
- Tracing a request across multiple microservices
- Authentication between services
- Limiting the number of requests a service handles at a given time
- Automatically retrying failed requests
- Failing over to an alternative component when a component fails consistently
- Collecting metrics on traffic
You quickly realize that these concerns are shared by all your services. Kubernetes helps with some of them, like service discovery and load balancing, but you often need more powerful support. You definitely don't want to implement them in each service separately. The traditional way of addressing these issues is to write a big library that all services use. This is a reasonable approach, but there is a better way - the service mesh.
In this article we will explain what a service mesh is, why it is such an important trend and how a service mesh works with Kubernetes. Then we'll review some of the many existing service meshes and discuss their relative pros and cons.
Let's get going...
Service mesh is an architectural pattern for large-scale cloud-native applications that are composed of many microservices. A lot happens between services. The service mesh moves all these concerns out of your application services and manages them centrally, using proxies that intercept all traffic between services. You then configure the service mesh to perform all the cool stuff on your behalf, such as traffic shaping, security and observability.
Here is what a service mesh looks like:
Note that service meshes are not unique to Kubernetes. Here we focus on Kubernetes, but many of the concepts translate to other systems with a large number of interacting components.
What's a proxy? A proxy is a component that sits in front of a service. When other services talk to your service, they go through the proxy, which can do various things: pass the request through, send it somewhere else, reject it or even modify it. This is similar in spirit to Kubernetes admission controllers.
There are two primary ways to deploy a service mesh into your Kubernetes cluster.
The sidecar container approach injects a proxy container into every pod.
Here is a diagram that shows a service mesh that uses sidecar containers:
Some of the attributes of sidecar containers are:
- No need to deploy an agent on each node
- Ability to deploy different pods with different sidecars (or versions) on the same node
- Each pod has its own copy of the proxy
Are those attributes pros or cons? That depends on context. For example, as an administrator you may prefer to be oblivious to the service mesh, or alternatively you may want to control exactly what's going on at the network management level.
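To make the sidecar approach concrete, here is a minimal sketch of what a pod might look like after proxy injection. All names and images here are hypothetical; a real mesh like Istio injects a container such as istio-proxy automatically via a mutating admission webhook.

```yaml
# Hypothetical pod after sidecar injection (names and images are illustrative).
apiVersion: v1
kind: Pod
metadata:
  name: my-service
  labels:
    app: my-service
spec:
  containers:
  - name: app                     # your actual service, unchanged
    image: example.com/my-service:1.0
    ports:
    - containerPort: 8080
  - name: mesh-proxy              # injected by the mesh
    image: example.com/mesh-proxy:1.0
    ports:
    - containerPort: 15001        # iptables rules redirect all pod traffic here
```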
The node agent approach installs a single agent on each node that intercepts the traffic and performs the routing and other service mesh functions.
Here is a diagram that shows a service mesh that uses node agents:
Some of the attributes of the node agent proxies are:
- More universal (doesn't require Kubernetes)
- More control over the service mesh proxies
- More efficient (no need for deploying a proxy per pod)
- Requires separate installation and maintenance
When thinking about a service mesh there are two separate aspects - the data plane and the control plane. The data plane is the set of proxies that connect your services (either as sidecar containers or node agents). The control plane, as the name suggests, controls the proxies that comprise the data plane. It is often a set of APIs and tools to configure policies, collect metrics and get an aggregated view of your service mesh.
Let's review some of the benefits that a service mesh can bring to your Kubernetes cluster!
Kubernetes services provide a basic form of load balancing, where a pool of backing pods serves requests coming into the service. You can even implement simple canary deployments using services and labels. If you want 10% of your requests to go to version 2 of a service, you can deploy nine pods with version 1 and one pod with version 2. But with a service mesh you can do much more advanced load balancing that operates at the request level rather than the pod level. You can also do load balancing based on request path and parameters, or use different algorithms, like least connections, for super fine-grained control.
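For example, here is a rough sketch of a 90/10 canary split with Istio (discussed later in this article). It assumes a service named my-service with v1/v2 subsets defined in a matching DestinationRule; all names and weights are illustrative:

```yaml
# Istio-style request-level canary split (sketch; assumes v1/v2 subsets
# are defined in a matching DestinationRule).
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
        subset: v1
      weight: 90                  # 90% of requests go to version 1
    - destination:
        host: my-service
        subset: v2
      weight: 10                  # 10% of requests go to the canary
```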
Authentication between services is important for defense in depth. Kubernetes provides strong authentication and authorization around access to cluster resources and network policies, but a service mesh can take it to the next level with automatic mutual TLS and custom authorization.
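As a sketch, here is how Istio can enforce mutual TLS for every workload in a namespace (the namespace name is illustrative, and API versions vary between Istio releases):

```yaml
# Istio-style policy requiring mutual TLS for all workloads in a namespace
# (sketch; the namespace is illustrative).
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: my-namespace
spec:
  mtls:
    mode: STRICT                  # reject plain-text traffic between services
```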
Circuit breaking takes an unresponsive instance out of circulation. That helps prevent the long delays caused by constantly retrying to reach an overloaded or dead pod. Kubernetes has some decent support for unresponsive pods with health checks and readiness probes. But if the problem is a misconfiguration or a problem within the service itself, Kubernetes can't help much. A service mesh operates at a higher level of abstraction and can do circuit breaking based on the results of requests.
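A sketch of what this looks like in Istio: a DestinationRule with outlier detection that ejects a pod from the load-balancing pool after consecutive 5xx responses. Names and thresholds are illustrative, and field names differ slightly across Istio versions:

```yaml
# Istio-style circuit breaking based on request results (sketch).
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service
spec:
  host: my-service
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100   # queue limit before requests are rejected
    outlierDetection:
      consecutive5xxErrors: 5          # eject after 5 consecutive 5xx responses
      interval: 10s                    # how often hosts are analyzed
      baseEjectionTime: 30s            # how long an ejected pod stays out
      maxEjectionPercent: 50           # never eject more than half the pool
```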
Rate limiting is important to protect against denial-of-service attacks, where attackers bombard your system with lots of requests, hoping to bring it to its knees via resource exhaustion. It also helps you avoid paying enormous bills if you misconfigure your system or a load test goes out of control. Another use case is preventing cascading failures, where excessive load on one service propagates to lots of internal services.
A service mesh lets you define and control those limits centrally and without impacting the services themselves.
Kubernetes doesn't provide any built-in help here.
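The mechanics differ per mesh. Istio, for example, exposes Envoy's token-bucket local rate limiter through an EnvoyFilter resource. The following is a heavily trimmed sketch (exact fields and filter wiring vary by Envoy/Istio version, and the workload label is illustrative):

```yaml
# Heavily trimmed sketch of Envoy's local rate limiter applied via Istio
# (illustrative; real configs need extra match/filter-chain details).
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: my-service-ratelimit
spec:
  workloadSelector:
    labels:
      app: my-service
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.local_ratelimit
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
          stat_prefix: http_local_rate_limiter
          token_bucket:
            max_tokens: 100       # burst capacity
            tokens_per_fill: 100  # tokens added each interval
            fill_interval: 1s     # roughly 100 requests per second
```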
Building distributed systems is all about building a reliable system out of unreliable components. In a large microservice-based distributed system some services may be unreachable temporarily due to networking issues, maintenance or upgrades.
A service mesh can be configured to automatically retry failed requests. Retries address temporary, intermittent failures in a smooth and streamlined way. However, if a service is down or unreachable for a prolonged period of time, it is often best to fail over to an alternative location (e.g. in another region).
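A sketch of a retry policy in an Istio VirtualService (names and values are illustrative):

```yaml
# Istio-style automatic retries (sketch).
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
    retries:
      attempts: 3                      # retry a failed request up to 3 times
      perTryTimeout: 2s                # each attempt gets 2 seconds
      retryOn: 5xx,connect-failure     # which failures trigger a retry
```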
Kubernetes automatically restarts failed containers, and replica sets/deployments ensure enough pods are always running. But as long as the pods and containers are running, Kubernetes will not retry requests or fail over in the case of consistent failures.
Caching can be a great performance enhancer and money saver, especially for read-heavy workloads. A service mesh can be configured to cache the results of previous requests and return them instead of bothering the service. It may be even more powerful for serverless functions, where each invocation may carry overhead.
Again, no assistance from Kubernetes on this front, although some ingress controllers can provide caching support.
Metrics are one of the cornerstones of observability. A service mesh is aware of all traffic between services and can collect a lot of useful metrics automatically. Kubernetes provides the metrics server, which collects CPU and memory usage for pods and containers. It is used by the horizontal pod autoscaler and the kubectl top command. You can also record custom metrics, but Kubernetes will not do it for you. A service mesh can be configured to collect request-level metrics.
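As a sketch, recent Istio versions expose this through the Telemetry API, which routes the automatically collected request metrics (such as istio_requests_total) to a metrics provider. This assumes a mesh with a Prometheus provider configured:

```yaml
# Istio Telemetry resource (sketch; assumes a recent Istio with a
# "prometheus" metrics provider configured in the mesh config).
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system       # root namespace, so this applies mesh-wide
spec:
  metrics:
  - providers:
    - name: prometheus          # emit standard metrics like istio_requests_total
```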
Debugging and troubleshooting a distributed system made of many microservices is not easy. A request often travels across multiple services. Distributed tracing (yet another observability cornerstone) lets you track the path of a request across all those services. Kubernetes doesn't have any built-in distributed tracing capability, although multiple projects provide solutions for Kubernetes. A service mesh can integrate with those solutions, like Jaeger or OpenZipkin, and help you figure out what's wrong when things go south.
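One caveat worth making explicit: the proxies generate spans automatically, but your services still need to forward the trace context headers so the mesh can stitch the spans into a single trace. For Istio with Zipkin-style (B3) propagation, the headers to forward are:

```yaml
# Trace context headers an application must forward on outgoing requests
# (Zipkin/B3 propagation as used by Istio; illustrative list).
- x-request-id
- x-b3-traceid
- x-b3-spanid
- x-b3-parentspanid
- x-b3-sampled
- x-b3-flags
```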
OK. Now, we get what a service mesh is. But, do you really need one?
Yes, you do!
If you build and manage a large-scale cloud-native application, you want many, if not all, of the capabilities of a service mesh. Let's see why.
When you write a microservice, the actual logic can be very minimal. Your system is composed of a large number of relatively simple components. Even microservices that perform complex computations typically utilize libraries for the heavy lifting. The code for the service itself could be extremely simple, but when you add all the important security, observability and reliability aspects, the code can balloon. All those critical aspects have nothing to do with the functionality of the service itself. They are all orthogonal operational concerns, and they are a burden for the developers of the service. This is reminiscent of aspect-oriented programming.
The service mesh provides the same benefits, but actually makes things easier because it can be bolted on completely transparently, without changes to the application.
Before the age of the service mesh, big client libraries ruled the land and centralized all those operational concerns. Every service had to include those libraries and use them in the same way. Some examples are Hystrix from Netflix (Java) and Finagle from Twitter (Scala targeting the JVM).
Here is what a system where services use a big client library looks like:
The library approach works, but it forces you to make a hard choice: either you limit your microservice implementations to a single programming language, or you have to develop and support this important library for multiple programming languages. For large organizations, the single-language approach is often unacceptable due to existing legacy code or acquisitions.
The other major problem with the library approach is that when you make changes to the library, you must upgrade ALL your services to use the latest version or suffer the consequences of incompatible services. In some cases, like fixing security issues, it's a hard requirement.
Upgrading all services for large systems is often a serious project and always disruptive to the developers.
With a service mesh you have no programming language limitations and upgrades can be done mostly transparently by cluster operators without upgrading and redeploying services.
Serverless is the new buzzword. There are two types of serverless:
1. You don't have to manage your servers (or your nodes, in the case of Kubernetes)
2. Function as a service (a.k.a. FaaS)
The first type is supported on Kubernetes by cluster autoscaling. If your cluster needs more nodes, they are added automatically. Since services and pods on Kubernetes normally don't care which node they run on, the service mesh works pretty much the same, whether it uses sidecar containers or node agents (deployed as a DaemonSet).
The second type, function as a service, is a little more nuanced. There are many implementations of FaaS on Kubernetes. They are implemented in different ways, and the details matter for a service mesh. Some of the most common implementations, like Kubeless and Fission, already integrate with the Istio service mesh.
The bottom line is that on Kubernetes there isn't much of a difference between services and serverless functions. Services are best for long-running processes, and serverless functions are better suited for event-driven invocations. Both can benefit from a service mesh.
Let's do a quick review of the field. There are many service meshes for Kubernetes out there, with interesting relationships between them.
Envoy is a very versatile and high-performance L7 proxy developed by Lyft. It provides many service mesh capabilities, but is considered difficult to configure. Many other service meshes for Kubernetes are built on top of Envoy. The Envoy project itself recommends using other open source projects like Ambassador and Gloo as an Ingress controller and/or API gateway on Kubernetes.
Istio is arguably the most popular service mesh on Kubernetes. It is built on top of Envoy and provides a Kubernetes-friendly (YAML manifests) way to configure it. Istio was started by Google, IBM and Lyft. It is super easy (one click) to install on Google GKE, and it has captured a lot of mindshare.
Linkerd 2 is a service mesh developed by Buoyant. Buoyant coined the term "service mesh" and introduced it to the world a few years ago. They initially developed Linkerd as a Scala-based service mesh for multiple platforms, including Kubernetes, but then decided to develop a better and faster product more suitable for Kubernetes. That's where Linkerd 2 comes in. The data plane (proxy layer) of Linkerd 2 is implemented in Rust and the control plane in Go. It is one of the rare few service meshes that don't rely on Envoy.
Kuma is a service mesh developed by Kong. It is also built on top of Envoy. According to the Kuma team, it is simpler than Istio on Kubernetes. It can also work in other environments besides Kubernetes.
Maesh is an interesting service mesh from the creators of Traefik. It uses the node agent approach. It draws its capabilities from Traefik middleware, and you can configure it using annotations.
App Mesh is AWS's dedicated service mesh. It supports EC2, Fargate, ECS, EKS and plain Kubernetes. It is also built on top of Envoy and may be a good option if you want a service mesh that is deeply integrated with AWS services. It lags behind Istio as far as features and maturity.
All the service meshes we have discussed so far operate at L4 (TCP, UDP) or L7 (HTTP, HTTP/2) of the network stack. Network Service Mesh is quite different. It operates at the L2/L3 level and is designed to bring advanced networking capabilities to Kubernetes:
- Heterogeneous network configurations
- Tunneling and networking context as first-class citizens
- Policy-driven service function chaining
- On-demand, dynamic, negotiated connections
- Exotic protocols
A service mesh is super useful, but if you don't use its capabilities it just introduces an extra layer of indirection and complexity. If your use case is more lightweight, you may get all you need from a decent API gateway or a sophisticated ingress controller. Some options are:
- Traefik
- Gloo
- Ambassador
- Contour
- Knative
Service meshes are an exciting technology. They provide real benefits for complicated distributed systems. Kubernetes provides a solid container orchestration platform and leaves many opportunities for a service mesh to provide added value. In the future, I believe that service meshes will become a staple of well-architected distributed systems.