
Congratulations! You have built a successful system. You have deployed it on Kubernetes. And the users just can't get enough. However, as traffic grows so does load on your cluster. How do you make sure your system is always up and running and your clusters grow with the demand?
In this article we’ll introduce you to intent-based capacity planning and how to pursue it in your Kubernetes cluster. First, we'll look at what intent-based capacity planning is, then we'll discuss capacity planning in the cloud and specifically for Kubernetes. Later, we'll cover Kubernetes autoscaling, namespaces and monitoring cluster utilization while tying it all back to intent-based capacity planning.
Capacity planning is the art of forecasting the resources needed to satisfy the demands of your system over time and making sure those resources are available. It's important to plan your capacity to match your actual needs as closely as possible. If you underestimate your needs, your services will be unavailable, degraded or slow. If you overestimate your needs, you'll pay for capacity you don't use. The traditional approach to capacity planning is to collect forecasted resource demands from every team or department several quarters or even years in advance, aggregate them and come up with a purchase and allocation plan.
This approach is very brittle and fragile even for a single service for the following reasons:
Now, consider dependencies between services, the impact of marketing campaigns, the introduction of new services and even mergers and acquisitions. For example, when Microsoft announced its acquisition of GitHub in June 2018, GitLab suddenly saw 10x the number of repositories imported as a lot of people and organizations moved from GitHub to GitLab.
Capacity planning is very challenging, and one of the things that makes it even more challenging is that it is traditionally done with very low-level concepts such as servers, CPUs, memory, and disks. With this fine-grained resource accounting, it is extremely difficult to get even a reasonable approximation of the needed capacity.
This is where intent-based capacity planning comes in. It was originally conceived by the Google SRE organization. The main concept is to raise the level of abstraction: focus on the SLOs and the dependencies between services instead of the nitty-gritty details (servers, CPU, disks). This allows software to provide a reasonable bin-packing solution that takes into account the actual needs of a service.
For example, instead of requesting 20 servers in regions A, B and C for service X, an intent-based specification would be: run service X with 3 nines of availability and latency of less than 200 milliseconds.
The intent-based specification is much more flexible. Running the service according to this intent will require different amounts of resources at different times. Intent-based capacity planning is able to accommodate these changing needs. But, for companies that don't run planet-scale infrastructure, it realistically requires operating in the cloud.
Back in the day capacity planning was all about physical data centers. You had to actually purchase physical hardware, have it installed, configured, upgraded and account for wear and tear. Capacity planning was painful, slow and it was difficult to adapt to change. Then the cloud emerged and it was awesome! Suddenly you had all these new toys: elastic provisioning, spot instances, reserved instances, serverless functions, and bottomless storage. The elastic aspect of cloud provisioning means that your capacity is not dominated by the maximum volume you have to support. You can shrink your capacity when it is not needed. This is a fundamental shift in capacity planning from a mostly proactive discipline that tries to anticipate future demand to a more reactive discipline that can respond almost on the fly to surges, ebbs and flows in demand. Is that cool or what?
Elastic provisioning means you can provision a new instance on demand, within a few minutes, whenever you need one. This can be done programmatically via an API, which means it can be incorporated into a mostly automated SRE solution. Such a solution can implement intent-based capacity planning and, based on the high-level intents, dynamically adjust the actual resources (primarily instances).
On-demand elastic provisioning already provides a lot of flexibility, but the cloud offers even more economical options. If certain workloads don't have high-availability requirements and can accept short unplanned downtime (e.g. low-priority batch jobs or unsupervised long tests), then you can specify it as a high-level intent and your intent-based capacity planner may use a cheaper instance type that is not as reliable as on-demand instances. These instances are called spot instances on AWS, low-priority VMs on Azure and preemptible VM instances on Google Cloud. The idea is that if the cloud provider suddenly sees a surge of requests for extra capacity, it can just evict your workloads and use those instances for the high-priority workloads that someone else paid for. Your workloads will be rescheduled on other instances when they become available.
Reserved instances provide another dimension of flexibility in cost management. They are at the other end of the spectrum. When you have demand that never goes below a certain level, you can benefit (as in pay less) by making a long-term reservation of resources. You still get to use on-demand and low-priority instances for the dynamic part of your workload.
Recently serverless functions (aka function as a service) became a hot trend. The basic idea is that you write a function and upload it to your cloud provider, which takes care of provisioning the necessary resources to run it. You only pay for what you use. There are some use cases that really benefit from such an ad-hoc mode of operation.
Do you like losing data? I didn't think so. Why not let the cloud providers worry about the disks, the backups, the configuration and the patches? Cloud providers often offer several data stores you can use without the significant burden of managing storage clusters yourself. The most fundamental store is a blob store like AWS S3, Google Cloud Storage and Azure Blob Storage. The major cloud providers offer three classes of storage: hot (standard), cool and cold (archive). Access gets slower as you go colder, but the price drops too. Then there are many other types of managed data stores like key-value stores, data warehouses and even managed relational databases. What this means is that capacity planning for persistent data can be automated and incorporated into intent-based capacity planning by specifying high-level intents like "store data from the last 6 months as highly available and then archive".
Now that we've seen the amazing benefits the cloud brings to SRE, let's have some fun with the specifics of capacity planning on Kubernetes.
Kubernetes was designed as an orchestration engine for containerized applications. A Kubernetes cluster is a collection of physical resources, like instances and storage, that are networked together. The Kubernetes scheduler does the hard work of bin packing the containers you deploy to the cluster. Kubernetes does a lot more, but here we focus only on the orchestration aspect and how it relates to intent-based capacity planning. The greatest thing about Kubernetes is that you don't need to plan the capacity per service, but in aggregate for the entire cluster, since Kubernetes will globally schedule your workloads across all the cluster's resources.
If you want to learn more about Kubernetes check out my book: Mastering Kubernetes - 2nd edition
Let's look quickly at the primary players: pods, nodes and storage.
Pods are the unit of deployment on Kubernetes. You can think about pods as the atoms of Kubernetes. They contain one or more containers (think quarks). Usually, you put one container in a pod. In some cases, like service mesh proxies or logging agents, an additional container will be packed into the pod along with the primary application container. Kubernetes will always schedule all the containers in a pod to the same node, and the containers will share an IP address and can communicate with each other via localhost or shared volumes. You can run a pod directly on Kubernetes, but the best practice is to specify a Deployment object. The Deployment, in addition to providing the pod specification, also specifies the number of required replicas. A pod dies for whatever reason (including its hosting node crashing or becoming unreachable)? No problem - Kubernetes will schedule a new pod to ensure the correct number of replicas is always running. Here is a Kubernetes Deployment manifest in YAML.
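A minimal sketch of what such a Deployment manifest might look like. The name `trouble` matches the deployment referred to later in this article; the image, port and replica count are assumptions made for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trouble
  labels:
    app: trouble
spec:
  replicas: 3                  # Kubernetes keeps this many pods running
  selector:
    matchLabels:
      app: trouble             # must match the pod template labels below
  template:
    metadata:
      labels:
        app: trouble
    spec:
      containers:
      - name: trouble
        image: example.com/trouble:1.0   # hypothetical image
        ports:
        - containerPort: 8080            # assumed port
```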

Your cluster needs nodes, buddy! No nodes, no pods. Nodes are the physical or virtual machines that the cluster is made of. Kubernetes schedules pods to run on the cluster nodes. This is the most important resource for capacity planning with Kubernetes. Here is how to get the list of nodes of a Kubernetes cluster using the kubectl CLI:

In this cluster there is one master node called k3d-k3s-default-server and three worker nodes.
The other important resource that can be managed and planned with Kubernetes is storage. Sure, nodes are where your pods run, but if you don't want your precious data to disappear when a node goes down, you'd better think about persistent storage. Kubernetes has a conceptual model of storage built on storage classes (types of storage), persistent volumes and persistent volume claims. You can provision volumes statically or dynamically via cloud provider integration. Pods make persistent volume claims asking for a certain amount of storage, and if the storage is provisioned successfully it will be mounted into the container. Here is how to add storage to the "trouble" deployment. First, define a persistent volume claim:
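A claim along these lines would do the job. This is a sketch: the claim name, the size and the storage class are assumptions, and the available storage classes depend on your cluster and cloud provider.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: trouble-data            # hypothetical claim name
spec:
  accessModes:
  - ReadWriteOnce               # mountable read-write by a single node
  storageClassName: standard    # assumed storage class
  resources:
    requests:
      storage: 1Gi              # assumed size
```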

Then, incorporate it into the pod spec of the deployment:
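As a sketch, the relevant fragment of the Deployment's pod template could look like this. The claim name `trouble-data`, the volume name, the mount path and the image are all hypothetical.

```yaml
# Fragment of the Deployment: spec.template.spec
spec:
  containers:
  - name: trouble
    image: example.com/trouble:1.0      # hypothetical image
    volumeMounts:
    - name: data
      mountPath: /data                  # assumed mount path
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: trouble-data           # must match the claim's name
```

Keep in mind that a ReadWriteOnce volume can be mounted by only one node at a time, so replicas that land on different nodes can't all share it. When each replica needs its own volume, a StatefulSet with volume claim templates is the usual answer.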

Kubernetes provides sophisticated control over the scheduling of pods to nodes. The end result is that the Kubernetes scheduler will bin pack all the pods into the cluster nodes based on the necessary requirements and constraints. But what if you want to aim higher? For example, you don't care about the number of pods or even the CPU load on each node. What you really care about is the latency of requests. How do you reconcile a high-level intent like having a latency of less than 200 milliseconds for the 95th percentile with the low-level mechanisms of Kubernetes? Well, lucky for you, Kubernetes autoscaling supports custom metrics. You can hook up the horizontal pod autoscaler to scale on arbitrary custom metrics like latency or requests per second. But it doesn't stop there. You can even use external metrics not related to Kubernetes objects! For example, if you use an external queue system outside Kubernetes, you can hook up high-level metrics like the number of requests in the queue or, even better, time in the queue (average latency) to high-level capacity planning. If you have items in your queue that are older than X seconds, add another instance!
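To make the latency intent concrete, here is a sketch of a horizontal pod autoscaler that scales on a per-pod custom metric. It assumes a cluster that serves the `autoscaling/v2` API and a metrics adapter (for example, one backed by Prometheus) that exposes the metric through the custom metrics API. The metric name `http_request_latency_p95_ms`, the target Deployment and the replica bounds are hypothetical.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: trouble
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: trouble                 # hypothetical target Deployment
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Pods                    # a custom metric reported per pod
    pods:
      metric:
        name: http_request_latency_p95_ms   # hypothetical metric name
      target:
        type: AverageValue
        averageValue: "200"       # add replicas when the average exceeds 200
```

For a queue that lives outside the cluster, the same object would use a metric of `type: External` instead, served by an external metrics adapter.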
There's more. You can write your own controller that listens to cluster events, monitors collections of pods and makes very high-level capacity decisions automatically. Consider a service that starts to slow down. Maybe you need to schedule a few more instances of this service, or maybe one of its dependencies is overloaded, or the data store it writes to.
Let's take a look at a bunch of low-level specifications and constraints that Kubernetes has to consider like node selectors, affinity, anti-affinity, taints, and tolerations. As you'll see trying to accommodate all that manually or even with custom scripts will pretty much require that you reinvent Kubernetes.
Alright. How do you associate pods with specific nodes? Node selectors are the most straightforward mechanism. You can label groups of nodes and then, in the pod spec, specify that certain pods should be scheduled to run only on nodes that carry those labels. For example, suppose certain pods require a fast local disk. You can label nodes that have SSD disks with "diskType: SSD" and then add a nodeSelector section to the pod spec:
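A minimal sketch, assuming the nodes were labeled beforehand (for instance with `kubectl label nodes <node-name> diskType=SSD`). The pod name and image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fast-disk-pod             # hypothetical pod name
spec:
  nodeSelector:
    diskType: SSD                 # schedule only on nodes with this label
  containers:
  - name: app
    image: example.com/app:1.0    # hypothetical image
```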

Affinity and anti-affinity are a little more complicated but provide many benefits over node selectors:
Here is an example of anti-affinity between pods, where pod-a shouldn't be scheduled on the same node as pod-b:
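One way to express that rule is a required pod anti-affinity term on pod-a. This sketch assumes pod-b carries the label `app: pod-b`; the label and the image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-a
  labels:
    app: pod-a
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: pod-b                        # assumed label on pod-b
        topologyKey: kubernetes.io/hostname   # "same node" granularity
  containers:
  - name: app
    image: example.com/pod-a:1.0              # hypothetical image
```

Changing the `topologyKey` to a zone label would keep the two pods in different zones rather than just on different nodes.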

Node selectors, affinity and anti-affinity are all mechanisms where the pods dictate where they can or cannot be scheduled. The nodes don't get a say in the matter. Sometimes, you want to specify things at the node level. In this case, you use taints and tolerations. When you add a taint to a node, pods will not be scheduled on this node unless they have the necessary toleration.
Here are the IP addresses of the nodes that have pods scheduled on them:

Let's taint one of these nodes:

Here is how to taint worker-1:

After a short while Kubernetes will evict all the pods from worker-1 and redistribute them to the other nodes:

To get pods scheduled to worker-1 we need to add a toleration to the pod spec.
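A toleration has to match the taint's key, value and effect, so any sketch has to assume what the taint looks like. Assuming a hypothetical taint applied with `kubectl taint nodes worker-1 dedicated=special:NoExecute`, a matching toleration in the pod spec might be:

```yaml
# Fragment of a pod spec
spec:
  tolerations:
  - key: "dedicated"          # hypothetical taint key
    operator: "Equal"
    value: "special"          # hypothetical taint value
    effect: "NoExecute"       # must match the taint's effect
```

The `NoExecute` effect is the one that evicts pods already running on the node. A `NoSchedule` taint only keeps new pods away.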

Kubernetes itself can add condition-based taints to nodes that suffer from problems. This is a great automatic way to manage intent-based capacity planning. If nodes are tainted by conditions like memory pressure, disk pressure or being unreachable, then even though they belong to the cluster they don't count towards its capacity, and more nodes can be added.
A big part of allocating pods to nodes is knowing how many resources a pod needs. When Kubernetes has this information, it can make intelligent decisions and also manage the total capacity of the cluster. It is the job of the developers and operators to provide this information to Kubernetes. The vehicle for providing it is requests and limits.
When resource requests are specified, a pod will be scheduled only to a node that has at least the requested amount of resources available for all its containers. Note that the pod may use more resources than requested! When a resource limit is specified, Kubernetes enforces it: a container that exceeds its memory limit may be terminated and then restarted, while a container that hits its CPU limit is throttled. This property lets Kubernetes perform efficient bin packing because it can tell how many resources at most each pod needs.
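In a manifest, requests and limits live under each container's `resources` field. A sketch with made-up numbers and a hypothetical image:

```yaml
# Fragment of a pod spec
containers:
- name: trouble
  image: example.com/trouble:1.0   # hypothetical image
  resources:
    requests:                      # used by the scheduler to pick a node
      cpu: 100m
      memory: 64Mi
    limits:                        # enforced at runtime
      cpu: 200m
      memory: 128Mi
```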
You can specify resource requests and limits as part of the container spec, but it is recommended to have default values specified in case some pods don't set their own. This is done using a LimitRange object. The following example ensures that by default a container will use at most 200 millicores and 6 MiB of memory. The default request means that each container will have 100 millicores and 5 MiB of memory at its disposal.
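A LimitRange with those defaults could be sketched as follows. The values come from the description above; the object name is hypothetical.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits      # hypothetical name
spec:
  limits:
  - type: Container
    default:                # default limits for containers that set none
      cpu: 200m
      memory: 6Mi
    defaultRequest:         # default requests for containers that set none
      cpu: 100m
      memory: 5Mi
```

A LimitRange applies to the namespace it is created in, and only to pods created after it exists.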

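In the example, the requests and limits were different, which is allowed. But best practice is to have the requests equal the limits, which gives the pod the Guaranteed quality-of-service class and makes its resource usage predictable.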
The most common resources are CPU and memory, but various Kubernetes objects and even non-Kubernetes resources can be specified using extended resource requests and limits.
OK. Enough with the big words. Next, let's talk namespaces.
The namespace concept of Kubernetes is very powerful. It allows operators to divide and conquer a Kubernetes cluster between different organizational entities (teams, departments), and when hosting customer workloads it allows for easy multi-tenancy. From the perspective of capacity planning, it allows managing capacity at the namespace level by specifying resource quotas per namespace to ensure a rogue namespace doesn't hog all the cluster resources. Here is an example where the number of pods is capped at 20, the total requests of all containers in the namespace are capped at 1 CPU core and 20 MiB of memory, and their total limits are capped at 2 CPU cores and 2 GiB of memory.
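A ResourceQuota carrying those numbers might look like the sketch below; the object name is hypothetical. Note that the values under `hard` are totals for the whole namespace, not per-container guarantees.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota       # hypothetical name
spec:
  hard:
    pods: "20"              # at most 20 pods in the namespace
    requests.cpu: "1"       # sum of all CPU requests
    requests.memory: 20Mi   # sum of all memory requests
    limits.cpu: "2"         # sum of all CPU limits
    limits.memory: 2Gi      # sum of all memory limits
```

Once a quota covers CPU or memory, every new pod in that namespace must declare the corresponding requests and limits (or get them from a LimitRange), otherwise it is rejected.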

You can apply this quota to any namespace.

Ensuring that the cluster has enough resources to satisfy all the workloads and specifying reasonable limits and quotas is a big part of intent-based capacity planning on Kubernetes. But, it is also important to keep track of the actual utilization of resources. There are two reasons:
1. We may have allocated too many resources that are not utilized
2. The demand for resources might get close to the limits, which then need to be revised
When working with cloud providers like GCE, AWS, and Azure you should pay a lot of attention to the quotas and limits they impose and make sure to stay ahead of the curve. I've been bitten several times by running into an obscure limit such as the number of projects on GCE (not even a tangible resource). The problem with cloud provider quotas is that to increase them you usually have to file a request that needs to be reviewed by a person and it can take 2-3 business days. Obviously, this is not very elastic and agile.
Last but not least, the cluster autoscaler (CA) is a Kubernetes project (not part of the core) that can scale the number of nodes in the cluster to match the actual load, taking into account all the constraints and restrictions defined for the cluster, such as pod priorities and preemption. It constantly monitors the cluster, and if it notices pending pods that can't be scheduled due to insufficient resources, it will add nodes to the cluster.
The CA works well with the horizontal pod autoscaler (HPA). If the HPA attempts to create new replicas of a certain pod, but they can't be scheduled due to insufficient capacity then the CA will add new nodes to the cluster. The CA will also scale down the number of nodes if the total load on the cluster goes down. This way you don't waste resources and keep the cluster utilization high.
The CA is integrated with your cloud provider because it needs to invoke your cloud API in order to provision more nodes for your cluster. See your cloud provider documentation for installation and configuration.
Intent-based capacity planning is an important practice for large-scale cloud-based systems. The dynamic nature of cloud-based systems makes traditional capacity planning very challenging. Using high-level intents makes it more practical. When using Kubernetes to orchestrate containerized applications you can benefit from multiple mechanisms that support intent-based capacity planning as well as automatic scaling, self-healing, scheduling, and load shifting. The trick is to be able to translate the high-level intents to the low-level mechanisms Kubernetes provides. It takes knowledge and effort, but for large systems, it's super useful. Give it a try.
Resources:
https://www.packtpub.com/application-development/mastering-kubernetes-second-edition
https://landing.google.com/sre/sre-book/chapters/software-engineering-in-sre/
https://dzone.com/articles/gitlab-benefits-from-githubmicrosoft-acquisition
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md