Chapter 8:

Canary Deployment: Tutorial & Examples

March 8, 2023
12 min

Releasing new features in a SaaS product involves many moving parts. Rolling out changes to a live environment raises concerns around downtime, cost, and ease of rollback.

The canary deployment strategy employs the gradual rollout of new versions of the software. We can detect unforeseen errors by starting small and testing a small percentage of traffic on the latest version while the older stable version serves most of the traffic. This avoids widespread issues created by the new version. The older nodes are gradually replaced with new versions when confidence in the release is sufficient.

The origin of the term “canary” in this context comes from the mining industry. Coal miners used to send canary birds into mines to detect toxic gasses. The miners determined whether the mine was safe based on the canary's reaction. If the bird died or was negatively affected, the mine was unsafe. If the bird remained healthy, the miners were good to proceed with work. This is precisely the logic we use in canary deployments.

In general, the strategies discussed here may apply to multiple deployment scenarios. In this post, we will stick to Kubernetes and walk through an example demonstrating the benefits of canary deployments.

Summary of key canary deployment concepts

Some of the key concepts discussed in this post are summarized below.

Traffic/Ingress: All the incoming requests trying to access the SaaS product.
Services: Cloud-native SaaS products are built on a microservice architecture, in which each service is a component that serves a specific purpose. During a release, all or some of the services may need to be replaced with new versions.
Load balancing: Load balancers spread incoming traffic across multiple instances of a service to reduce latency and improve performance. Load-balancing rules can also be configured to route traffic to specific nodes based on request parameters.
Basic deployment: The most straightforward strategy: all older versions of the services are removed, and the new versions are created afterwards. This involves downtime.
Rolling deployment: Existing instances/pods are gradually replaced with new versions. It is possible to specify the maximum number of pods taken down for replacement and the minimum number of healthy pods that must remain.
Blue-green deployment: A parallel environment is created with the new version, and traffic is switched to it by adjusting load balancer settings. Once the switch is successful, the older environment is decommissioned.
Canary deployment: Controlled, gradual deployment of new pods/instances, which facilitates easy rollbacks and avoids downtime.
Rollback: If a deployment fails, operations must be rolled back to the previous stable version.
Downtime: The length of time an application is unavailable to serve requests.

Canary deployment overview

Although this post focuses mainly on explaining and demonstrating the canary deployment strategy for Kubernetes, let’s cover the basics of other strategies for context.

Basic/recreate deployment

This is the simplest deployment strategy. All the pods are replaced with a new version of the service or application at once. This causes downtime; if the deployment fails, it can take longer to roll back to the previous stable version.

Pros:
- Simple to implement.

Cons:
- Cannot avoid downtime/outage.
- Difficult to roll back.
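
For reference, the recreate behavior can be declared explicitly in a Kubernetes Deployment manifest. The snippet below is a minimal sketch with illustrative names and image, not part of the demo later in this post:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp-deployment          # illustrative name
    spec:
      replicas: 10
      strategy:
        type: Recreate                # terminate all old pods before creating new ones
      selector:
        matchLabels:
          app: myapp
      template:
        metadata:
          labels:
            app: myapp
        spec:
          containers:
            - name: myapp
              image: myregistry/myapp:v2.0   # illustrative image tag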

Rolling deployment

Rolling deployment is the default strategy in Kubernetes. It is intended for releasing new features without downtime. When the pod specification changes, Kubernetes starts replacing the currently deployed pods with pods running the new image version.

The maxSurge and maxUnavailable parameters in the deployment YAML control this behavior. maxSurge indicates how many pods may be created beyond the desired number of replicas; this extra capacity is used to bring up pods with the new image version. maxUnavailable defines the maximum number of pods that may be unavailable during the update. Together, these parameters ensure that not all the pods are decommissioned at once.
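
For illustration, the excerpt below shows where these parameters sit in a Deployment spec; the values are illustrative, and the rest of the manifest is omitted:

    spec:
      replicas: 10
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 2            # up to 2 extra pods may be created during the rollout
          maxUnavailable: 1      # at most 1 pod may be unavailable at any time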

A rolling deployment is a safer, better (no downtime) way to release new features than a basic deployment. However, given its nature, it can be time-consuming where testing is concerned, and the same is true for a full rollout or rollback.

Pros:
- Default Kubernetes behavior, easy to implement.
- No downtime.
- No tweaking of the load balancer.

Cons:
- Time-consuming.
- Complex to roll back.

Blue-green deployment

This strategy creates a parallel environment with the same infrastructure but a new version of the application code. This new environment is called the staging environment, and it is where tests can be performed before the production release.

This approach allows us to thoroughly test the staging environments without worrying about downtime, as the production serves traffic in complete isolation. Typically, the staging environment is exposed to internal consumers or a selected set of users to ensure its stability before release.

Once the stability is established, the incoming traffic is switched to the staging environment at the load balancer level, and the older environment is decommissioned.
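
In Kubernetes, one common way to perform this switch is to repoint the Service's selector from the old (blue) pods to the new (green) pods. The sketch below assumes the two Deployments label their pods with a version label; the names and labels are illustrative, and this differs from the shared-label setup used in the canary demo later in this post:

    apiVersion: v1
    kind: Service
    metadata:
      name: myapp-service            # illustrative name
    spec:
      type: LoadBalancer
      selector:
        app: myapp
        version: green               # change from "blue" to "green" to switch all traffic
      ports:
        - port: 80
          targetPort: 80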

Pros:
- Safe way to release new features to a SaaS environment.
- No downtime.

Cons:
- Can be costly, as full-scale parallel infrastructure exists for a while.
- Needs configuration changes at the load balancer.

Canary deployment

Canary deployment provides more control over releases. With a canary deployment strategy, there is no need for a separate environment, and thus no additional cost. User acceptance testing can happen in production with minimal impact, and rollbacks are easier. Automated testing strategies can also be used to gauge the impact of a change, especially when a well-defined, low-impact target group receives the new features first.

The canary deployment concept centers on introducing production changes in small increments of pods/instances serving the traffic. Out of all the currently deployed pods, a small percentage are replaced with new versions of the application image.

The load balancer continues to distribute traffic across all the pods, so a small percentage of requests is served by the new version of the application image. For example, if 100 pods are running in the production environment and serving 100% of the requests, we can start by replacing 2% of the pods (i.e., 2 out of 100 pods) with the new image.

The load balancer still distributes the traffic to all 100 pods. Consequently, the two new pods get tested. If the requests are being served by these new pods successfully, then more pods are replaced. Let’s say we replace 10% of the pods in the next step. 

If something goes wrong, administrators can roll back by simply resetting the number of replicas in the YAML files for older and newer deployments. In the case of our example, the replica count of the YAML file responsible for deploying older versions can be reset to 100 and the new one to 0. Kubernetes will take care of this change automatically.

Such a gradual increase in the number of new pods builds confidence until, eventually, 100% of the pods run the new application image, marking the release as successful.
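
These steps can be carried out either by editing the replica counts in the two deployment YAML files and re-applying them, or directly with kubectl scale. The commands below are a sketch of the first two increments described above; the deployment names are illustrative:

    # Step 1: shift 2% of the capacity (2 of 100 pods) to the new version
    kubectl scale deployment myapp-v1 --replicas=98
    kubectl scale deployment myapp-v2 --replicas=2

    # Step 2: if the canary pods behave well, move to 10%
    kubectl scale deployment myapp-v1 --replicas=90
    kubectl scale deployment myapp-v2 --replicas=10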

Canary deployment demonstration

Suppose we have a custom Nginx service served by ten pods deployed within our Kubernetes cluster. The current version of our service is v1.0, and we would like to release a new version, v2.0. For this example, the difference between versions is the content on the web pages as shown in the examples below.

(Screenshots: the web page served when v1.0 is deployed and when v2.0 is deployed.)

Let us assume we have deployed 10 pods of v1.0 on our K8s cluster as shown in the diagram below. 

Current state

This is achieved by creating the corresponding specifications in our deployment YAML file for v1.0. In the example below, we use an image tagged v1.0, and the spec creates 10 replicas of it.

---


    apiVersion: apps/v1
    kind: Deployment
    
    metadata:
     name: mynginx-v1-deployment
     labels:
       app: mynginx
    spec:
     replicas: 10
     selector:
       matchLabels:
         app: mynginx
    
     template:
       metadata:
         labels:
           app: mynginx
       spec:
         containers:
           - name: mynginx-v1
             image: sumeetninawe/mynginx:v1.0
             resources:
               requests:
                 cpu: "10m"
                 memory: "150Mi"
               limits:
                 cpu: "50m"
                 memory: "400Mi"
             imagePullPolicy: Always
         restartPolicy: Always
        

---

The output of the kubectl get all command below confirms this.


    canaryDeployment % kubectl get all
    NAME                                         READY   STATUS    RESTARTS   AGE
    pod/mynginx-v1-deployment-55fcf9bd4c-7vqn5   1/1     Running   0          101s
    pod/mynginx-v1-deployment-55fcf9bd4c-8lcgt   1/1     Running   0          100s
    pod/mynginx-v1-deployment-55fcf9bd4c-b48gh   1/1     Running   0          100s
    pod/mynginx-v1-deployment-55fcf9bd4c-bsl99   1/1     Running   0          100s
    pod/mynginx-v1-deployment-55fcf9bd4c-dwg9l   1/1     Running   0          100s
    pod/mynginx-v1-deployment-55fcf9bd4c-m8qfl   1/1     Running   0          100s
    pod/mynginx-v1-deployment-55fcf9bd4c-qc6l2   1/1     Running   0          100s
    pod/mynginx-v1-deployment-55fcf9bd4c-s5nvt   1/1     Running   0          100s
    pod/mynginx-v1-deployment-55fcf9bd4c-sgm78   1/1     Running   0          100s
    pod/mynginx-v1-deployment-55fcf9bd4c-zgdzx   1/1     Running   0          100s
    
    NAME                      TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)        AGE
    service/kubernetes        ClusterIP      10.16.0.1     <none>         443/TCP        17m
    service/mynginx-service   LoadBalancer   10.16.7.155   104.199.38.8   80:30780/TCP   43s
    
    NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/mynginx-v1-deployment   10/10   10           10          102s
    
    NAME                                               DESIRED   CURRENT   READY   AGE
    replicaset.apps/mynginx-v1-deployment-55fcf9bd4c   10        10        10      102s
    
    

To release a new version (v2.0) of our custom Nginx image using a canary deployment strategy, we begin by creating a new deployment file. This creates a second Deployment object in the Kubernetes cluster, while the load balancer service stays the same. The following is the YAML file we created for v2.0.

---


    apiVersion: apps/v1
    kind: Deployment
    metadata:
     name: mynginx-v2-deployment
     labels:
       app: mynginx
    spec:
     replicas: 1
     selector:
       matchLabels:
         app: mynginx 
     template:
       metadata:
         labels:
           app: mynginx
       spec:
         containers:
           - name: mynginx-v2
             image: sumeetninawe/mynginx:v2.0
             resources:
               requests:
                 cpu: "10m"
                 memory: "150Mi"
               limits:
                 cpu: "50m"
                 memory: "400Mi"
             imagePullPolicy: Always
         restartPolicy: Always

---

As a first step, we intend to replace 10% of the pods. Thus, we create a single replica of v2.0 and reduce the corresponding replica count of v1.0 deployment to 9. The desired future state is represented below:

10% pod replacement

Next, we apply both deployment YAMLs with kubectl (a sketch of the command follows). The output below reflects the corresponding pod deployments: nine v1.0 pods and one v2.0 pod.
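
Assuming the two manifests are saved as mynginx-v1.yaml and mynginx-v2.yaml (illustrative file names), the apply step looks like this:

    kubectl apply -f mynginx-v1.yaml -f mynginx-v2.yaml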


    canaryDeployment % kubectl get all
    NAME                                         READY   STATUS    RESTARTS   AGE
    pod/mynginx-v1-deployment-55fcf9bd4c-7vqn5   1/1     Running   0          8m20s
    pod/mynginx-v1-deployment-55fcf9bd4c-8lcgt   1/1     Running   0          8m19s
    pod/mynginx-v1-deployment-55fcf9bd4c-bsl99   1/1     Running   0          8m19s
    pod/mynginx-v1-deployment-55fcf9bd4c-dwg9l   1/1     Running   0          8m19s
    pod/mynginx-v1-deployment-55fcf9bd4c-m8qfl   1/1     Running   0          8m19s
    pod/mynginx-v1-deployment-55fcf9bd4c-qc6l2   1/1     Running   0          8m19s
    pod/mynginx-v1-deployment-55fcf9bd4c-s5nvt   1/1     Running   0          8m19s
    pod/mynginx-v1-deployment-55fcf9bd4c-sgm78   1/1     Running   0          8m19s
    pod/mynginx-v1-deployment-55fcf9bd4c-zgdzx   1/1     Running   0          8m19s
    pod/mynginx-v2-deployment-5d5f948fb7-m6xgk   1/1     Running   0          18s
    
    NAME                      TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)        AGE
    service/kubernetes        ClusterIP      10.16.0.1     <none>         443/TCP        23m
    service/mynginx-service   LoadBalancer   10.16.7.155   104.199.38.8   80:30780/TCP   7m22s
    
    NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/mynginx-v1-deployment   9/9     9            9           8m21s
    deployment.apps/mynginx-v2-deployment   1/1     1            1           19s
    
    NAME                                               DESIRED   CURRENT   READY   AGE
    replicaset.apps/mynginx-v1-deployment-55fcf9bd4c   9         9         9       8m21s
    replicaset.apps/mynginx-v2-deployment-5d5f948fb7   1         1         1       19s

If we continuously refresh our web page, we will see that roughly 10% of the requests are answered by v2.0. That confirms we have successfully deployed and tested the first step of our canary rollout.

The curl output below confirms this.


    canaryDeployment % for ((i=1;i<=100;i++)); do   curl "104.199.38.8"; sleep .5; echo; done
    <H1>Hello World - V1.0<H1>
    <H1>Hello World - V1.0<H1>
    <H1>Hello World - V1.0<H1>
    <H1>Hello World - V1.0<H1>
    <H1>Hello World - V1.0<H1>
    <H1>Hello World - V1.0<H1>
    <H1>Hello World - V1.0<H1>
    <H1>Hello World - V2.0<H1>
    <H1>Hello World - V2.0<H1>
    <H1>Hello World - V1.0<H1>
    <H1>Hello World - V1.0<H1>
    <H1>Hello World - V1.0<H1>

From here on, we can gradually increase the number of v2.0 pods and decrease the corresponding number of v1.0 pods. The image below summarizes the steps until all the pods are replaced.

Gradual increment in the number of v2.0 pods, until full replacement.

The output below confirms the final state. At this point, any request that reaches our custom Nginx service is always served by v2.0.


    canaryDeployment % kubectl get all
    NAME                                         READY   STATUS    RESTARTS   AGE
    pod/mynginx-v2-deployment-5d5f948fb7-2m724   1/1     Running   0          24s
    pod/mynginx-v2-deployment-5d5f948fb7-85grg   1/1     Running   0          24s
    pod/mynginx-v2-deployment-5d5f948fb7-blhx9   1/1     Running   0          24s
    pod/mynginx-v2-deployment-5d5f948fb7-gps9l   1/1     Running   0          24s
    pod/mynginx-v2-deployment-5d5f948fb7-gzzgh   1/1     Running   0          24s
    pod/mynginx-v2-deployment-5d5f948fb7-lzpm6   1/1     Running   0          24s
    pod/mynginx-v2-deployment-5d5f948fb7-m6xgk   1/1     Running   0          13m
    pod/mynginx-v2-deployment-5d5f948fb7-nhrxf   1/1     Running   0          24s
    pod/mynginx-v2-deployment-5d5f948fb7-rfdfw   1/1     Running   0          24s
    pod/mynginx-v2-deployment-5d5f948fb7-zfn4k   1/1     Running   0          24s
    
    NAME                      TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)        AGE
    service/kubernetes        ClusterIP      10.16.0.1     <none>         443/TCP        36m
    service/mynginx-service   LoadBalancer   10.16.7.155   104.199.38.8   80:30780/TCP   20m
    
    NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/mynginx-v1-deployment   0/0     0            0           21m
    deployment.apps/mynginx-v2-deployment   10/10   10           10          13m
    
    NAME                                               DESIRED   CURRENT   READY   AGE
    replicaset.apps/mynginx-v1-deployment-55fcf9bd4c   0         0         0       21m
    replicaset.apps/mynginx-v2-deployment-5d5f948fb7   10        10        10      13m
    
    

How to rollback changes

At any step, if the tests fail or the desired results are not achieved, rolling back the deployment is quick and easy. Simply changing the number of replicas in the v1.0 deployment YAML back to the original value (10) and deleting the v2.0 deployment restores the pods to the previous stable version.
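
Using the deployments from this demo, the same rollback can also be performed directly with kubectl instead of re-applying edited YAML files; the commands below are a sketch of that approach:

    # Restore the stable version to its original capacity
    kubectl scale deployment mynginx-v1-deployment --replicas=10

    # Remove the canary deployment entirely
    kubectl delete deployment mynginx-v2-deployment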

With canary deployments, rollbacks are very safe as it is possible to roll back at every step described above. For example, if things go wrong in the first step, where the impact is controlled or negligible, we can delete the v2.0 deployment and reinstate the number of replicas for v1.0. 

Thus we get to test the new release in production with no downtime before we decide to replace all the replicas with new application versions.

How to test and monitor canary deployments 

In the example above, we replaced pods and let the load balancer distribute traffic according to its configured routing behavior. This effectively randomizes which version serves any given request, so tracking or targeting specific requests is not very efficient.

To test the canary deployment predictably, the way we can with blue-green deployments, we can route traffic based on various parameters of the incoming requests.

These request parameters typically provide information about the origin of a request, such as the region it comes from, based on geolocation data or IP information. They also help categorize users, which is a great asset when targeting such requests at the canary pods.

Organizations can mark and identify incoming requests from users in the UAT group. The load balancer configuration is then set up to route these targeted requests to the v2.0 pods alone. If any feedback requires a rollback, a simple change in the configuration files is all that is needed.
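
As one concrete example of this kind of routing, the ingress-nginx controller supports canary annotations that send only requests carrying a specific header to the canary backend. The sketch below is illustrative and assumes an Ingress-based setup with a separate Service selecting only the v2.0 pods; the host, service name, and header are assumptions, and the demo above used a plain LoadBalancer Service instead:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: mynginx-canary-ingress
      annotations:
        nginx.ingress.kubernetes.io/canary: "true"
        # requests that send this header with the value "always" are routed to the canary
        nginx.ingress.kubernetes.io/canary-by-header: "X-UAT-User"
    spec:
      rules:
        - host: myapp.example.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: mynginx-v2-service    # assumed Service selecting only v2.0 pods
                    port:
                      number: 80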

Additionally, these request parameters are helpful for feeding data to monitoring systems like Prometheus to track and detect any negative impact on the end-to-end system.

Conclusion

Canary deployment is the best way of rolling out new features in a SaaS environment. The approach involves introducing the new application image step-by-step in the production environment. 

Starting at a small scale minimizes the probability of failure and does not require system downtime. Tweaking the load balancer settings to direct a specific portion of traffic to the new version helps us correctly target and coordinate the testing and monitoring efforts.

Also, since we utilize the existing infrastructure to roll out new features, there is no additional infrastructure cost.

Integrating a canary deployment strategy with monitoring systems to identify, collect, and analyze release metrics may involve an additional learning curve. But once set up, it can give crucial insights about the new version without harming the business.

