SLOs for AWS-based infrastructure

July 8, 2020

Overview

In this article we will discuss managing complex infrastructure on AWS with an eye towards SLOs (Service Level Objectives). The focus will be on compute infrastructure; we will leave storage and networking for another day. There are many ways to discuss the management of infrastructure. We will use the lens of Kubernetes to compare and contrast compute infrastructure on AWS with Kubernetes.

How to Cloud?

There are two primary ways to use the cloud:

1. IaaS (Infrastructure as a Service)
2. All-in

With the IaaS approach you try to use the bare minimum of cloud services. On AWS, this means EC2, S3, IAM, and the networking services. Everything else you deploy and manage yourself.

With the All-in approach you commit to your cloud provider and use each and every service under the sun - data stores, security, CI/CD, you name it.

The IaaS approach buys you some flexibility. It is often the best path when you already have legacy infrastructure running on-prem. You are less locked in to your cloud provider, and it is easier to test your system locally. However, it comes with the responsibility to install, maintain, monitor, upgrade, and patch most of your stack yourself. This is a big deal.

The All-in approach buys you a lot of confidence that you are on the right path. You get to benefit from the hundreds of years of collective experience and management that your cloud provider brings. With a click of a button or a CLI command you can deploy, scale, and observe a plethora of high-end services, and the list keeps growing.

The All-in approach seems like a no-brainer initially. When you actually apply this strategy at scale you discover that there is a price for all this goodness. The price is, literally, the price: the cloud is expensive.

In addition, large cloud infrastructure is complicated, and you need an SRE/DevOps team with a lot of cloud expertise to benefit from it, and/or you pay even more for professional support.

In practice, there is almost always some hybrid unless you are very disciplined. For example, even if you choose the IaaS approach, it may be too tempting to launch a service like RDS temporarily and get a managed database. It may be just for prototyping, with the best intentions, but you often end up supporting this temporary solution for a long time, sometimes forever. From there, it becomes a slippery slope.

The other extreme is not always easy to maintain either. You want to use only AWS services, but then someone installs an open source project, it quickly becomes successful, and now you have to support it.

SLOs for AWS

When running traditional infrastructure you typically care about the common SLIs (Service Level Indicators):

- latency
- throughput
- error rate
- utilization
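
To make these concrete, here is a minimal sketch of checking an error-rate SLI against an availability SLO. The request counts and the 99.9% target are illustrative, not tied to any particular AWS service.

```python
# Hypothetical request counts over a rolling window (illustrative numbers).
total_requests = 1_250_000
failed_requests = 450

slo_target = 0.999                                 # 99.9% availability objective
sli = 1 - failed_requests / total_requests         # observed success ratio
error_budget = 1 - slo_target                      # allowed failure ratio
budget_consumed = (failed_requests / total_requests) / error_budget

print(f"SLI: {sli:.5f} (target {slo_target})")
print(f"Error budget consumed: {budget_consumed:.1%}")
```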

Check out my previous article, Using observability tools to set SLOs for Kubernetes Applications, for an in-depth discussion of SLOs in general and on Kubernetes in particular.

On AWS you still care about these, but the picture is much more complicated. There are many pieces already in place that you don't have to build; you just decide whether or not to use them.

The SLO of your application is built on top of the SLOs of the AWS services you use. Those SLOs can be difficult to ascertain because there are many ways to compose them.
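
One way to reason about that composition: for dependencies in the critical path, your best-case availability is roughly the product of their individual availabilities. The sketch below uses made-up numbers, not official AWS SLA figures.

```python
# Availability of an application in series with its dependencies is (at best)
# the product of the individual availabilities. Numbers here are illustrative.
dependencies = {
    "load_balancer": 0.9999,
    "compute": 0.9995,
    "database": 0.9995,
    "object_storage": 0.9999,
}

composite = 1.0
for name, availability in dependencies.items():
    composite *= availability

print(f"Best-case composite availability: {composite:.4%}")
# The result is lower than any single dependency, which is why the SLOs of the
# AWS services you compose bound the SLO you can realistically offer.
```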

For example, consider plain old blob storage like S3; there are various ways to utilize it:

- S3 Standard
- S3 Intelligent-Tiering
- S3 Standard-Infrequent Access
- S3 One Zone-Infrequent Access
- S3 Glacier
- S3 Glacier Deep Archive

Each one of these options comes with its own SLOs and tradeoffs, and there are easy ways to migrate your data from one class to another and/or store the hot and cold fractions of your data at different tiers.
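
One common way to move data between tiers automatically is an S3 lifecycle rule. Here is a minimal boto3 sketch; the bucket name, prefix, and transition ages are placeholders you would adapt to your own access patterns.

```python
import boto3

s3 = boto3.client("s3")

# Sketch of a lifecycle rule that tiers objects down as they age.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-old-objects",
                "Filter": {"Prefix": "logs/"},  # placeholder prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```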

Now, consider that S3 blob storage is just one part of the AWS data storage, access, and transfer story. There are also EBS, EFS, FSx, ElastiCache, RDS, Aurora, DynamoDB, DocumentDB, Neptune, Redshift, QLDB, Keyspaces, etc. Don't even get me started on the variety of services to transfer data between those services as well as external data sources.

Check out this link for a one-sentence explanation of each AWS service.

The paradox of choice is very real with AWS.

Luckily (or by design) AWS has strong observability capabilities.

Observability on AWS

To provide a service level, you must have a proper monitoring and observability posture. One of the greatest benefits of committing to the AWS way is that observability is deeply integrated with all AWS services. AWS CloudWatch is the gateway to AWS observability.

At its core, CloudWatch is a metrics repository, but under the CloudWatch umbrella AWS now centralizes all your observability needs, including:

- Logs collection and analysis
- Metrics from AWS services and custom metrics
- Alarms
- ServiceLens
- Container Insights
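
As a taste of how custom metrics and alarms fit into an SLO workflow, here is a small boto3 sketch that publishes a hypothetical latency metric and alarms when its average crosses a threshold. The namespace, metric name, and thresholds are illustrative, not anything AWS defines for you.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a custom latency metric for a hypothetical checkout service.
cloudwatch.put_metric_data(
    Namespace="MyApp",
    MetricData=[
        {
            "MetricName": "CheckoutLatency",
            "Dimensions": [{"Name": "Service", "Value": "checkout"}],
            "Value": 182.0,
            "Unit": "Milliseconds",
        }
    ],
)

# Alarm when average latency breaches the (illustrative) SLO threshold
# for 5 consecutive minutes.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-latency-slo",
    Namespace="MyApp",
    MetricName="CheckoutLatency",
    Dimensions=[{"Name": "Service", "Value": "checkout"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,
    Threshold=250.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```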

There is a whole other set of services for security and audit purposes. We will not get into it in this article.

However, even with CloudWatch the broad scope and complexity of AWS services are not easy to tame. In addition, there are other pitfalls to be aware of.

The Bane of AWS - Quotas and Limits

SLOs are about availability and performance. One of the unique challenges when using AWS is that each service and API comes with its own set of quotas and limits. If you are unaware, you will run into those limits and quotas at the worst time.

To ensure the availability of your applications and comply with your SLOs you must keep track of the quotas and limits of each AWS service that you use. This requires discipline and can be frustrating. A key element of the cloud is its elasticity and infinite capacity, but when you read the fine print and learn about those quotas and limits you realize that capacity planning is not a thing of the past. It just takes a different perspective.

The AWS Service Quotas console gives you a view of all of them.

There are many quotas for each service. For example, 68 different quotas just for EC2!

Some of the quotas can be adjusted and some are fixed.

AWS Lambda alone, for example, has 15 quotas associated with it.

We need a plan for dealing with quotas if we want to keep our systems up and running. The alternative is to be surprised when you unknowingly bump against a quota. Here are some reasonable steps (a small boto3 sketch follows the list):

1. Understand the quotas associated with each AWS service that you use
2. Identify quotas that are relevant for your use case
3. Set up alarms to warn you when you get close to one of the quotas
4. Adjust quotas if possible (requires support request and can take a few days)
5. Design around the problem if it's not possible to adjust the quota.
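
The boto3 sketch below illustrates steps 1, 2, and 4: listing the quotas for a service you depend on (EC2 here) and, commented out, requesting an increase for an adjustable one. The quota code is a placeholder; look it up in the listing first.

```python
import boto3

quotas = boto3.client("service-quotas")

# Steps 1-2: list the quotas for a service you depend on and note which
# ones are relevant to your workload (pagination elided for brevity).
response = quotas.list_service_quotas(ServiceCode="ec2", MaxResults=100)
for quota in response["Quotas"]:
    kind = "adjustable" if quota["Adjustable"] else "fixed"
    print(f'{quota["QuotaName"]}: {quota["Value"]} ({kind})')

# Step 4: request an increase for an adjustable quota.
# The quota code below is a placeholder -- take it from the listing above.
# quotas.request_service_quota_increase(
#     ServiceCode="ec2",
#     QuotaCode="L-XXXXXXXX",
#     DesiredValue=256,
# )
```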

At scale some quotas might force you to make major changes to your architecture. For example, you can have only 5 transit gateway attachments from your Virtual Private Cloud (VPC). This quota is not adjustable. If you need more, you will have to use multiple VPCs.

API rate limits are another obstacle you might run into. They can force you to decrease the frequency at which you hit certain AWS APIs and, as a result, reduce the fidelity of your applications, or to find creative solutions.
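
A common first line of defense against throttling is to let the SDK back off and retry. With boto3, for example, you can enable adaptive retries, which add client-side rate limiting on top of exponential backoff; the client and call below are purely illustrative.

```python
import boto3
from botocore.config import Config

# Retry throttled calls with backoff; "adaptive" also rate-limits the client.
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})

ec2 = boto3.client("ec2", config=retry_config)
response = ec2.describe_instances(MaxResults=50)
print(len(response.get("Reservations", [])), "reservations returned")
```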

Often, quotas and limits are per AWS account. If you reach a scale that requires more resources than allowed, your only option might be to switch to a multi-account architecture. This is far from trivial. I have done it twice in my career: the first time was switching from a single account to multiple accounts; the second time was building a multi-account architecture from scratch.

The Cost of ... Cost

When you start to use AWS services in abundance you realize that it's super easy to provision resources, either manually using the console or via the command line, SDKs, and APIs. But those resources aren't cheap. There is literally a price to pay. At a certain point, the cost of your AWS infrastructure will start to play a major role in your design decisions, in the processes you employ, and as a result also in the SLOs that you commit to. How much redundancy do you need? How much hot data do you keep? How often do you refresh your dashboards?

Eventually, cost may become its own Service Level Objective (SLO).

AWS provides a lot of tools for planning and managing your AWS cloud spending, including AWS Budgets, cost allocation tags, and Cost Explorer.

These tools can help you meet your cost SLOs without breaking the bank.
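
For example, a quick way to keep an eye on spend programmatically is the Cost Explorer API. The sketch below pulls the cost for a period grouped by service; the dates are illustrative.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

# Unblended cost grouped by service for an illustrative date range
# (the End date is exclusive).
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2020-07-01", "End": "2020-07-08"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    service = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{service}: ${amount:.2f}")
```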

Optimizing cost on AWS is a big and never-ending task. You have to be vigilant and track discounts, long-term commitment pricing, and changes to the base pricing of various services.

To be Continued...

In the second part of this blog, we will use the lens of Kubernetes to compare and contrast compute infrastructure on AWS with Kubernetes, and cover in detail setting SLOs for ECS, EKS, Fargate, and Lambda-based services.

Written By: Gigi Sayfan, author of "Mastering Kubernetes"
July 8, 2020

Tags: SRE, SLOs, Best Practices, Kubernetes