Incorrectly sized EC2 instances can quickly lead to unnecessary expenses. Rightsizing is critical to optimizing both cost and stability.

However, rightsizing a single EC2 instance is one thing — everything gets much more complicated if you want to rightsize within Auto Scaling Groups. Most sources of rightsizing recommendations overlook this area — yet it is a huge portion of your compute cost.

We’ll break down the challenges and share exactly what you need to know to do it right.

What is ASG rightsizing?

An AWS EC2 Auto Scaling Group (ASG) is a collection of EC2 instances whose size is adjusted automatically to maintain performance and meet demand. Users define which instance types to use, how many instances to run (minimum, maximum, and desired capacity), and which scaling policies to follow.

ASG rightsizing refers to optimizing the resources allocated to EC2 instances within an Auto Scaling Group (ASG) by changing an ASG’s instance type or types.

Why is ASG rightsizing so hard to do right?

Firstly, even rightsizing individual EC2 instances can be challenging. Plenty of rightsizing recommendations are available through tools such as AWS Trusted Advisor, yet they are often not accurate enough to match your workload. Reliable recommendations require granular historical data on CPU and memory utilization, network bandwidth, volume size, and more. This data is difficult to collect and analyze, and there isn’t a single source for all the metrics you need. (You can learn how to collect the correct data for EC2 rightsizing through CloudWatch or Datadog in this guide.)

To make matters worse, rightsizing instances in an ASG is far more complicated than rightsizing a standalone EC2 instance. EC2 instances that are part of an ASG should be rightsized together, NOT individually as you normally would.

Because ASGs are dynamic, instances come and go over time and may have different metric distributions. For example, CPU utilization may be 60% on one instance and only 50% on another. We must have a comprehensive understanding of all of these instances, including terminated ones, and of how they work as a whole in order to rightsize effectively.

Example of ASG metrics from the AWS console.

As a result, there are several additional variables you must consider when rightsizing EC2 instances that are part of an Auto Scaling Group.

#1: ASGs scale up and down

By definition, ASGs are groups of instances that are not stable but scale up and down. Usage may spike during specific periods (such as Friday evenings or business hours). And although spikes on instances from the same ASG will be correlated, they can differ in size and happen at different times. Instances launched in response to increased workload are eventually terminated once peak traffic subsides and the group scales back in.

To rightsize instances effectively, it’s important to consider various metrics such as CPU usage, memory usage, and network usage. In Auto Scaling Groups, these metrics are not uniform across all instances, and some terminated instances may have had higher or lower utilization than the ones still running. These factors all add a huge amount of complexity to the calculations.

#2: Tracking the right data is complicated

To make reliable rightsizing recommendations, we need to (1) account for ALL of the instances that belonged to a given ASG, both short-lived and long-lived, (2) track all of those instances’ metrics over time, and (3) group this data together to analyze min and max resource consumption at an aggregate level.
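
As a rough illustration of step (3), here is a minimal Python sketch that groups metric samples by ASG and computes the min and max per metric. The `Sample` record, the metric names, and the `web-asg` data are hypothetical and chosen only for the example; real samples would come from your monitoring solution.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One metric datapoint for one instance (hypothetical record shape)."""
    asg: str          # Auto Scaling Group the instance belonged to
    instance_id: str  # may refer to an instance that is already terminated
    metric: str       # e.g. "cpu_pct" or "mem_pct"
    value: float


def aggregate_by_asg(samples):
    """Return {asg: {metric: {"min": ..., "max": ..., "instances": ...}}}."""
    values = defaultdict(lambda: defaultdict(list))
    instances = defaultdict(set)
    for s in samples:
        values[s.asg][s.metric].append(s.value)
        instances[s.asg].add(s.instance_id)
    return {
        asg: {
            metric: {
                "min": min(v),
                "max": max(v),
                # number of distinct instances seen in this ASG overall
                "instances": len(instances[asg]),
            }
            for metric, v in metrics.items()
        }
        for asg, metrics in values.items()
    }


samples = [
    Sample("web-asg", "i-aaa", "cpu_pct", 60.0),
    Sample("web-asg", "i-bbb", "cpu_pct", 50.0),
    Sample("web-asg", "i-ccc", "cpu_pct", 85.0),  # short-lived, now terminated
    Sample("web-asg", "i-aaa", "mem_pct", 40.0),
]
summary = aggregate_by_asg(samples)
print(summary["web-asg"]["cpu_pct"])
```

Note that the short-lived instance sets the group’s CPU maximum here, which is why leaving terminated instances out of the data can skew a recommendation.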

Now-terminated instances that ran for only a short period of time are hard to keep track of and require a monitoring agent. Your Datadog, CloudWatch, or other monitoring solution will only keep track of terminated instances for a certain amount of time, which may be too short. And you had better have tagged them properly, or it will take you hours to find those terminated instances and get the data you need.
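
To show why tagging matters, here is a small Python sketch that groups an instance inventory by ASG. It relies on the `aws:autoscaling:groupName` tag that AWS adds to instances launched by an Auto Scaling Group. The inventory itself is made up, and the sketch assumes you have been saving instance metadata (shaped like entries in an EC2 DescribeInstances response) while the instances were still alive, since terminated instances drop out of the API shortly after termination.

```python
from collections import defaultdict

# Tag that AWS adds automatically to instances launched by an ASG.
ASG_TAG = "aws:autoscaling:groupName"


def group_instances_by_asg(instances):
    """Map ASG name -> list of instance IDs, keeping terminated instances.

    `instances` is assumed to be a list of dicts with "InstanceId" and
    "Tags" keys, as in an EC2 DescribeInstances response.
    """
    groups = defaultdict(list)
    for inst in instances:
        tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
        asg = tags.get(ASG_TAG)
        if asg is not None:  # skip instances that are not part of any ASG
            groups[asg].append(inst["InstanceId"])
    return dict(groups)


# Hypothetical saved inventory, including one terminated instance.
inventory = [
    {"InstanceId": "i-aaa", "State": {"Name": "running"},
     "Tags": [{"Key": ASG_TAG, "Value": "web-asg"}]},
    {"InstanceId": "i-bbb", "State": {"Name": "running"}, "Tags": []},
    {"InstanceId": "i-ccc", "State": {"Name": "terminated"},
     "Tags": [{"Key": ASG_TAG, "Value": "web-asg"}]},
]
print(group_instances_by_asg(inventory))
```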

#3: Mistakes are costly

Given that you may have many ASGs whose instances you need to consider separately, this is a complex, multivariate problem in which it’s easy to track the wrong metrics, fail to account for a variable, or make a mistake in a calculation. All of this has to be done for a large number of instances, over at least the last 10 days, including instances that scaled up in the middle of the night.

And if you make just one mistake and act on an unreliable recommendation, this may result in problems when the instances reappear — affecting the performance and stability of your workload.

#4: Mixed Instance Policies

This becomes even more complicated when your ASG has different instance types. Different instance types may require different rightsizing checks, more preprocessing, and more attention — it’s easy to make a mistake here.

For example, say that we have a t3.nano, which has 2 vCPUs and 0.5 GiB of memory. We might also have an m5.8xlarge, which has 32 vCPUs and 128 GiB of memory, as part of the same Auto Scaling Group. Developers need to aggregate and assess the data from both to provide recommendations. For instance, it might turn out that the m5 instances are using more of their CPU and memory, while the t3.nano instances are underutilized. As a result, each instance type has to be rightsized on its own terms to reach the right balance.
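
One way to compare such different instance types is to convert percentage utilization into absolute units before aggregating. The Python sketch below does this for the two instance types in the example, using their published specs (t3.nano: 2 vCPUs, 0.5 GiB; m5.8xlarge: 32 vCPUs, 128 GiB). The utilization figures are invented, and the sketch ignores T3 burst credits, so treat it as an illustration rather than a complete method.

```python
# Published specs for the two instance types in the example.
SPECS = {
    "t3.nano": {"vcpu": 2, "mem_gib": 0.5},
    "m5.8xlarge": {"vcpu": 32, "mem_gib": 128},
}


def absolute_usage(instance_type, cpu_pct, mem_pct):
    """Convert percentage utilization into vCPUs and GiB actually used."""
    spec = SPECS[instance_type]
    return {
        "vcpu_used": spec["vcpu"] * cpu_pct / 100,
        "mem_gib_used": spec["mem_gib"] * mem_pct / 100,
    }


# Hypothetical peak utilization figures for one instance of each type.
small = absolute_usage("t3.nano", cpu_pct=10, mem_pct=50)
large = absolute_usage("m5.8xlarge", cpu_pct=75, mem_pct=50)
print(small)
print(large)
```

The same 50% memory utilization means 0.25 GiB on the t3.nano and 64 GiB on the m5.8xlarge, so percentages alone cannot be averaged across a mixed-instance ASG.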

A mixed instance policy ASG can combine ephemeral and non-ephemeral instance types. In that situation, you may want to consider disk usage when rightsizing the ephemeral instances, even though it may not matter for the other instance types in your ASG.

The upshot is that it can be difficult to ensure you’re taking all relevant metrics into account, and easy to factor in metrics that are not important, either of which can render your recommendations unreliable.

nOps Makes ASG Rightsizing Simple and Seamless

If you find the above to be a nightmare, you’re not alone. Tracking all of your instances, finding the right data, performing the right calculations, and accounting for all of the possible variables in a mixed-instance ASG is almost impossible to do manually.

That’s why nOps integrates with the two industry-leading monitoring solutions, AWS CloudWatch and Datadog, for effortless rightsizing savings. We automatically analyze every EC2 instance in your environment (including short-lived ones) and pull their metadata to group them into their respective ASGs, analyzing min and max resource consumption at an aggregate level to provide cost-saving recommendations.

Continuous coverage of resource-level insights such as memory, CPU, network bandwidth, and storage is fed through nOps’s state-of-the-art ML engine for the best rightsizing recommendations available on the market.

 

Rightsize with nOps for:

The most trustworthy rightsizing recommendations. Because nOps automatically collects and analyzes highly granular data, recommendations are 100% accurate and reliable — so engineers can act on them with the utmost confidence that workloads won’t be disturbed.

Up to 50% in immediate cost savings. When engineers don’t act on rightsizing recommendations, underutilized and idle resources continue to drive unnecessary AWS costs. nOps makes it completely pain-free, safe, and effortless for engineers to actually act on recommendations and start saving.

How it works

  1. nOps integrates with your CloudWatch, CloudWatch Agent or Datadog to collect all of the metrics needed for ASG rightsizing recommendations, based on your last 10+ days of usage. Our API queries your data every 24 hours.
  2. We quickly and efficiently process huge amounts of your CloudWatch data, cross-referenced with AWS EC2 metadata and the latest AWS On-Demand pricing data, to keep track of all of your ASGs (including terminated instances). These three sources are combined and fed through a Rightsizing Engine, allowing us to understand your dynamic ASGs holistically.
  3. For each ASG, each of your instances is analyzed taking all relevant info into account, such as the metrics necessary for your particular operating system. For each instance in your environment, we make the following calculations:
    • Max Disk usage
    • Max Network usage
    • Max RAM utilization 
    • Max CPU utilization

      For each instance, our rightsizing algorithm compares maximum recorded usage against the capacity of a lower instance type, multiplied by a threshold value that accounts for potential future usage spikes. nOps takes into account the aggregate performance and utilization metrics of all instances within an ASG to make informed recommendations.
  4. If all of the instances are rightsizable, the whole ASG is rightsizable. If you have several instance types, they can be analyzed and rightsized separately. 
  5. These rightsizing recommendations are then pushed to nOps microservices, which are responsible for showing recommendations from the nOps platform on the UI.
View your rightsizing savings in the nOps dashboard

  6. Every 24 hours the process runs from top to bottom.
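
The comparison described in steps 3 and 4 can be sketched roughly as follows. This is not nOps’s actual algorithm, only a minimal Python illustration of the idea: an instance fits a smaller type if its peak usage stays within that type’s capacity multiplied by a threshold, and an ASG is rightsizable only if every instance fits. The 0.8 threshold, the function names, and the sample peaks are assumptions; the m5.xlarge capacity (4 vCPUs, 16 GiB) is its published spec.

```python
HEADROOM = 0.8  # assumed threshold: use at most 80% of the smaller type


def instance_fits(peak, smaller_capacity, headroom=HEADROOM):
    """True if every peak metric is within headroom * capacity of the smaller type."""
    return all(
        peak[metric] <= limit * headroom
        for metric, limit in smaller_capacity.items()
    )


def asg_is_rightsizable(peaks_by_instance, smaller_capacity, headroom=HEADROOM):
    """An ASG qualifies only if it has data and every instance fits."""
    return bool(peaks_by_instance) and all(
        instance_fits(peak, smaller_capacity, headroom)
        for peak in peaks_by_instance.values()
    )


# Candidate smaller type: m5.xlarge (4 vCPUs, 16 GiB of memory).
m5_xlarge = {"vcpu": 4, "mem_gib": 16}

# Hypothetical peak usage per instance over the lookback window.
peaks = {
    "i-aaa": {"vcpu": 2.5, "mem_gib": 9.0},
    "i-bbb": {"vcpu": 3.5, "mem_gib": 8.0},  # CPU peak exceeds 4 * 0.8 = 3.2
}
print(instance_fits(peaks["i-aaa"], m5_xlarge))
print(asg_is_rightsizable(peaks, m5_xlarge))
```

Disk and network maximums could be added as further keys in both dictionaries; the check itself would not change.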

About nOps

Our mission is to make it easy for engineers to cost-optimize, so that they can focus on building and innovating. nOps was recently ranked #1 in G2’s cloud cost management category. 

Join our customers using nOps to cut cloud costs and leverage automation with complete confidence by booking a demo today!