Considerations for Successful Rightsizing in Cloud
Lately there has been a fair amount of organizational angst about enjoying all the expected cloud benefits vis-à-vis an unexpected surge in monthly cloud bills. Organizations are unsure of what hit them. A combination of factors contributes to cloud overspending: unused or underutilized resources, lack of commitments, ungoverned costs, unanticipated usage and so on. In the latest Flexera study[1], organizations estimated that 32% of their cloud spending is wasted. You might have come across plenty of resources and people swearing by ‘rightsizing’ as a ‘quick win’ to optimize cost and eliminate wasteful spending. It’s not quite as quick a win as you’d want. There are multiple risks and considerations associated with a successful rightsizing exercise:
Tracking the right metrics
To identify opportunities to rightsize, the first step is to monitor and analyze current use of services and resources to gain insight into usage and performance. By leveraging cloud-native tools such as Amazon CloudWatch, AWS Cost Explorer and AWS usage reports, organizations can gather sufficient data on key metrics like vCPU utilization, memory utilization, network utilization and ephemeral disk usage over a period of at least four weeks.
If the average CPU utilization has been less than 1% and network I/O has been less than 5 MB for the past 2-4 weeks, the instance is considered idle. The recommended action is to terminate such unused resources and also check for any volumes, snapshots or Elastic IP addresses attached or linked to them. When identifying underutilized or overutilized resources, it is a best practice to analyze 95th or 99th percentile usage rather than peak usage alone, so that false peaks caused by backups, OS updates and the like do not skew the picture. If the 95th percentile of CPU utilization over a 4-week period is less than 40%, the instance is considered underutilized and is a good candidate for rightsizing.
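The thresholds above can be sketched as a small classifier. This is only an illustration: the function names, the nearest-rank percentile method and the reading of "network I/O" as the total over the observation window are assumptions, not something the sources prescribe. The samples are assumed to have been exported already from a monitoring tool.

```python
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    # Smallest rank such that at least pct% of samples are at or below it.
    rank = max(1, (len(ordered) * pct + 99) // 100)
    return ordered[rank - 1]


def classify_instance(cpu_pct, network_mb,
                      idle_cpu=1.0, idle_network_mb=5.0,
                      underutilized_p95=40.0):
    """Label an instance from CPU (%) and network (MB) samples.

    Assumption: network_mb holds per-sample traffic, and the idle test
    compares the total over the whole window against idle_network_mb.
    """
    if mean(cpu_pct) < idle_cpu and sum(network_mb) < idle_network_mb:
        return "idle"
    if percentile(cpu_pct, 95) < underutilized_p95:
        return "underutilized"
    return "keep"


if __name__ == "__main__":
    # Mostly quiet instance with a few backup-time spikes.
    cpu = [20.0] * 95 + [90.0] * 5
    net = [50.0] * 100
    print(classify_instance(cpu, net))
```

Because the spikes make up only 5% of the samples, the 95th percentile stays at the baseline and the instance is still flagged, which is the point of not judging by peak usage.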
Selecting the right instance type and families
Checklist for rightsizing an instance (EC2/RDS) by migrating it within or to a different instance family[2]:
Make sure the current instance type and the new instance type are compatible in terms of:
- Virtualization type – The instances must have the same Linux AMI virtualization type (PV AMI versus HVM) and platform (EC2-Classic versus EC2-VPC)
- Network – Some instances are not supported in EC2-Classic and must be launched in a virtual private cloud (VPC)
- Platform – If your current instance type supports 32-bit AMIs, make sure to select a new instance type that also supports 32-bit AMIs (not all EC2 instance types do)
When you resize an EC2 instance, the resized instance usually has the same number of instance store volumes that you specified when you launched the original instance. You cannot attach instance store volumes to an instance after you’ve launched it, so if you want to add instance store volumes, you will need to migrate to a new instance type that provides the higher number of volumes.
Database instances can be rightsized or scaled by adjusting memory or compute power up or down as per performance and capacity requirements. Consider the following factors when scaling a database instance:
- Since storage and instance type are independent, scaling a database instance up or down does not affect its storage size
- To increase the allocated storage space or improve the performance of a database instance, it is recommended to change the storage type separately
- Before you scale, make sure you have the correct licensing in place for commercial engines (SQL Server, Oracle), especially if you Bring Your Own License (BYOL)
- Determine the maintenance window specified for the instance, so you know when the change will be applied
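For the database case, a minimal sketch of checking the maintenance window and then scheduling an instance class change might look like this. It assumes `rds` is a boto3 RDS client (for example `boto3.client("rds")`); the helper names and the identifiers in the usage comment are hypothetical.

```python
def maintenance_window(rds, db_instance_id):
    """Return the preferred maintenance window of a DB instance."""
    response = rds.describe_db_instances(DBInstanceIdentifier=db_instance_id)
    return response["DBInstances"][0]["PreferredMaintenanceWindow"]


def schedule_db_resize(rds, db_instance_id, new_class, apply_immediately=False):
    """Request a new instance class for a DB instance.

    With apply_immediately=False the change is deferred to the next
    maintenance window instead of being applied right away.
    """
    return rds.modify_db_instance(
        DBInstanceIdentifier=db_instance_id,
        DBInstanceClass=new_class,
        ApplyImmediately=apply_immediately,
    )


# Usage (hypothetical identifiers, requires AWS credentials):
#   import boto3
#   rds = boto3.client("rds")
#   print(maintenance_window(rds, "orders-db"))
#   schedule_db_resize(rds, "orders-db", "db.m5.large")
```

Only the instance class is passed, which matches the point that storage is changed separately.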
Acceptable downtime
Once we have all the recommendations for rightsizing instance types, it’s time to begin the implementation phase. To resize an existing EBS-backed instance, you have to stop it, which means planning for downtime. It is essential to thoroughly analyze the mission criticality and acceptable downtime of your workloads.
For workloads where a few minutes of downtime is not a big deal, the simplest method is to announce the scheduled downtime beforehand and perform the change at night or whenever traffic is low, accepting a brief outage while the instance restarts. If you have several instances behind a load balancer, changing them one at a time avoids an application outage.
For mission-critical workloads, the best way to achieve a zero-downtime change is a blue/green deployment. This consists of creating a new instance of the optimal type/family, preparing it for production, switching traffic over to it, then terminating the old instance. If you don’t have a source script or custom AMI set up, launching a new instance can take a bit of time. One technique is to create an image of your running instance and spin up a new instance from that image. Once the new instance is up and running, the final step is to switch traffic over by changing the association of your Elastic IP address.
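As a rough sketch of the stop, change type, start sequence for an EBS-backed instance: the function below assumes `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`), and the function name and the identifiers in the usage comment are hypothetical.

```python
def resize_instance(ec2, instance_id, new_type):
    """Stop an EBS-backed instance, change its type, and start it again."""
    # The instance type can only be changed while the instance is stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )

    # Downtime ends once the instance is running again.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])


# Usage (hypothetical IDs, requires AWS credentials):
#   import boto3
#   resize_instance(boto3.client("ec2"), "i-0123456789abcdef0", "t3.medium")
```

The time between the two waiters is the downtime window to plan for.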
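The image-and-swap technique could be sketched as follows. It assumes `ec2` is a boto3 EC2 client and that the Elastic IP is identified by its allocation ID; the function name and all identifiers are hypothetical. Networking options such as subnet and security groups are omitted, and terminating the old instance is deliberately left to the caller until the new one has been verified.

```python
def blue_green_swap(ec2, old_instance_id, new_type, allocation_id, image_name):
    """Launch a right-sized copy of an instance and move the Elastic IP to it."""
    # 1. Image the running instance. NoReboot=True avoids downtime, at the
    #    cost of no guarantee about file system integrity in the image.
    image_id = ec2.create_image(
        InstanceId=old_instance_id, Name=image_name, NoReboot=True
    )["ImageId"]
    ec2.get_waiter("image_available").wait(ImageIds=[image_id])

    # 2. Launch the new ("green") instance from that image.
    launched = ec2.run_instances(
        ImageId=image_id, InstanceType=new_type, MinCount=1, MaxCount=1
    )
    new_instance_id = launched["Instances"][0]["InstanceId"]
    ec2.get_waiter("instance_running").wait(InstanceIds=[new_instance_id])

    # 3. Switch traffic by re-associating the Elastic IP.
    ec2.associate_address(
        AllocationId=allocation_id,
        InstanceId=new_instance_id,
        AllowReassociation=True,
    )
    return new_instance_id


# After validating the new instance, the old one can be removed with
# ec2.terminate_instances(InstanceIds=[old_instance_id]).
```

Keeping the old instance around until validation also gives a quick rollback path: re-associate the Elastic IP with it.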
Risk considerations
When people say ‘just’ rightsize or optimize the overprovisioned resources to save some costs, they forget to highlight the most significant risk: application outage. Let’s assume you are advised to downsize your instance to fewer vCPUs to save some costs, but you forget to consider whether the smaller instance offers enough throughput to read from and write to disk. If it does not, database reads and writes come under heavy pressure, resulting in a backlog of transactions on the website, which does the most damage during peak hours. Another risk is a slowdown in application performance due to the reduction in computing resources, which can be difficult for engineers to catch and leaves end users irritated and disappointed.
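One way to guard against this is to check every dimension of a candidate instance against observed demand, not just vCPUs. The sketch below is illustrative only: the `Capacity` fields, the 20% headroom and the numbers in the example are assumptions, and real figures would come from your monitoring data and the provider's published instance specifications.

```python
from dataclasses import dataclass


@dataclass
class Capacity:
    vcpus: float
    memory_gib: float
    disk_throughput_mbps: float


def downsizing_risks(p95_demand, candidate, headroom=0.2):
    """List the dimensions where the candidate cannot cover
    95th percentile demand plus the given headroom."""
    risks = []
    for field in ("vcpus", "memory_gib", "disk_throughput_mbps"):
        needed = getattr(p95_demand, field) * (1 + headroom)
        if getattr(candidate, field) < needed:
            risks.append(field)
    return risks


if __name__ == "__main__":
    # Made-up numbers: CPU and memory fit, disk throughput does not.
    demand = Capacity(vcpus=1.5, memory_gib=6.0, disk_throughput_mbps=200.0)
    smaller = Capacity(vcpus=2, memory_gib=8.0, disk_throughput_mbps=150.0)
    print(downsizing_risks(demand, smaller))
```

A recommendation that looks safe on CPU alone would be rejected here because the disk throughput dimension falls short.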
Final Thoughts
The misplaced trade-off between outage (and/or application performance) and savings is a constant reminder to FinOps teams to devise a holistic approach to cost optimization. Since rightsizing is a continuous process, a specific set of policies is needed to get better recommendations. With a plethora of cloud cost management tools (cloud-native as well as SaaS) promising automated rightsizing, it’s important that we understand how their automation algorithms work. A few things to look for when selecting an automated recommendation solution are the workload observation window, how workload peaks and valleys are handled, the freedom to adjust risk tolerance by workload type, and support for user-defined constraints.
In conclusion, rightsizing is definitely a win (though it might not be a quick one) if strategized meticulously.
[1] Trends in Cloud Computing: 2022 State of the Cloud Report | Flexera Blog
[2] Tips for Right Sizing - Right Sizing: Provisioning Instances to Match Workloads (amazon.com)
Disclaimer: The views reflected in this blog are those of the author and do not necessarily reflect the views of PWC or its member firms