You’re a DevOps engineer, and your boss comes to you and says, “We need to buy Reserved Instances to cut our costs.” You do some googling to learn more about RIs, and you learn there are two types of Reserved Instances: Standard and Convertible. A Standard Reserved Instance provides a larger discount than a Convertible Reserved Instance, but it is “locked in” to a specific instance type and can’t be exchanged for a different type. Picking the instance type is an important decision because the cloud environment is always changing. With a Convertible Reserved Instance, if you end up reserving the wrong capacity, you can exchange it for a different configuration.
In addition to the two types of Reserved Instances, there are three payment options for buying RIs: No Upfront, Partial Upfront, and All Upfront. The higher the upfront payment, the greater the discount.
Furthermore, you need to figure out how many Reserved Instances to purchase. AWS Cost Explorer can give you some idea, based on your historical utilization, of how many Reserved Instances to buy. You look at Cost Explorer, you buy some RIs, and you are ready to call it a day. Your work here is done. But it turns out that it’s not as easy as just buying the RIs: you have to constantly monitor and adjust to make sure that you are utilizing the capacity you reserved. AWS applies RIs across all the accounts in your organization, so monitoring RI utilization is tricky. Here is a good explanation of how Reserved Instances are applied. As you can see from that AWS blog post, it’s pretty complex logic. Next thing you know, managing RIs becomes a full-time job!
At nOps, we are building a cloud data and automation platform – to help companies reduce costs and get the most from their cloud deployments. We speak with many customers, and managing RIs is one of their major pain points. So, based on customer feedback, we built a solution for DevOps and engineering managers to easily manage their RIs. We focused on:
- Making it easy for you to buy the right instances.
- Facilitating real-time decisions about RI purchases and convertible RI configurations. (nOps’ RI benefit is calculated hourly, while the data in AWS about coverage doesn’t become available until about 24 hours after the usage has occurred.)
- Helping you purchase the RIs across your organization to maximize savings.
In this blog post, we’ll walk through how we built the solution. (nOps also provides a solution to completely manage RIs for you – we’ll tell you about that in a subsequent blog post.)
Let’s get started!
How nOps Optimizes Your Reserved Instances
Solving this optimization problem presents several difficulties, especially in acquiring the right data. AWS CloudTrail provides near-real-time information on when you launch or terminate EC2 instances. However, CloudTrail doesn’t contain the information related to RIs. In particular, when an EC2 instance is created, the event alone does not supply all the key information, such as platform, that is required for calculating the RI benefit. Similarly, CloudTrail does not contain all the metadata for Reserved Instance purchases, and that metadata is mandatory for calculating the RI benefit in real time.
At the same time, CloudTrail events show only changes, such as creations or deletions of EC2 instances or RIs. To solve this data problem, the nOps platform leverages its data and messaging infrastructure to correlate and analyze events and resource metadata inline and in real time:
- We initialize by fetching all the metadata of the EC2 instances and RIs in the AWS accounts under the org and producing messages to our Kafka-based messaging core.
- In addition, we capture EC2 and RI creation and deletion events using a real-time CloudTrail event collector. The collector runs in AWS Lambda, with CloudWatch/S3 and CloudTrail, on the customer’s central management AWS account and is deployed using a CloudFormation template.
- Once the Lambda event collector is established, we collect any new EC2 or RI events and send them to nOps through our events-collector REST API.
- The nOps events landing zone (the REST API) runs as a scalable microservice on Amazon EKS and is accessible only from valid Lambda clients. The API validates the event message schema and produces the message to the Kafka topic.
- The difficulties in acquiring the necessary data are solved by the EC2 metadata collector. It consumes the CloudTrail events for changed resources and, using the details in each event, fetches the complete metadata of those resources (EC2 and RI) from the respective AWS account. It then combines that metadata with the details from the triggering event.
- The combined EC2 and RI metadata is enriched with normalization factor, order of normalization, platform, region, Availability Zone, instance family, and size, and is then produced to a separate Kafka topic. For details on this logic, check out AWS’s documentation on How Reserved Instances are applied.
- The above process enables us to combine historical data and real-time data. Still, there might be cases where events are missing and our metadata falls out of sync with the actual state of the AWS account. To overcome this, we run periodic reconciliation jobs that sync EC2 and RI metadata from customer AWS accounts.
- In this event-driven flow, a failure can leave the collector on the customer’s AWS account producing no events for the nOps events-collector API. To monitor for this situation, nOps has implemented a heartbeat that verifies that the Lambda event collector and the complete event pipeline are active and producing events.
- Once we put the architecture in place to collect the necessary metadata, the next step was to devise the actual RI calculation algorithm. The collected metadata is stored as time series in our RI database, and a calculation algorithm is then run on a time batch of the RI and EC2 data, producing the calculated RI benefits. This allows us to show you both the current and historical benefit of RI purchases.
- The nOps solution includes an optionally configured webhook that can trigger customer-automated optimization each time the calculated benefits change. The webhook payload can include the currently running normalized units, reserved units, platform, tenancy, region, and family details.
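To make the enrichment and calculation steps above more concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not the actual nOps algorithm: it treats every RI as a regional, size-flexible reservation and ignores zonal RIs, the order in which AWS applies reservations, and time batching. The `Resource` class, the `pool_key`, `normalized_units`, and `ri_benefit` helpers, and the report field names are all hypothetical. The normalization factors are the per-size values from AWS’s documentation on how Reserved Instances are applied.

```python
from collections import defaultdict
from dataclasses import dataclass

# Normalization factors by instance size, from AWS's published table
# (only a subset of sizes is listed here).
NORMALIZATION_FACTOR = {
    "nano": 0.25, "micro": 0.5, "small": 1, "medium": 2, "large": 4,
    "xlarge": 8, "2xlarge": 16, "4xlarge": 32, "8xlarge": 64,
}


@dataclass(frozen=True)
class Resource:
    """Hypothetical enriched record for a running instance or an RI."""
    instance_type: str  # e.g. "m5.xlarge"
    region: str
    platform: str
    tenancy: str
    count: int = 1


def pool_key(resource):
    """Group resources by the attributes an RI must match on."""
    family, _size = resource.instance_type.split(".")
    return (resource.region, resource.platform, resource.tenancy, family)


def normalized_units(resource):
    """Convert an instance type and count into normalized units."""
    _family, size = resource.instance_type.split(".")
    return NORMALIZATION_FACTOR[size] * resource.count


def ri_benefit(instances, reservations):
    """Compare running and reserved normalized units for each pool."""
    running = defaultdict(float)
    reserved = defaultdict(float)
    for inst in instances:
        running[pool_key(inst)] += normalized_units(inst)
    for ri in reservations:
        reserved[pool_key(ri)] += normalized_units(ri)

    report = {}
    for key in set(running) | set(reserved):
        run, res = running[key], reserved[key]
        covered = min(run, res)  # units that receive the RI discount
        report[key] = {
            "running_units": run,
            "reserved_units": res,
            "covered_units": covered,
            "unused_reserved_units": res - covered,
            "uncovered_running_units": run - covered,
        }
    return report


# Example: one hour's snapshot of a single account.
instances = [
    Resource("m5.xlarge", "us-east-1", "Linux/UNIX", "default", count=2),
    Resource("m5.large", "us-east-1", "Linux/UNIX", "default"),
]
reservations = [
    Resource("m5.2xlarge", "us-east-1", "Linux/UNIX", "default"),
    Resource("c5.large", "us-east-1", "Linux/UNIX", "default"),
]
for pool, stats in sorted(ri_benefit(instances, reservations).items()):
    print(pool, stats)
```

Working in normalized units is what lets a single `m5.2xlarge` reservation cover two `m5.xlarge` instances in the same pool. The per-pool totals correspond to the running and reserved units the webhook payload is described as carrying.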
Visibility and Reporting of Your Reserved Instances
In addition to triggering off the webhook, nOps’ real-time data RI calculations can be consumed via a simple web interface (pictured below), downloaded in CSV format, or accessed through our API to drive sophisticated, automated optimization.
Cloud environments are very dynamic nowadays. Resources come and go every minute. By leveraging event-driven architecture, we provide customers with real-time analysis and automation to reserve and utilize the right capacity. nOps also offers a solution where we completely manage RIs for you, which we’ll discuss in a subsequent blog post.