Karpenter is a high-performance, flexible, open-source Kubernetes cluster autoscaler designed for AWS. It improves application availability and cluster efficiency by rapidly launching right-sized compute resources in response to changing application load.
Karpenter’s flexibility, ease of use, granular control and high level of automation are a significant upgrade over Cluster Autoscaler — helping you to more quickly adjust resources, efficiently scale, and continuously optimize.
With the recent release of Karpenter 1.0 General Availability (GA), it’s the perfect time to adopt Karpenter.
It only takes 20 minutes on average to migrate
Migrating to Karpenter via EKS managed node groups or Fargate is straightforward and involves minimal disruption, thanks to its compatibility with existing Kubernetes clusters and its ability to leverage standard Kubernetes resources.
For detailed instructions, check out our complete ebook guide to migrating to Karpenter.
The Ultimate Guide to Adopting Karpenter
Best Practices For Setting Up Karpenter
Here at nOps, we are huge fans of Karpenter and have helped many teams make the transition. Here are our best practices to keep in mind as you set up Karpenter.
Use Fargate or a Dedicated Node Group
Run the Karpenter controller itself on Fargate or on a small dedicated node group, so it never schedules onto capacity that it manages and might later disrupt.
Use Pod Disruption Budgets
Karpenter voluntarily drains and terminates nodes as part of its normal operation (consolidation, expiration, drift). The only way to keep services reliable through these disruptions is to tell Kubernetes how much disruption each Deployment or StatefulSet can tolerate. Refer to the Kubernetes documentation for more information.
Disruption budgets are critical for maintaining application availability during updates or scaling operations by limiting the number of pods that can be disrupted at any given time. They help prevent service downtime by avoiding the simultaneous termination of too many pods. Additionally, disruption budgets balance maintenance and stability, allowing necessary updates while keeping a minimum number of pods running, ensuring a reliable and stable Kubernetes environment.
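As a minimal sketch, a PodDisruptionBudget for a hypothetical Deployment labeled `app: checkout` (the name and label are placeholders) might look like this:

```yaml
# Keep at least 2 replicas of the hypothetical "checkout" app available
# while Karpenter drains or replaces nodes.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  minAvailable: 2          # or use maxUnavailable instead
  selector:
    matchLabels:
      app: checkout        # must match the Deployment's pod labels
```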
Avoid Custom Launch Templates
Karpenter guidelines recommend avoiding custom launch templates, since they don’t support automatic node upgrades, multi-architecture deployments, or securityGroup discovery.
Instead of launch templates, you can supply custom user data or reference custom AMIs directly in your EC2NodeClass (formerly AWSNodeTemplate) resources.
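For illustration, here is a rough EC2NodeClass sketch assuming the Karpenter v1 API; the role name, discovery tag, and cluster name are placeholders, and the commented-out line shows where a custom AMI ID could be pinned:

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: KarpenterNodeRole-my-cluster          # placeholder IAM role
  amiSelectorTerms:
    - alias: al2023@latest                    # EKS-optimized AMI, kept current
    # - id: ami-0123456789abcdef0             # or pin a custom AMI directly
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster    # placeholder discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster    # security groups found by tag, not launch template
```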
Configure Node Expiration On Your NodePool
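Node expiration caps how long any node lives; once a node reaches the limit, Karpenter drains and replaces it, which keeps AMIs and configuration from going stale. A minimal sketch, assuming the v1 API where `expireAfter` sits under the NodePool template spec (earlier beta APIs placed it under `disruption`):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      expireAfter: 720h        # recycle nodes after roughly 30 days
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
```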
Set Up NodePools According To Your Workload Types
Create dedicated NodePools for GPU workloads separate from general-purpose compute, so specialized and standard workloads each land on appropriate instance types.
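As an illustrative sketch (the instance categories and taint are assumptions, not prescriptions), a GPU NodePool can taint its nodes so that only pods tolerating the taint land there, while a separate general-purpose NodePool serves everything else:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule                       # only GPU workloads tolerate this
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g", "p"]                       # GPU instance families
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
```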
Set Up A Large Variety Of Instance Types For Better Availability
One of Karpenter’s main benefits is just-in-time capacity: it chooses the instance type that best fits your pending workloads. To leverage this feature fully, you need to allow a wide range of instance types in your NodePool requirements.
If you limit the instance types too narrowly, you won’t be able to maximize the benefits of using Karpenter.
Enabling broader instance type usage enhances availability by allowing Karpenter to choose from a wider range of instances, ensuring that capacity can always be found, even during high demand periods. This flexibility reduces the chances of resource shortages and improves cluster resilience. Additionally, using diverse instance types optimizes spot instance utilization by leveraging varying prices and availability of different spot markets. This approach not only reduces costs but also increases the likelihood of securing Spot instances, providing both economic and operational benefits for Kubernetes clusters managed by Karpenter.
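In practice, this means keeping NodePool requirements broad and only excluding what you know you don’t want. A sketch of such requirements (the exact categories, sizes, and generations are illustrative):

```yaml
# Inside spec.template.spec of a NodePool: broad requirements let Karpenter
# choose from many instance types and Spot capacity pools.
requirements:
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values: ["c", "m", "r"]
  - key: karpenter.k8s.aws/instance-cpu
    operator: In
    values: ["4", "8", "16", "32"]
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt
    values: ["2"]                        # skip very old generations
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]
```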
Use Spot Instances With Interruption Handling
Spot instances provide significant cost savings compared to On-Demand instances, but AWS can reclaim them with only a two-minute notice when it needs the capacity back.
Enabling interruption handling lets Karpenter respond to involuntary events that would otherwise disrupt workloads: Spot interruption notices, scheduled maintenance events, and instance stopping or terminating events. Karpenter cordons, drains, and replaces the affected nodes before the disruption hits.
To enable it, point Karpenter at the SQS queue that receives these events via the interruption-queue setting (settings.interruptionQueue in recent Helm charts; aws.interruptionQueueName in older releases). If you do so, make sure you are not also running AWS Node Termination Handler, as the two will contend for the same events.
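A minimal sketch of the relevant Helm values, assuming you have already created the SQS queue and EventBridge rules (the cluster and queue names are placeholders):

```yaml
# values.yaml for the Karpenter Helm chart
settings:
  clusterName: my-cluster                   # placeholder
  interruptionQueue: Karpenter-my-cluster   # SQS queue that receives Spot/health events
```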
Specify resources for your deployments/pods
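Karpenter sizes and selects nodes based on the resource requests of pending pods, so every container should declare them; without requests, Karpenter cannot make sensible provisioning or bin-packing decisions. A sketch with placeholder values:

```yaml
# Inside each container spec of your Deployment or pod (values are illustrative).
resources:
  requests:
    cpu: "500m"          # Karpenter provisions capacity based on these requests
    memory: 512Mi
  limits:
    memory: 1Gi          # memory limit guards against node-level memory pressure
```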
Distribute pods across multiple nodes and zones
Distributing pods across multiple nodes and zones enhances resilience and availability in Kubernetes applications. By spreading pods, the risk of a single point of failure affecting service is minimized, as workloads can continue running on other nodes or zones if one fails.
Karpenter automates this distribution by provisioning nodes across availability zones as workloads demand, balancing load and optimizing resource usage. When your pods declare topology spread constraints, Karpenter honors them when launching capacity, so pods are placed according to your rules, preventing resource contention and keeping services running even during node or zone disruptions.
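For example, a topology spread constraint in the pod template (the `app: checkout` label is a placeholder) asks the scheduler, and therefore Karpenter, to balance replicas across zones and nodes:

```yaml
# In the pod template spec of a Deployment
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone    # spread across availability zones
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: checkout
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname         # spread across individual nodes
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: checkout
```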
Best Practices For Migrating to Karpenter
Migrating to Karpenter is straightforward, taking only 20 minutes on average. You simply install Karpenter on your cluster, configure a few provisioning specifications based on your needs, and it seamlessly takes over the node provisioning process with minimal disruption to your existing operations.
For detailed instructions and a full list of best practices, check out our complete ebook guide to adopting Karpenter.
Best Practices For Configuring Karpenter
Prioritize Savings Plans and/or Reserved Instances
If you already hold Savings Plans or Reserved Instances, steer workloads onto the instance families they cover before reaching for Spot or On-Demand, so existing commitments don’t go unused.
Split Between On-Demand & Spot Instances
This configuration allows you to create a mixed instance setup where a specific percentage of your EKS nodes run on On-Demand instances, while the remaining portion runs on Spot instances. This setup is ideal for workloads that can tolerate interruptions and benefit from the cost savings of Spot instances.
To do this, create one NodePool for Spot and one for On-Demand with disjoint values for a unique new label such as capacity-spread, then use the label values to set the ratio. For a roughly 20/80 On-Demand/Spot split, you could assign the values ["2", "3", "4", "5"] to the Spot NodePool and ["1"] to the On-Demand NodePool, and have your workloads spread across the capacity-spread label with a topology spread constraint.
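A rough sketch of that 20/80 split, trimmed to the relevant fields and assuming the v1 API (the `capacity-spread` key is an arbitrary label name):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: capacity-spread                 # owns 4 of the 5 spread domains
          operator: In
          values: ["2", "3", "4", "5"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: on-demand
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: capacity-spread                 # owns 1 of the 5 spread domains
          operator: In
          values: ["1"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
```

Workloads then add a topology spread constraint with `topologyKey: capacity-spread` and `maxSkew: 1`, so their replicas distribute evenly across the five label values and roughly 80% of the capacity ends up on Spot.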
Tip: Balancing between Savings Plans, Reserved Instances, and Spot is extremely difficult to do manually.
Your workloads are changing every minute — trying to keep up with all the changes is a full-time job. But with the power of AI, you can stay continually optimized for minimal effort.
nOps Compute Copilot ensures all of your compute is on the most cost-effective capacity at all times, whether that’s RIs, SPs, or Spot. As the market and your utilization shift, it adjusts workload placement to maximize savings.
Fully utilize all of your Reserved Instances and Savings Plans every month and never over-commit again. nOps backs you with a 100% utilization guarantee (or you get a refund).
Book a demo to find out how easy it is to do RI, SP and Spot with nOps.
Protecting Batch Jobs During the Disruption (Consolidation) Process
This feature addresses the need to safeguard long-running batch jobs from being disrupted during the node consolidation process managed by Karpenter. Consolidation is a process where Karpenter identifies underutilized nodes that can be removed or replaced to reduce cluster costs. However, this process can disrupt running pods, including critical batch jobs.
By adding the `karpenter.sh/do-not-disrupt: "true"` annotation to a pod, you can protect it from being moved or interrupted until its work is complete, ensuring batch jobs run to completion without interference.
Alternatively, you can configure the NodePool’s `disruption` block, combining `consolidationPolicy` with `consolidateAfter`.
The `disruption` block tells Karpenter which nodes are eligible for consolidation and when; you can also effectively disable consolidation by setting `consolidateAfter: Never`.
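Sketches of both options follow; the Job name, image, policy, and durations are illustrative, not prescriptive:

```yaml
# Option 1: annotate the pod template of a batch Job so Karpenter will not
# voluntarily disrupt its node while the pod is running.
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report                     # hypothetical job
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"
    spec:
      restartPolicy: Never
      containers:
        - name: report
          image: my-registry/report:latest # placeholder image
---
# Option 2: tune consolidation behavior on the NodePool instead.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch
spec:
  disruption:
    consolidationPolicy: WhenEmpty   # only consolidate nodes with no running pods
    consolidateAfter: 10m            # wait 10 minutes before reclaiming empty nodes
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
```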
Advanced Best Practices
Update Nodes Using Drift
When you change a NodePool or EC2NodeClass (for example, rolling out a new AMI), Karpenter detects that existing nodes have drifted from the desired specification and gradually drains and replaces them, keeping the fleet up to date without manual node cycling.
Customizing Nodes with Your Own User Data Automation
By using the userData field in the EC2NodeClass, you can apply additional configuration to worker nodes at launch without deviating from the standard AWS EKS optimized AMI. This can include tasks like modifying Kubernetes settings, mounting volumes, or running specific startup scripts.
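For instance, a rough sketch assuming the AL2 AMI family, whose user data is plain shell that Karpenter merges ahead of the standard EKS bootstrap (AL2023 expects the nodeadm NodeConfig format instead); the role, tag, and sysctl tweak are placeholders:

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: custom-userdata
spec:
  role: KarpenterNodeRole-my-cluster        # placeholder IAM role
  amiSelectorTerms:
    - alias: al2@latest                     # AL2 family: plain-shell user data
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster  # placeholder discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  userData: |
    #!/bin/bash
    # Hypothetical extra setup, run before the standard EKS bootstrap:
    sysctl -w vm.max_map_count=262144
```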
Overprovision Capacity in Advance to Increase Responsiveness
This strategy is designed to ensure that you have immediate availability of compute resources when needed by preemptively provisioning extra capacity. This is particularly useful for scenarios where you know in advance that a large number of pods will need to be launched simultaneously, such as during data pipeline processing. By overprovisioning capacity ahead of time, you can significantly reduce the time it takes for your actual workloads to start, improving overall responsiveness and performance.
A sensible percentage might be 10-20% for mission-critical production environments.
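One common way to implement this (the names, replica count, and sizes are placeholders) is a low-priority "pause pod" Deployment: it holds spare capacity on Karpenter-provisioned nodes, and the scheduler preempts the pause pods the moment real workloads need the room, while Karpenter replenishes the headroom in the background:

```yaml
# Negative-priority class so headroom pods are always preempted first.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Placeholder pods that real workloads may preempt"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2                        # tune to the headroom you want (e.g. 10-20%)
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"               # each replica reserves 1 vCPU and 1 GiB
              memory: 1Gi
```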
Karpenter + nOps is Even Better
You can get the most out of Karpenter with nOps. nOps continuously manages Karpenter for you, providing the most efficient, reliable and cost-effective operations for less manual effort.
- Complete EKS visibility: Allocate 100% of your unified AWS spend, see the efficiency of your clusters, drill down to the container level, and more.
- AI-Powered Continuous Cost Savings: nOps Copilot is aware of all your purchase commitments and the Spot market, so you automatically get the best performance at the lowest costs.
- nOps is Invested in Karpenter’s Success: nOps has been optimizing and working with the Karpenter community since early beta versions and will continue to support the latest updates as they are released.
Karpenter + nOps are better together
nOps Compute Copilot built on Karpenter makes it simple for users to maintain stability and optimize resources efficiently.
New to Karpenter? No problem! The Karpenter experts at nOps will help you navigate Karpenter migration. We also support other autoscaling technologies like Cluster Autoscaler and ASGs.
nOps was recently ranked #1 in G2’s cloud cost management category and we optimize over $1.5 billion in cloud spend for our customers. Book a demo to find out how to save in just 10 minutes!