Before we dig into what Karpenter is, let's take a look at a common problem you can face with a commercial Kubernetes architecture. Let's assume that your organization runs several microservices, each with different resource requirements. For example, the main microservices that act as the external-facing API endpoints in your system will likely face a large number of requests per second and therefore require large amounts of CPU and memory. Meanwhile, the microservices that do calculations in the background can use large amounts of CPU, but not huge amounts of memory. During peak times, the number of calculations and requests could be large enough to warrant extra-large machines, whereas in the middle of the night, outside of business hours, the resource requirements could be a fraction of what they are at peak hours. So, how do you provision resources while taking all this into consideration?
One obvious solution is to create consumption profiles for each of the microservices and group them. You would then have CPU-intensive microservices, memory-intensive microservices, and so on. Next, you could create node groups with "c" family machines for the CPU-intensive ones and "r" family machines for the memory-intensive ones. This is a great step toward making your cluster as cost-efficient as possible, but one problem from the above case remains. Imagine you went with a c5d.4xlarge node group since it was the best option for your CPU-intensive workloads. As peak time is reached, the cluster autoscaler would scale up the number of instances to match demand. Then, in the middle of the night, when your system falls to an idle state, the instances would scale down. However, at minimum, you would still be running one 4xlarge machine for a load that needs only a fraction of its capacity, and that is not cost-effective.
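To make the idle-hours problem concrete, here is a minimal back-of-the-envelope sketch in Python. The prices are hypothetical cost units rather than real AWS pricing, and the eight-hour idle window and the `idle_cost` helper are assumptions made purely for illustration.

```python
# Hypothetical hourly prices in arbitrary cost units; these are NOT real AWS prices.
HOURLY_COST = {"4xlarge": 8.0, "medium": 0.5}


def idle_cost(instance_size: str, idle_hours: float, min_nodes: int = 1) -> float:
    """Cost of keeping a node group's minimum capacity running while the system is idle."""
    return HOURLY_COST[instance_size] * idle_hours * min_nodes


# Assumed length of the nightly off-peak window.
idle_hours_per_night = 8

fixed_floor = idle_cost("4xlarge", idle_hours_per_night)
right_sized_floor = idle_cost("medium", idle_hours_per_night)
overspend = fixed_floor - right_sized_floor

print(f"4xlarge floor: {fixed_floor:.1f} units per night")
print(f"medium floor:  {right_sized_floor:.1f} units per night")
print(f"overspend:     {overspend:.1f} units per night")
```

Whatever the real prices are, the shape of the result is the same: the node group's minimum size sets a cost floor, and the floor is paid every idle hour regardless of how little load there is.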
A more cost-effective method is to determine the load that each node group in your cluster is receiving, then dynamically switch out the node groups so that instead of having the 4xlarge machine running in the middle of the night, you would have something like a medium machine handling the load. But doing this manually is practically impossible, and writing scripts to handle this node group switching is a massive task. This is where Karpenter comes in. Karpenter is an open-source infrastructure scaling solution built by AWS that aims to replace the Kubernetes Cluster Autoscaler with a much more intelligent scaling tool. Once you set up Karpenter, it watches for workloads that cannot be scheduled on the existing capacity and dynamically creates and removes nodes, mixing and matching machine types to ensure the best cost efficiency.
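The idea of matching machine size to the current load can be sketched in a few lines of Python. This is a toy illustration only, not Karpenter's actual algorithm, which is considerably more involved. The `InstanceType` class, the `CATALOG` entries, their cost figures, and the `cheapest_fit` function are all hypothetical names and values introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpu: int
    memory_gib: int
    hourly_cost: float  # hypothetical cost units, not real pricing


# Hypothetical catalog of machine sizes to choose from.
CATALOG = [
    InstanceType("small", 2, 4, 1.0),
    InstanceType("large", 8, 16, 4.0),
    InstanceType("xlarge", 16, 32, 8.0),
]


def cheapest_fit(
    cpu_needed: float,
    memory_needed_gib: float,
    catalog: Sequence[InstanceType] = CATALOG,
) -> Optional[InstanceType]:
    """Return the cheapest instance type that covers the requested CPU and memory."""
    candidates = [
        t for t in catalog
        if t.vcpu >= cpu_needed and t.memory_gib >= memory_needed_gib
    ]
    if not candidates:
        return None  # nothing in the catalog is big enough
    return min(candidates, key=lambda t: t.hourly_cost)


# Peak hours: heavy demand forces the largest machine.
peak_choice = cheapest_fit(cpu_needed=12, memory_needed_gib=20)
# Middle of the night: a small machine is enough.
night_choice = cheapest_fit(cpu_needed=1, memory_needed_gib=2)

print(peak_choice.name, night_choice.name)
```

The same demand-driven selection, repeated as load rises and falls, is what replaces a fixed node group size: the machine type becomes an output of the current load rather than an up-front decision.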
Karpenter does this by getting permissions to observe your workloads and to launch and terminate instances, looking at the load and adding or removing nodes as needed. The nodes it adds aren't affiliated with a node group at all, and they can run alongside a node group that you created yourself. Basically, it acts like a person with admin access over your EKS cluster sitting down and handling the infrastructure management 24x7.
Now that you have had an introduction to what Karpenter is, let's move on to the Karpenter lab.