Kubernetes Autoscaling: HPA vs VPA, A Complete Guide
What is Kubernetes?
Kubernetes is an open-source container orchestration engine for automating deployment, scaling, and management of containerized applications. It's supported by all hyperscaler cloud providers and widely used across the industry. Amazon, Google, IBM, Microsoft, Oracle, Red Hat, SUSE, Platform9, IONOS and VMware offer Kubernetes-based platforms or infrastructure as a service (IaaS) that deploys Kubernetes.
Vertical Scaling
Vertical scaling means increasing the amount of CPU and RAM used by a single instance of your application.
For example, if we deployed our application to a virtual machine (or an EC2 instance) with 8 GiB of RAM and 1 CPU, and our application is getting more traffic, we can vertically scale the app by increasing the RAM to 16 GiB and adding one more CPU.
A drawback to this approach is that it has limits: at some point you won't be able to scale any further. That's why we need horizontal scaling as well.
Horizontal Scaling
Horizontal scaling means increasing the number of instances that run your application.
For example, if we deployed our application to a virtual machine (or an EC2 instance), and our application is getting more traffic, we can horizontally scale the app by adding one more instance and using a load balancer to split the traffic between them.
If you are using a cloud provider (like AWS), you can theoretically add an unlimited number of instances (at a cost, of course).
Why do we need Autoscaling?
Autoscaling means automatically scaling your application, horizontally or vertically, based on one or more metrics (like CPU or memory utilization) without human intervention.
We need autoscaling because we want to respond to increasing traffic as quickly as possible. We also want to save money by running as few instances, with as few resources, as possible.
In Kubernetes, we use the Vertical Pod Autoscaler (VPA) and the Horizontal Pod Autoscaler (HPA) to achieve autoscaling.
Install the metrics server
For the Horizontal Pod Autoscaler and Vertical Pod Autoscaler to work, we need to install the metrics server in our cluster. It collects resource metrics from kubelets and exposes them in the Kubernetes API server through the Metrics API. The Metrics API can also be accessed via kubectl top, making it easier to debug autoscaling pipelines.
To install it, run:
$ kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
To verify the installation, run:
$ kubectl top pods -n kube-system
This should return resource metrics for all pods in the kube-system namespace.
Vertical Pod Autoscaler
Vertical Pod Autoscaler (VPA) allows you to increase or decrease your pods' resources (RAM and CPU) based on a selected metric. The VPA can suggest memory/CPU requests and limits, and it can also update them automatically if the user enables that. This reduces the time engineers spend running performance/benchmark tests to determine the correct values for CPU and memory requests/limits.
VPA doesn't come with Kubernetes by default, so we need to install it first.
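The VPA components live in the kubernetes/autoscaler repository, and the documented way to install them is the vpa-up.sh script it ships with:

```shell
# Clone the autoscaler repository and run the official VPA install script
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
```

This deploys the VPA recommender, updater, and admission controller into the cluster your current kubectl context points at.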
Example
Create a redis deployment and request far more memory and CPU than it needs. Then apply the manifest with kubectl to create the deployment.
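A minimal manifest along these lines would do (the file name and the deliberately oversized requests are illustrative, not from the original):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:7
        resources:
          requests:
            cpu: "1"      # deliberately oversized, so VPA has something to correct
            memory: 1Gi   # deliberately oversized
```

Assuming the manifest is saved as redis-deployment.yaml:
$ kubectl apply -f redis-deployment.yaml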
The next step is to create the VPA.
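A VPA manifest targeting the redis deployment above could look like this (the name and the min/max bounds are illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: redis-vpa
spec:
  targetRef:                 # which resource this VPA manages
    apiVersion: apps/v1
    kind: Deployment
    name: redis
  updatePolicy:              # how the VPA applies its recommendations
    updateMode: "Auto"
  resourcePolicy:            # optional per-container bounds
    containerPolicies:
    - containerName: redis
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "1"
        memory: 1Gi
```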
As you can see, the VPA definition file consists of three parts:
- targetRef – defines the target of this VPA. It should match the deployment we created earlier.
- updatePolicy – tells the VPA how to update the target resource.
- resourcePolicy – optional. It allows us to be more flexible by defining minimum and maximum resources for a container, or to turn off autoscaling for a specific container, using containerPolicies.
VPA update Policy
Here are all valid options for updateMode in VPA:
- Off – VPA will only provide recommendations; we then need to apply them manually if we want to. This is best if we want to use VPA just to get an idea of how many resources our application needs.
- Initial – VPA only assigns resource requests on pod creation and never changes them later. It will still provide us with recommendations.
- Recreate – VPA assigns resource requests at pod creation time and updates them on existing pods by evicting and recreating them.
- Auto – It automatically recreates the pod based on the recommendation. It's best to use a PodDisruptionBudget to ensure that at least one replica of our deployment is up at all times. This keeps our application available while pods are being replaced. For more information, please check this.
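For reference, a PodDisruptionBudget guarding the redis pods above might look like this (the name is illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: redis-pdb
spec:
  minAvailable: 1          # never evict below one available replica
  selector:
    matchLabels:
      app: redis
```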
Horizontal Pod Autoscaler
Horizontal Pod Autoscaler (HPA) allows you to increase or decrease the number of pods in a deployment automatically based on a selected metric. HPA comes with Kubernetes by default.
Example
Create an nginx deployment and make sure you define resource requests and limits. Then apply the manifest with kubectl to create the deployment.
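A minimal manifest along these lines would do (the file name and the resource values are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        resources:
          requests:
            cpu: 100m      # HPA needs requests set to compute utilization
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
```

Assuming the manifest is saved as nginx-deployment.yaml:
$ kubectl apply -f nginx-deployment.yaml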
The next step is to create the HPA.
This HPA will use CPU utilization to scale the deployment: it targets 55% average utilization, scaling up when usage is above that and scaling down when it's below. To create the HPA, we apply its manifest with kubectl.
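An HPA manifest matching that description could look like this (the name and the replica bounds are illustrative; only the 55% target comes from the text):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 55   # target average CPU utilization
```

Assuming it is saved as nginx-hpa.yaml:
$ kubectl apply -f nginx-hpa.yaml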
There are many more configuration options we can use to make our HPA more stable and useful. Here's an example with a common configuration:
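A sketch of such a configuration, combining CPU and memory metrics with scale-up/scale-down behaviors (all specific numbers here are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 55
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # wait before acting on spikes
      policies:
      - type: Percent
        value: 100                      # at most double the replicas per minute
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # scale down conservatively
      policies:
      - type: Pods
        value: 1                        # remove at most one pod per minute
        periodSeconds: 60
```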
In this example, we use CPU and memory utilization, and we can add more metrics if we want. We also defined scaleUp and scaleDown behaviors to tell Kubernetes how we want scaling to happen. For more info, please check this.
Custom metrics
In some applications, scaling based on memory or CPU utilization is not that useful, perhaps because the application mostly performs blocking tasks (like calling external APIs) that don't consume many resources. In this case, scaling based on the number of requests makes more sense.
Since we are using the autoscaling/v2 API version, we can configure an HPA to scale based on a custom metric (one that is not built in to Kubernetes or any Kubernetes component). The HPA controller then queries for these custom metrics from the Kubernetes API.
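As a sketch, assuming a metrics adapter (such as the Prometheus Adapter) exposes a hypothetical per-pod http_requests_per_second metric through the custom metrics API, a request-based HPA could look like this:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa                          # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                            # illustrative target deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # hypothetical metric from an adapter
      target:
        type: AverageValue
        averageValue: "100"              # target average requests/sec per pod
```

Without a metrics adapter installed, this HPA would have nothing to query; the adapter is what bridges your monitoring system to the custom metrics API.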
Conclusion
Autoscaling is a powerful feature. It allows our application to adapt to load changes automatically, without any human intervention. We can use the Vertical Pod Autoscaler to help us determine the resources needed for our application, and we can use the HPA to add or remove replicas dynamically based on CPU and/or memory utilization. It's also possible to scale based on a custom metric like RPS (requests per second) or the number of messages in a queue if we use an event-driven architecture.