This is a complete guide to Vertical Pod Autoscaling in Kubernetes.
In this new guide, you’ll learn:

  • Why do we need Vertical Pod Autoscaling?
  • Kubernetes Resource Requirements Model
  • What is Vertical Pod Autoscaling?
  • Understanding Recommendations
  • When to use VPA?
  • VPA Limitations
  • Real-World Examples
  • How does VPA work?
  • VPA’s Recommendation model
  • Lots more


Let’s get started.

Why do we need Vertical Pod Autoscaling?

When you deploy a new application to Kubernetes, you need to specify its resource requirements. Typically, engineers start with a number copy-pasted from somewhere else. As you develop and deploy more and more applications, these guesses multiply, and the difference between actual usage and requested resources compounds.

Guessing correct resource requirements is a nuisance to the developers. It is hard to estimate how many resources an application needs to run optimally: the right combination of CPU power, memory, and the number of concurrently running replicas.

Over time application usage patterns might change. Some applications will need more CPU and Memory. Other applications are less popular; thus, the resource requirements should be smaller. 

  • Undersubscription is typically fixed by DevOps or SREs when they get paged: Site Reliability Engineers see that the application is dropping end-user requests due to Out-of-Memory kills, or that it is slow due to CPU throttling.
  • Oversubscription doesn’t cause an immediate problem, but it contributes to massive aggregate resource wastage: your infrastructure or platform team keeps adding Kubernetes nodes even though overall resource utilization is low.

These are the problems autoscaling tries to solve. Horizontal autoscaling addresses running the optimal number of replicas for your application; for example, you may be running too many Pods and wasting resources.

Meanwhile, Vertical Autoscaling solves the problem of setting correct CPU and memory requirements. In this article, we will only look at Vertical Pod Autoscaling.

Let’s start first by understanding the Kubernetes resource requirements model.

Kubernetes Resource Requirements Model

Kubernetes makes users specify resource requirements using resource requests and resource limits. Let’s start with resource requests:

Resource requests are a reserved amount of resources for your application. You can specify the resource request for Containers in a Pod. Then the scheduler uses this information to decide where to place the Pod. You can think of resource requests as the minimum amount of resource your Pod needs to operate successfully.

It’s important to note that your application can use more than the requested resources if the node has some slack available. Limits let you specify the maximum amount of resources your Container can use. If your Container uses more memory than its limit, it gets Out-of-Memory killed; if it uses more CPU time than the limit, it gets throttled.

Limits are effectively a safety valve. They protect you from consuming an unbounded amount of memory if your application has a memory leak. Similarly, they prevent one application from starving others of CPU. Imagine somebody deploying a Bitcoin miner, which would cause CPU starvation for all the applications in the cluster.

Importantly, if there are no slack resources on the scheduled node, you won’t get anything beyond your request: you are only guaranteed the amount you requested, and usage above that depends on available capacity.

Additionally, if you specify limits but not requests, Kubernetes automatically sets the requests equal to the limits.

Setting only requests and skipping limits is a common mistake. Many users do it hoping their application can consume unbounded resources and never deal with Out Of Memory kills or CPU throttling. But a container without limits can destabilize the whole node when it misbehaves. So make sure to set both requests and limits for the best outcome.
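For reference, a minimal (hypothetical) container spec that sets both requests and limits might look like this; the names and values are illustrative, not a recommendation:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app        # hypothetical Pod name
spec:
  containers:
    - name: app
      image: example/app:1.0   # hypothetical image
      resources:
        requests:
          cpu: 100m        # reserved: the scheduler guarantees this
          memory: 256Mi
        limits:
          cpu: 500m        # throttled above this
          memory: 512Mi    # OOM-killed above this
```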

Moreover, this resource model is extensible:

There can be different compute resources, such as ephemeral storage, GPUs, and Linux kernel huge pages.

For this article, we focus only on CPU and memory, as currently Vertical Pod Autoscaler works only on these resources. If you are interested in learning more, you can read the Managing Resources for Containers documentation.

What is Vertical Pod Autoscaling?

Vertical Pod Autoscaling (VPA in short) provides an automatic way to set Container’s resource requests and limits. It uses historic CPU and memory usage data to fine-tune the Container’s resource requirements.

VPA’s primary goal is to reduce resource wastage while minimizing the risk of performance degradation due to CPU throttling or errors due to Out Of Memory kills. 

VPA is maintained by engineers from Google and builds on their experience with Autopilot, a similar in-house system for Google’s container orchestrator, Borg. Here is what Google found by using Autopilot in production:

In practice, Autopiloted jobs have a slack of just 23%, compared with 46% for manually-managed jobs. Additionally, Autopilot reduces the number of jobs severely impacted by OOMs by a factor of 10.

Autopilot: workload autoscaling at Google

You can learn more about it in the paper Autopilot: workload autoscaling at Google

VPA introduces a couple of Custom Resource Definitions (CRDs for short) to control the automatic recommendation behavior. Typically, developers add a VerticalPodAutoscaler object to their application deployments.

Let’s figure out how to use it.

How to Use Vertical Pod Autoscaling?

VPA custom resource definition gives many options to control recommendations. To get a better idea of using Vertical Pod Autoscaler, let’s take a look at this example VerticalPodAutoscaler object:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: prometheus-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: StatefulSet
    name: prometheus
  updatePolicy:
    updateMode: Recreate
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed:
          cpu: 0m
          memory: 0Mi
        maxAllowed:
          cpu: 1
          memory: 500Mi
        controlledResources: ["cpu", "memory"]
        controlledValues: RequestsAndLimits

You start writing a VerticalPodAutoscaler by setting targetRef, which points to the Kubernetes controller object that manages your Pods.

VPA supports all well-known controller types, such as Deployment, StatefulSet, DaemonSet, and CronJob. It should also work with any custom type that implements the scale subresource; VPA retrieves the set of Pods via the controller’s ScaleStatus method. In this example, we are autoscaling a StatefulSet named prometheus.

Next, the updatePolicy field allows you to choose the operation mode for this controller via updateMode. There are multiple options:

  • Off – VPA will not automatically change resource requirements. Autoscaler computes the recommendations and stores them in the VPA object’s status field.
  • Initial – VPA only assigns resource requests on pod creation and never changes them later.
  • Recreate – VPA assigns resource requests on pod creation and updates them on existing pods by evicting them when the requested resources differ significantly from the new recommendation. 
  • Auto mode – currently does the same as Recreate. In the future, it may take advantage of restart-free updates once they are available.

Then, for each Container in the Pod, we can define a resourcePolicy. Resource policies let you choose which Containers get resource recommendations and how they are applied.

You provide a list of resource policies that are matched by containerName. You can select a specific container in the Pod to match a resource policy. Additionally, you can set containerName to '*' to establish a default resource policy that applies when no other policy matches the Container name.

Optionally resource policies allow you to cap resource recommendation to a range defined in minAllowed and maxAllowed. If you don’t set minAllowed and maxAllowed, resources are not limited.

Additionally, you can choose which resources will get recommendations by setting controlledResources. The only supported values are cpu and memory. If not specified, VPA computes recommendations for both CPU and memory.

Lastly, setting controlledValues allows you to choose whether to update the Container’s resource requests – RequestsOnly option or both resource requests and limits – RequestsAndLimits option. The default value is RequestsAndLimits.

If you choose the RequestsAndLimits option, requests are computed based on actual usage, while limits are calculated from the current Pod’s request-to-limit ratio. For example, if you start with a Pod that requests 1 CPU and is limited to 2 CPUs, VPA will always set the limit to be twice the request. The same principle applies to memory. So in RequestsAndLimits mode, treat your initial application resource requests and limits as a template.
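The ratio-preserving behavior can be illustrated with a small sketch (the function name and numbers are ours, not VPA code):

```python
def limit_from_ratio(original_request: float, original_limit: float,
                     recommended_request: float) -> float:
    """Scale the limit so the original request-to-limit ratio is preserved."""
    ratio = original_limit / original_request
    return recommended_request * ratio

# Pod starts with 1 CPU requested and a 2 CPU limit (ratio = 2).
# If VPA recommends a 410m request, the limit becomes 820m.
print(limit_from_ratio(1.0, 2.0, 0.410))  # 0.82
```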

You can simplify the VPA object by using Auto mode and computing recommendations for both CPU and memory:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-recommender
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: vpa-recommender
  updatePolicy: 
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        controlledResources: ["cpu", "memory"]

Now let’s look at recommendations provided by the Vertical Pod Autoscaler in the status field of VPA custom resource definition.

Understanding Recommendations

After you apply the VerticalPodAutoscaler object, VPA starts collecting usage data and computing resource recommendations. After some time passes, you should see resource recommendations in your VerticalPodAutoscaler object’s status field.

You can view estimations by executing:

kubectl describe vpa NAME

Let’s analyze an example status report:

Status: 
  Conditions:
    Last Transition Time:  2020-12-23T08:03:07Z
    Status:                True
    Type:                  RecommendationProvided
  Recommendation:
      Container Recommendations:
      Container Name:  prometheus
      Lower Bound:
          Cpu:     25m
          Memory:  380220488
      Target:
          Cpu:     410m
          Memory:  380258472
      Uncapped Target:
          Cpu:     410m
          Memory:  380258472
      Upper Bound:
          Cpu:     704m
          Memory:  464927423

As you can see, there are four different estimations provided for the prometheus container. Memory estimation values are in bytes. In CPU estimations, m means millicores. Let’s figure out what these estimations mean:

The lower bound is the minimum estimation for the Container. This amount is not guaranteed to be sufficient for the application to be stable. Running with smaller CPU and memory requests is likely to have a significant impact on performance or availability.

The upper bound is the maximum recommended resource estimation for the Container. Resource requests higher than these values are likely to be wasted.

Target estimation is the one we will use for setting resource requests.

All of these estimations are capped based on the minAllowed and maxAllowed container policies.

The uncapped target estimation is a target estimation produced if there were no minAllowed and maxAllowed restrictions. 

Why do we need four estimations? Vertical Pod Autoscaler uses the lower and upper bounds to decide on eviction: if the current resource request is below the lower bound or above the upper bound, and there is at least a 10% difference between the current request and the target estimation, eviction may happen.
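That eviction rule can be sketched as follows (our own simplification, not the actual Updater code):

```python
def should_evict(current: float, lower: float, upper: float, target: float,
                 min_change: float = 0.10) -> bool:
    """Evict when the current request is outside [lower, upper] AND the
    target differs from the current request by at least min_change (10%)."""
    out_of_bounds = current < lower or current > upper
    significant = abs(target - current) / current >= min_change
    return out_of_bounds and significant

# Request above the upper bound and 30% away from target: evict.
print(should_evict(current=1.0, lower=0.5, upper=0.9, target=0.7))   # True
# Out of bounds, but only a 5% difference from target: don't evict.
print(should_evict(current=1.0, lower=0.5, upper=0.9, target=0.95))  # False
```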

One neat thing is that VPA adds Pod annotations when it changes resource requirements. If you describe a Pod controlled by VPA, you can see annotations like vpaObservedContainers, which lists the observed Containers, and vpaUpdates, which describes the actions taken. Additionally, the annotations tell you whether the recommendation was capped by minAllowed, maxAllowed, or a Kubernetes LimitRange object. Here is an example of Pod annotations:

apiVersion: v1                                                      
kind: Pod                                                           
metadata:                                                           
  annotations:                                                      
    vpaObservedContainers: recommender                              
    vpaUpdates: 'Pod resources updated by vpa-recommender: container 0: cpu request, memory request, cpu limit, memory limit' 

Let’s figure out when to use Vertical Pod Autoscaler.

When to use VPA?

Firstly, you can add VPA to your databases and stateful workloads if you are running them on top of Kubernetes. Typically, stateful workloads are harder to scale horizontally, so having an automatic way to scale up resource consumption or compute a good estimation helps solve capacity problems. If your database is not highly available or cannot tolerate disruptions, you can set the mode to Initial or Off. In these modes, VPA never evicts Pods: it either only publishes recommendations (Off) or applies them only when Pods are recreated (Initial).

Secondly, another good use-case is CronJobs. Vertical Pod Autoscaler can learn the resource consumption for your recurring jobs and then apply the learned recommendation to a newly scheduled run. You just set the recommendation mode to Initial. This way, each recently launched Job will get recommendations calculated from the previous Job’s run. It’s important to note this will not work for short-lived (less than 1 minute) jobs.

Thirdly, stateless workloads are a good target for Vertical Pod Autoscaling. Stateless applications are usually more tolerant of disruptions and evictions, so they are an excellent place to start; you can test the Auto or Recreate modes. One significant limitation is that VPA won’t work with the Horizontal Pod Autoscaler if you are autoscaling on the same metrics (CPU or memory). Typically, VPA fits applications that have predictable resource usage and where running more than a few replicas doesn’t make sense; for those, scaling with the Horizontal Pod Autoscaler doesn’t help, so VPA is the right choice.

That said, Vertical Pod Autoscaler currently has some limitations, so it’s essential to know when not to use it.

VPA Limitations

Firstly, don’t use Vertical Pod Autoscaling with JVM-based workloads. The JVM provides limited visibility into actual memory usage, so recommendations may be off.

Secondly, don’t use Vertical Pod Autoscaler with the Horizontal Pod Autoscaler, which scales based on the same metrics: CPU or memory usage. However, you can use VPA with HPA, which uses custom metrics.

Thirdly, a VPA recommendation might exceed available resources, such as your cluster capacity or your team’s quota, leaving Pods Pending. You can set LimitRange objects to cap resource requests per namespace. Additionally, you can set the maximum allowed resource recommendation per Pod in a VerticalPodAutoscaler object.

Fourthly, VPA in Auto or Recreate mode won’t evict Pods that run with a single replica, as this would cause disruption. If you still want automatic recommendations for single-replica applications, you can change this behavior via the --min-replicas flag on the updater component.

Fifthly, when using RequestsAndLimits mode, set the initial CPU limit to a high multiple of the request. There is a known Kubernetes/Linux kernel issue that leads to overly aggressive throttling; many Kubernetes users either disable CPU throttling altogether or set huge CPU limits to work around it. Typically, this is not a problem, as CPU utilization on cluster nodes is low.

Sixthly, not all VPA recommendations can be applied. Say you run a highly available system with two replicas, and one of the Containers starts to grow its memory quickly; the quick growth may get the Container Out Of Memory killed. Because OOM-killed Pods aren’t rescheduled, VPA can’t apply the new resource recommendation: eviction won’t happen, since one Pod is always either not ready or crash-looping. Thus, you are in a deadlock. The only way to resolve this situation is to kill the Pod manually and let the new resource recommendations sink in.

Now, let’s take a look at some real-world examples.

Real-World Examples

MongoDB Cluster

Let’s start with a replicated MongoDB cluster with 3 replicas. The initial StatefulSet’s resource requirements are:

resources:                                                                                                                      
  limits:                                                                                                                       
    memory: 10Gi                                                                                                                
  requests:                                                                                                                     
    memory: 6Gi     

We set the Pod Disruption Budget to allow only a single replica to be down.

Then, we deploy the StatefulSet without Vertical Pod Autoscaling and let it run for a while.

This graph shows the MongoDB cluster memory usage. Each line is a replica. As you can see, the actual memory usage for two replicas is close to 3GiB and one close to 1.5GiB.

After some time passes, we automate the resource requirements with a VerticalPodAutoscaler object in Auto mode, autoscaling both CPU and memory. VPA computes the recommendation and evicts Pods one by one. Here is what the recommendation looks like:

Container Recommendations: 
  Container Name:  mongodb
  Lower Bound:
    Cpu:     12m
    Memory:  3480839981
  Target:
    Cpu:     12m
    Memory:  3666791614
  Uncapped Target:
    Cpu:     12m
    Memory:  3666791614
  Upper Bound:
    Cpu:     12m
    Memory:  3872270071

VPA set memory requests to 3.41GiB and limits to 5.69GiB (the same ratio as 6GiB to 10GiB), and CPU requests and limits to 12 millicores.
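We can verify those numbers from the recommendation above (the rounding is ours):

```python
GIB = 2 ** 30

target_bytes = 3666791614        # Target memory from the recommendation
request_gib = target_bytes / GIB
print(round(request_gib, 2))     # 3.41 GiB requested

# Limits keep the original 6GiB-request : 10GiB-limit ratio.
limit_gib = request_gib * (10 / 6)
print(round(limit_gib, 2))       # 5.69 GiB limit
```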

Let’s see how this compares to the initial estimates. We saved roughly 2.6GiB of requested memory per Pod, or about 7.8GiB across the three replicas. It might not be a big difference for this single cluster, but it compounds if you run many MongoDB clusters.

etcd

Another example is etcd, a highly available database that uses Raft leader election. The initial resource estimate didn’t set any memory requests, just CPU:

Limits:
  cpu: 7
Requests:
  cpu: 10m

Then, we deploy the StatefulSet without Vertical Pod Autoscaling and let it run for a while.

This graph shows the etcd cluster memory usage. Each line is a replica. As you can see, one replica’s actual memory usage is close to 500MiB, while the other two use close to 300MiB.

This graph shows the etcd cluster CPU usage. As you can see, CPU usage is relatively constant, close to 0.03 CPU cores.

Here is what the VPA recommendation looks like:

  Recommendation:
    Container Recommendations:
      Container Name:  etcd
      Lower Bound:
        Cpu:     25m
        Memory:  587748019
      Target:
        Cpu:     93m
        Memory:  628694953
      Uncapped Target:
        Cpu:     93m
        Memory:  628694953
      Upper Bound:
        Cpu:     114m
        Memory:  659017860

VPA set memory requests to 599MiB with no limits, and CPU requests to 93 millicores (0.093 cores) with a limit of 65 cores, since the request-to-limit ratio is 700.

In total, VPA reserved more capacity for etcd to run successfully. Previously, we didn’t request any memory for this Pod, which could get it scheduled onto an overutilized node and cause issues. Similarly, the CPU request was too low for etcd to run reliably.

The interesting finding here is that the current leader uses significantly more memory than the secondaries, yet VPA gives the same recommendation to all replicas. Thus, there is a gap between what you reserve and what you use: the secondaries won’t use more than ~300MiB of memory until they become the leader, so the node will have some extra space.

In this example, though, the gap is justified. If the current leader stepped down, a secondary would take over the work and start using those reserved resources. If we didn’t reserve them, the new leader could be Out of Memory killed, which would cause downtime.

Backup CronJob

This last example shows a simple backup CronJob, which takes a copy of a MongoDB database and stores it in S3. It runs daily, and typically it takes about 12 minutes to complete.

The initial estimate didn’t set any resource requirements. Vertical Pod Autoscaler object is set in Initial mode autoscaling both CPU and Memory.

The first couple of runs don’t have any resource requirements, as VerticalPodAutoscaler takes some time to learn resource usage. During that time, VPA reports a “No pods match this VPA object” error. By the time the CronJob schedules a third Job, VPA has figured out these resource recommendations:

  Recommendation:
    Container Recommendations:
      Container Name:  backupjob
      Lower Bound:
        Cpu:     25m
        Memory:  262144k
      Target:
        Cpu:     25m
        Memory:  262144k
      Uncapped Target:
        Cpu:     25m
        Memory:  262144k
      Upper Bound:
        Cpu:     507m
        Memory:  530622257

During the next run, a Pod with requests of 25m CPU and 262144k memory is created. The excellent part is that, since VPA runs in Initial mode, no evictions or disruptions happen.

Now let’s figure out how Vertical Pod Autoscaling works:

How does VPA work?

Vertical Pod Autoscaler consists of three different components: 

  • Recommender, which uses some heuristics to compute recommendations;
  • Updater, which is responsible for evicting Pods, when there is a significant drift in resource requirements;
  • Admission Controller is a component responsible for setting Pod resource requirements.

In theory, you can easily swap any of the components with a custom-made one, and the system should still work. Let’s go over the components in more detail:

Recommender

Recommender provides the core resource estimation logic. It monitors actual resource consumption and Out Of Memory events, and provides recommended values for the Containers’ CPU and memory requests. It stores recommendations in the VerticalPodAutoscaler object’s status field.

You can choose how Recommender loads initial historical CPU and memory usage metrics. Recommender supports checkpoints (the default) and Prometheus. You can change this via the --storage flag.

Checkpoints store aggregated CPU and memory metrics in VerticalPodAutoscalerCheckpoint custom resource definition objects. You can describe the object to view the values stored. Recommender maintains checkpoints from real-time signals, which it starts collecting after historical metrics are loaded. 

When using the Prometheus option, VPA Recommender executes a PromQL query, which uses cAdvisor metrics. Recommender allows you to fine-tune labels used in the query. You can change the namespace, pod names, container names, Prometheus job name labels. In general, it will send PromQL queries similar to:

rate(container_cpu_usage_seconds_total{job="kubernetes-cadvisor"}[8d])

and:

container_memory_working_set_bytes{job="kubernetes-cadvisor"}

These queries return CPU and memory usage. The Recommender parses the results and uses them for resource recommendations.

Once it loads historical metrics, it starts polling real-time metrics from the Kubernetes API Server via the Metrics API (similar to the kubectl top command). Additionally, it watches Out Of Memory events to adapt to these situations quickly. VPA then computes the recommendations, stores them in VPA objects, and maintains checkpoints. You can configure the poll interval via the --recommender-interval flag.

We will go over how VPA computes recommendations in VPA’s Recommendation model section.

Updater

VPA Updater is the component responsible for updating Pods’ resource requirements to match recommendations. If the VerticalPodAutoscaler is in Recreate or Auto mode, Updater may decide to evict a Pod so that it is recreated with new resources. In the future, the Auto mode will most likely change to take advantage of restart-free, in-place updates, thus potentially avoiding evictions. The in-place Pod resource update feature is still in development; you can track the progress in this GitHub issue.

Updater has a set of safeguards, limiting Pods eviction:

Firstly, it won’t evict a Pod that doesn’t have at least two replicas. You can change this behavior via the --min-replicas flag.

Secondly, as it uses the eviction API, it respects Kubernetes pod disruption budgets. Pod disruption budgets allow you to set availability requirements that stop too many Pods from being evicted. For example, setting maxUnavailable to one allows evicting only a single Pod at a time. You can read more about PodDisruptionBudgets in the Kubernetes documentation.

Thirdly, by default, it will only evict up to 50% of the Pods from the same replica set. Even if you don’t use pod disruption budgets, it will still evict slowly. You can change this via the --eviction-tolerance flag.

Fourthly, you can configure a global rate limiter to slow down evictions via the --eviction-rate-limit and --eviction-rate-burst flags. By default, it is disabled.

Updater decides to evict Pods based on the lower and upper bounds: it evicts a Pod if the current request is below the lower bound or above the upper bound, and there is a significant difference between the current request and the target estimation. Currently, VPA Updater looks for at least a 10% difference.

After Updater evicts Pods, the last component, the Admission Controller, handles Pod creation and applies the recommendations.

Admission Controller

Admission Controller is a component responsible for setting Pod resource requirements.

Before Kubernetes schedules a Pod, the Admission Controller receives a webhook request from the Kubernetes API Server to update the Pod’s spec. Admission Controller does this via a mutating webhook configuration, explained in the Kubernetes Admission Control documentation. You can view your mutating webhooks by executing:

kubectl get mutatingwebhookconfigurations

If you installed VPA correctly, you should see a mutating webhook configuration for the VPA admission controller.

Once the Admission Controller receives a Pod creation request, it matches the Pod to a VerticalPodAutoscaler object. If there is no match, it returns the Pod unmodified. If the Pod matches a VPA object, then, based on that object’s configuration, it may update the Pod’s resource requests, or both requests and limits. Note: it will not change the Pod’s resource requirements if the update mode is Off.

Let’s figure out how VPA recommends resources.

VPA’s Recommendation model for CPU usage

Let’s say we have a container whose CPU usage we sampled every minute for 48 hours, and the CPU usage graph looks like this:

To compute the CPU recommendation, we create a histogram with exponentially growing bucket boundaries. The first bucket starts at 0.01 cores (10 millicores), and the last ends at roughly 1000 CPU cores. Each bucket grows exponentially at a rate of 5%.

When we add a CPU sample to the histogram, we find the bucket based on actual CPU usage and add a weight based on the current Container’s CPU request value. 

When CPU request increases, the bucket weight will increase too. This property makes previous observations less important, which helps to react to CPU throttling quickly.

Additionally, we decay the weight value over time, with a default half-life of 24 hours. Thus, if you add a new sample to the histogram that is 24 hours old, its weight will be half of the container request at the time. Decaying makes more recent samples have a more significant impact on the predictions than older values. You can change the half-life of the weights via the --cpu-histogram-decay-half-life flag.
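This request-scaled, half-life-decayed weighting can be sketched as follows (the function and parameter names are ours, not VPA’s):

```python
def sample_weight(cpu_request: float, age_hours: float,
                  half_life_hours: float = 24.0) -> float:
    """Weight of a CPU sample: the container's CPU request at sampling
    time, halved for every half-life of sample age."""
    return cpu_request * 0.5 ** (age_hours / half_life_hours)

print(sample_weight(cpu_request=1.0, age_hours=0))   # 1.0  (fresh sample)
print(sample_weight(cpu_request=1.0, age_hours=24))  # 0.5  (one half-life old)
print(sample_weight(cpu_request=1.0, age_hours=48))  # 0.25
```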

Let’s put our example usage graph shown in the first figure into this exponentially bucketed, request-weighted, decaying histogram. Let’s assume the CPU request was one core for all 48 hours.

The histogram looks like this:

Note: we only plotted buckets 0 to 36, as the other buckets are empty. The bucket values range from 0 to 0.958 CPU cores (rounded). The 37th bucket starts at 1.016; as our graph never reaches this value, it’s empty.
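The bucket boundaries consistent with the numbers above (first bucket of 0.01 cores, 5% exponential growth) can be reproduced with a geometric series; the helper function is our own sketch:

```python
def bucket_start(n: int, first: float = 0.01, ratio: float = 1.05) -> float:
    """Start of bucket n: 0.01 * (1.05^n - 1) / 0.05, i.e. the sum of
    exponentially growing bucket widths below it."""
    return first * (ratio ** n - 1) / (ratio - 1)

print(round(bucket_start(36), 3))  # 0.958 -- last non-empty bucket above
print(round(bucket_start(37), 3))  # 1.016 -- first empty bucket in our example
```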

Then VPA computes three different estimations: target, lower bound, and upper bound. It uses the 90th percentile for the target, the 50th percentile for the lower bound, and the 95th percentile for the upper bound.

Let’s compute the bounds for the example provided in the first figure. The values are:

  • Lower bound: 0.5467
  • Target: 1.0163
  • Upper bound: 1.0163

Note: the red line indicates the lower bound; the green line indicates both the target and upper bound, which for our example ended up with the same value.

After we compute the initial bounds, we add a safety margin so that our Container has some breathing room if it suddenly decides to consume more resources than before. VPA adds a fraction of the computed recommendation; by default, it’s 15%. You can configure this value via the --recommendation-margin-fraction flag.

Then, we apply a confidence multiplier to both the upper and lower bounds. The confidence multiplier is based on how many days of samples we have collected. For the upper bound, we compute the estimation as:

estimation = estimation * (1 + 1/history-length-in-days)

This formula shows that the more history we collect, the lower the multiplier. Thus, the upper bound converges down towards the target as time goes on. To better understand this formula, here are some sample multipliers for various history lengths:

  • 5 minutes: 289
  • 1 hour: 25
  • 1 day: 2
  • 2 days: 1.5
  • 1 week: 1.14
  • 1 week 1 day: 1.125

Our example had two days of metrics collected. So the upper bound confidence multiplier is 1.5.  

Similarly, for the lower bound estimation, we also apply a confidence multiplier, but we use a slightly different formula:

estimation = estimation * (1 + 0.001/history-length-in-days)^-2

This formula shows that the more history we collect, the bigger the multiplier. Thus, the lower bound grows up towards the target as time goes on. To better understand this formula, here are some sample multipliers for various history lengths:

  • 5 minutes: 0.6
  • 1 hour: 0.9537
  • 1 day: 0.9980
  • 2 days: 0.9990

As you can see, this rapidly approaches 1. Our example had two days of metrics collected, so the lower bound confidence multiplier is almost 1.
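Both multipliers can be written down together; this is a direct transcription of the two formulas above, with function names of our choosing:

```python
def upper_confidence(history_days: float) -> float:
    """Upper bound multiplier: shrinks towards 1 as history grows."""
    return 1 + 1 / history_days

def lower_confidence(history_days: float) -> float:
    """Lower bound multiplier: grows towards 1 as history grows."""
    return (1 + 0.001 / history_days) ** -2

print(round(upper_confidence(2), 2))   # 1.5   -- two days of history
print(round(lower_confidence(2), 4))   # 0.999
print(round(upper_confidence(7), 2))   # 1.14  -- one week of history
```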

Then, VPA checks whether the estimations are above a minimum threshold; if not, it raises them to the minimum. Currently, the minimum for CPU is 25 millicores, but you can change it via the --pod-recommendation-min-cpu-millicores flag.

After applying the safety margin and confidence multipliers to our example, our final estimation bounds are:

  • Lower bound: 0.626
  • Target: 1.168
  • Upper bound: 1.752
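Putting the safety margin and the confidence multipliers together reproduces these final bounds, up to small rounding differences (the code is our sketch of the steps described above):

```python
MARGIN = 0.15  # default --recommendation-margin-fraction

def final_bounds(lower: float, target: float, upper: float,
                 history_days: float):
    """Apply the 15% safety margin, then the confidence multipliers."""
    lower_conf = (1 + 0.001 / history_days) ** -2
    upper_conf = 1 + 1 / history_days
    margin = 1 + MARGIN
    return (lower * margin * lower_conf,
            target * margin,
            upper * margin * upper_conf)

# Raw percentiles from our example, with two days of history.
lo, tgt, up = final_bounds(0.5467, 1.0163, 1.0163, history_days=2)
print(round(lo, 2), round(tgt, 2), round(up, 2))  # ~0.63 1.17 1.75
```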

Lastly, VPA scales the bounds to fit into the minAllowed and maxAllowed values configured in the VerticalPodAutoscaler object. Additionally, if the Pod is in a namespace with a LimitRange configured, the recommendation is adjusted to fit the LimitRange’s rules.

VPA’s Recommendation model for Memory usage

Although most of the recommendation algorithm steps are the same, there are significant deviations from the CPU estimation. Let's start with memory usage that looks like this:

Note that the graph shows memory usage over seven days. The longer time interval is essential here, as Memory estimation starts by computing the peak value for each interval. We use the peak value rather than the whole distribution because we typically want to provision Memory close to the peak: underprovisioning terminates containers with OOM kills. CPU isn't as sensitive to this, as Pods get CPU throttled, not killed.

By default, the aggregation interval is 24 hours; you can change that via the --memory-aggregation-interval flag. Additionally, we only keep eight intervals, but you can change that via the --memory-aggregation-interval-count flag. Thus, by default, we keep 8 * 24 hours = 8 days' worth of memory peaks.
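A minimal sketch of this peak-per-interval aggregation (the function name and sample data are invented for illustration):

```python
def peak_per_interval(samples, interval_seconds=24 * 3600, max_intervals=8):
    """samples: list of (timestamp_seconds, memory_bytes) tuples."""
    peaks = {}
    for ts, usage in samples:
        bucket = ts // interval_seconds  # which aggregation interval
        peaks[bucket] = max(peaks.get(bucket, 0), usage)
    # Keep only the most recent `max_intervals` peaks.
    return [peaks[b] for b in sorted(peaks)[-max_intervals:]]

# Two days of toy samples: the peak of each 24h window survives.
samples = [(0, 100), (3600, 300), (90000, 250), (100000, 120)]
print(peak_per_interval(samples))  # [300, 250]
```
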

Let’s see how these peak aggregations look in our example:

Memory usage peak aggregation.

Additionally, if we see an Out Of Memory event during that time, we parse the evicted memory usage and turn it into a sample: the observed usage plus a safety margin of 20% or 100MiB, whichever is bigger. This makes VPA adapt to OOM kills quickly.
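Our reading of that rule as code, with a hypothetical helper name:

```python
def oom_sample(memory_used_bytes):
    # Bump the observed usage by 20% or 100MiB, whichever is bigger.
    bump = max(memory_used_bytes * 0.20, 100 * 1024**2)
    return memory_used_bytes + bump

print(oom_sample(200 * 1024**2) / 1024**2)      # 300.0 (the 100MiB floor wins)
print(round(oom_sample(1024**3) / 1024**2, 1))  # 1228.8 (20% of 1GiB wins)
```
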

After we have the peaks, we put the peak values into a histogram. VPA creates a histogram with exponentially growing bucket boundaries: the first bucket starts at 10MB, and each boundary grows by 5% over the previous one until it reaches roughly 1 TB.
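Assuming each boundary is 5% larger than the previous one, the boundaries can be generated like this (a sketch, not the VPA implementation):

```python
def bucket_boundaries(first=1e7, ratio=1.05, limit=1e12):
    # Start at 10MB and grow each boundary by 5% until we pass ~1TB.
    bounds = [first]
    while bounds[-1] < limit:
        bounds.append(bounds[-1] * ratio)
    return bounds

bounds = bucket_boundaries()
print(len(bounds))  # 237 boundaries span 10MB to ~1TB
```
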

Similarly, we decay each sample based on its age, with a default half-life of 24 hours: a sample that is 24 hours old gets a weight of 0.5. Decaying makes recent samples have a more significant impact on the predictions than older values. You can change the half-life of the weights via the --memory-histogram-decay-half-life flag.
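The decay weight is a standard exponential half-life; as a sketch:

```python
def sample_weight(age_hours, half_life_hours=24.0):
    # Weight halves for every half-life of age.
    return 0.5 ** (age_hours / half_life_hours)

print(sample_weight(0))   # 1.0  -> a fresh sample counts in full
print(sample_weight(24))  # 0.5  -> a day-old sample counts half
print(sample_weight(48))  # 0.25
```
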

Let’s see how the histogram looks for our example’s peak values:

Note: we only plotted buckets 16 through 38, as the other buckets are empty. The bucket values range from 225.62 MiB to 969.20 MiB (rounded). The 39th bucket has a value of 1088.10 MiB; it's empty as our usage never hits this value.

Then VPA computes three different estimations: target, lower bound, and upper bound. We use the 90th percentile for the target, the 50th percentile for the lower bound, and the 95th percentile for the upper bound.

For our example, all three estimations are the same: 1027.2 MiB.

Estimations after computing 50th, 90th and 95th percentile.
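Reading a percentile off such a weighted histogram amounts to walking the buckets in order until the cumulative weight crosses the requested fraction of the total. A sketch with invented data:

```python
def percentile(buckets, fraction):
    """buckets: ordered list of (upper_boundary, weight) tuples."""
    total = sum(w for _, w in buckets)
    cumulative = 0.0
    for boundary, weight in buckets:
        cumulative += weight
        if cumulative >= fraction * total:
            return boundary
    return buckets[-1][0]

hist = [(100, 1.0), (200, 1.0), (400, 6.0), (800, 2.0)]
print(percentile(hist, 0.50))  # 400
print(percentile(hist, 0.90))  # 800
```

This also shows why all three estimations can coincide, as in our example: when most of the decayed weight sits in a single bucket, the 50th, 90th, and 95th percentiles all land on the same bucket boundary.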

After we compute the initial bounds, we add a safety margin so that our Container has some breathing room if it suddenly consumes more resources than before. VPA adds a fraction of the computed recommendation; by default, it's 15%. You can configure this value with the --recommendation-margin-fraction flag.

Then, we apply a confidence multiplier to both the upper and lower bounds, based on how many days of samples we have collected. The formulas are the same as for the CPU estimations.

Then, VPA checks whether the estimations are above a minimum threshold; if not, it raises them to the minimum. The current minimum for Memory is 250 MiB. You can change it via the --pod-recommendation-min-memory-mb flag.

After applying the safety margin and the confidence multiplier, our final estimation values are:

Lower Bound: 1237422043 bytes = 1.15 GiB
Target Bound: 1238659775 bytes = 1.15 GiB
Upper Bound: 1857989662 bytes = 1.73 GiB
Note: the green line is the upper bound, and the red line is the lower bound. The target bound is not visible as it's close to the red line (the difference between the two is 1.18 MiB).

Lastly, VPA scales the bounds to fit into the minAllowed and maxAllowed values configured in the VerticalPodAutoscaler object. Additionally, if the Pod is in a namespace with a LimitRange configured, the recommendation is adjusted to fit the LimitRange's rules.


6 thoughts on "Vertical Pod Autoscaling: The Definitive Guide"

  1. Excuse me, this is a wonderful article, but I cannot understand this part:
    ```
    Recommendation:
      Container Recommendations:
        Container Name:  etcd
        Lower Bound:
          Cpu:     25m
          Memory:  587748019
        Target:
          Cpu:     93m
          Memory:  628694953
        Uncapped Target:
          Cpu:     93m
          Memory:  628694953
        Upper Bound:
          Cpu:     114m
          Memory:  659017860
    ```
    VPA set memory requests to 599 MiB and no limits; CPU requests to 93 millicores (0.093 cores) and limits to 65 cores, as the request-to-limit ratio is 700.

    How can I calculate the CPU limit (65 cores) from here?

    Also, from here:
    ```
    Container Recommendations:
      Container Name:  mongodb
      Lower Bound:
        Cpu:     12m
        Memory:  3480839981
      Target:
        Cpu:     12m
        Memory:  3666791614
      Uncapped Target:
        Cpu:     12m
        Memory:  3666791614
      Upper Bound:
        Cpu:     12m
        Memory:  3872270071
    ```
    VPA set memory requests to 3.41 GiB and limits to 5.6 GiB (the same ratio as 6 GiB to 10 GiB) and CPU requests and limits to 10 millicores.

    And how can I calculate the 5.6 GiB memory limit here?

    Thank you so much

    1. Just calculate the ratio. Let's say the initial Pod CPU request is 1 and the limit is 2. The ratio is 2x. If VPA estimates 0.1 CPU, then the limit would be set to 2 * 0.1 = 0.2.
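The arithmetic from that reply, as a tiny sketch (the function name is ours):

```python
def scaled_limit(original_request, original_limit, new_request):
    # Limits keep the original request-to-limit ratio.
    ratio = original_limit / original_request
    return new_request * ratio

# The reply's example: request 1 CPU, limit 2 CPUs, VPA target 0.1 CPU.
print(scaled_limit(1.0, 2.0, 0.1))  # 0.2

# The first comment's etcd case, assuming the original ratio was 700:
print(round(scaled_limit(0.1, 70.0, 0.093), 1))  # 65.1, i.e. ~65 cores
```
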
