The Horizontal Pod Autoscaler (HPA) is a Kubernetes component that automatically updates workload resources such as Deployments and StatefulSets, scaling them to match demand for applications in the cluster. Horizontal scaling means deploying more pods in response to increased load. It should not be confused with vertical scaling, which means allocating more resources (such as memory and CPU) to pods that are already running.
When load decreases and the number of pods exceeds the configured minimum, HPA notifies the workload resource, for example the Deployment object, to scale down.
HPA is widely used in scenarios where dynamic scaling is required. Here are some common use cases:
HorizontalPodAutoscaler calculates the number of replicas required for a deployment based on the metrics in the HPA configuration. Here is a step-by-step explanation of the calculation process:
Desired Replicas = ceil(Current Replicas × (Current CPU Utilization / Target CPU Utilization))
If the current CPU utilization is higher than the target, more replicas will be needed to balance the load. If it’s lower, fewer replicas will suffice.
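As a rough illustration of this calculation, take a Deployment with 4 replicas, a 70% CPU target, and a measured average utilization of 90%. These numbers are hypothetical, and the result is rounded up to a whole pod, as the HPA controller does:

```latex
\[
\text{Desired Replicas}
  = \left\lceil 4 \times \frac{90}{70} \right\rceil
  = \lceil 5.14 \rceil
  = 6
\]
```

With the same target and a measured average of 30%, the result would be ceil(4 × 30 / 70) = ceil(1.71) = 2, so HPA would scale down, subject to the configured minimum number of replicas.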
Cooldown periods: These prevent scaling actions from occurring too frequently. This helps in stabilizing the application performance and avoiding unnecessary resource allocation.
Itiel Shwartz
Co-Founder & CTO
In my experience, here are tips that can help you better utilize the Horizontal Pod Autoscaler (HPA):
Define appropriate resource requests and limits for your pods.
Use Prometheus or other monitoring tools to track metrics that trigger autoscaling.
Adjust the HPA scaling policies to match your application’s load patterns.
Use Vertical Pod Autoscaler (VPA) alongside HPA for optimal resource utilization.
Simulate various load conditions to ensure HPA responds correctly.
Horizontal pod autoscaling has been a feature of Kubernetes since version 1.1, meaning that it is a highly mature and stable API. However, the API objects used to manage HPA have changed over time. In the V2 API, the HPA was upgraded to support custom metrics, as well as metrics from objects not related to Kubernetes. This lets you scale workloads based on metrics like HTTP request throughput or the size of the message queue.
You can define scaling characteristics for your workloads in the HorizontalPodAutoscaler YAML configuration. You can create a configuration for each workload or group of workloads. Here is an example:
# autoscaling/v2 is the stable API; autoscaling/v2beta2 is no longer served as of Kubernetes 1.26
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  minReplicas: 2
  maxReplicas: 10
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
A few important points about this configuration:
You can use other metrics, such as memory utilization, instead of CPU. You can also define several metrics to represent application load, and the HPA algorithm adjusts the number of pods to satisfy the most demanding metric across all pods.
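A minimal sketch of an HPA that combines two metrics might look like the following. The workload name `my-app` and the 70% and 80% targets are illustrative assumptions, not recommendations:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app          # hypothetical workload name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # HPA computes a desired replica count for each metric
    # and applies the largest of them.
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # illustrative target
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # illustrative target
```

Because the largest computed replica count is used, a workload that is memory-bound at low CPU would still scale out under this configuration.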
Related content: Read our guide to Kubernetes Cluster Autoscaler
Although HPA is a powerful tool, it is not suitable for all use cases and does not solve all cluster resource problems. Here are a few examples of use cases that are less suitable for HPA:
The primary difference between HPA and VPA is the scaling method: HPA scales by adding or removing pods, while VPA scales by allocating additional CPU and memory resources to existing pod containers, or reducing the resources available to them. Another difference is that VPA only supports CPU and memory as scaling metrics, while HPA supports additional custom metrics.
HPA and VPA are not necessarily two competing options – they are often used together. This can help you achieve a good balance, optimally distributing workloads across available nodes, while fully utilizing the computing resources of each node.
However, you should be aware that HPA and VPA can conflict with each other – for example, if they both use memory as the scaling metric, they can try to scale workloads vertically and horizontally at the same time, which can have unpredictable consequences. To avoid such conflict, make sure that each mechanism uses different metrics. Typically, you will set VPA to scale based on CPU or memory, and use custom metrics for HPA.
Here are some of the ways you can ensure the most effective use of HPA in Kubernetes.
Start with CPU and memory utilization, as they are built-in and widely supported. To tailor HPA to your application’s needs, consider adding custom metrics that reflect your application’s workload more accurately. For example, web applications might benefit from metrics like HTTP request rates, response times, or the number of active sessions.
For data processing applications, metrics such as the length of the message queue or processing time per task might be more appropriate. Custom metrics can be integrated using tools like Prometheus together with an adapter (such as Prometheus Adapter) that exposes application metrics to HPA through the Kubernetes custom metrics API. Ensure that your monitoring system is reliable and that the metrics are collected and reported with low latency.
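As a sketch of what scaling on a custom metric can look like, the configuration below targets an average request rate per pod. The metric name `http_requests_per_second` and the target of 100 are hypothetical, and the example assumes a metrics adapter is already serving that metric through the custom metrics API:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                         # hypothetical workload name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second # hypothetical metric exposed by an adapter
        target:
          type: AverageValue
          averageValue: "100"            # illustrative: about 100 requests/s per pod
```

If the adapter does not expose the named metric, HPA cannot compute a replica count for it, so it is worth confirming the metric is visible before relying on it.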
Configuring the minimum and maximum number of replicas ensures that your application maintains a balance between performance and resource usage. The minimum number of replicas should be set to handle the baseline traffic and to provide fault tolerance. For example, if your application should always be highly available, set a higher minimum number of replicas.
The maximum number of replicas should be based on the maximum expected load and the capacity of your Kubernetes cluster. If you set the maximum too low, the application might not scale enough to handle peak traffic, leading to performance degradation. Setting it too high can exhaust cluster resources. Analyze historical traffic patterns and resource utilization to decide.
Scaling delays prevent HPA from reacting too quickly to short-term fluctuations, which would otherwise cause thrashing: unnecessary scaling actions, increased resource consumption, and instability. Older Kubernetes releases controlled these delays with the kube-controller-manager flags --horizontal-pod-autoscaler-upscale-delay and --horizontal-pod-autoscaler-downscale-delay. Current releases use the --horizontal-pod-autoscaler-downscale-stabilization flag and the per-HPA behavior field of the autoscaling/v2 API instead.
The upscale delay should be set to allow the system to confirm that increased load is sustained before adding new pods. The downscale delay should be long enough to ensure that a drop in metrics is not temporary, avoiding premature pod termination.
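In the autoscaling/v2 API, delays of this kind can be expressed per HPA through the `behavior` field. The following is a sketch only; the window lengths and policy values are illustrative assumptions and should be tuned to your workload:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                        # hypothetical workload name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # illustrative: wait for sustained load
      policies:
        - type: Pods
          value: 2                      # add at most 2 pods per period
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # illustrative: avoid premature scale-down
      policies:
        - type: Percent
          value: 50                     # remove at most 50% of pods per period
          periodSeconds: 60
```

A longer scale-down window than scale-up window is a common choice, since removing pods too early is usually more harmful than keeping them briefly.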
Monitoring requires continuous effort to ensure your HPA configuration remains effective. Use tools like Prometheus and Grafana to visualize performance metrics and identify trends. Regularly review and analyze these metrics to understand how your application behaves under different loads.
Adjust the target utilization thresholds based on this analysis. For example, if the current target CPU utilization is set at 70%, but you notice that performance issues arise when utilization exceeds 60%, lower the threshold to 60%. If you find that your application is underutilized, raising the threshold might reduce unnecessary scaling actions.
Predictive scaling involves anticipating future load based on historical data and known events. Tools like the Kubernetes Event-Driven Autoscaler (KEDA) allow you to scale applications based on event data from various sources, including message queues, databases, or custom metrics.
For example, if you know that your web application experiences a traffic spike every day at noon, predictive scaling can proactively increase the number of pods just before the spike occurs. This helps maintain performance and avoid the latency associated with reactive scaling.
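One way to approximate the noon example is a schedule-based trigger, sketched below with KEDA's cron scaler. The object name, time window, timezone, and replica count are all hypothetical. Note that KEDA creates and manages its own HPA for the target, so this would replace a hand-written HPA for the same Deployment rather than run alongside it:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-app-noon-scaler      # hypothetical name
spec:
  scaleTargetRef:
    name: my-app                # hypothetical Deployment name
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: cron
      metadata:
        timezone: Etc/UTC       # assumption: adjust to your users' timezone
        start: 45 11 * * *      # scale up shortly before the noon spike
        end: 0 13 * * *         # release the extra replicas afterwards
        desiredReplicas: "8"    # illustrative replica count during the window
```

This is schedule-driven rather than forecast-driven: it handles known recurring peaks, while unexpected load still relies on reactive scaling.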
Resource requests specify the minimum resources required for a pod, influencing how Kubernetes schedules the pod. Limits define the maximum resources a pod can use, preventing it from consuming more than its fair share and potentially impacting other pods.
To set accurate resource requests and limits, profile your application to understand its resource usage under different loads. Use this data to configure requests that reflect typical usage and limits that accommodate peak usage without causing resource contention. This ensures that HPA scales your application based on realistic resource needs.
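A minimal sketch of requests and limits on a Deployment is shown below. The image name and all resource values are placeholders to be replaced with figures from your own profiling. HPA utilization targets are calculated as a percentage of the requested amount, so an inaccurate request skews scaling decisions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                      # hypothetical workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: example.com/my-app:1.0   # placeholder image
          resources:
            requests:
              cpu: 250m             # illustrative: typical usage
              memory: 256Mi
            limits:
              cpu: 500m             # illustrative: headroom for peaks
              memory: 512Mi
```

With a 250m CPU request and a 70% utilization target, HPA would start adding pods once average usage per pod exceeds roughly 175m.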
The Cluster Autoscaler adjusts the number of nodes in your cluster based on resource requirements, adding nodes when pods cannot be scheduled due to resource constraints and removing nodes when they are underutilized. To integrate effectively, configure the Cluster Autoscaler to work alongside your HPA.
Ensure that the autoscaler settings, such as the minimum and maximum number of nodes, align with your cluster’s capacity and workload demands. This integration prevents situations where the HPA cannot scale pods due to insufficient node resources, maintaining application performance and availability.
Readiness probes and liveness probes help ensure that only healthy pods receive traffic and that unresponsive pods are restarted. Properly configured probes help maintain application stability during scaling operations.
Readiness probes determine if a pod is ready to handle requests. They should be configured to reflect the actual conditions required for your application to serve traffic, such as successful connections to dependent services or initialization of necessary resources. This ensures that new pods are not added to the load balancer until they are fully ready.
Liveness probes detect if a pod is still running and responsive. They should be configured to catch conditions where the pod is stuck or unresponsive, triggering a restart to restore functionality. Fine-tuning these probes helps avoid disruptions and ensures smooth scaling transitions.
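The two probe types can be sketched on a container as follows. The endpoint paths, port, and timing values are hypothetical and depend on what your application actually exposes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                      # hypothetical workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: example.com/my-app:1.0   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /ready          # hypothetical endpoint: dependencies reachable
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /healthz        # hypothetical endpoint: process responsive
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            failureThreshold: 3     # restart after 3 consecutive failures
```

Keeping the readiness check stricter than the liveness check means a pod that is temporarily unable to serve is removed from load balancing without being restarted.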
A common challenge with HPA is that it takes time to scale up a workload by adding another pod. Loads can sometimes change sharply, and during the time it takes to scale up, the existing pod can reach 100% utilization, resulting in service degradation and failures.
For example, consider a pod that can handle 800 requests per second at 80% CPU utilization, with HPA configured to scale up when the 80% CPU threshold is reached. Let’s say it takes 10 seconds for a new pod to start up.
If the load then grows by 100 requests per second every second, the pod will reach 100% utilization (about 1,000 requests per second) within 2 seconds, while the second pod needs 8 more seconds before it can start receiving requests.
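The arithmetic behind this example can be written out as follows, assuming the pod serves 800 requests per second at 80% CPU, that CPU utilization grows linearly with request rate, and that load rises by 100 requests per second every second:

```latex
\[
\text{capacity at } 100\% = \frac{800}{0.8} = 1000 \ \text{requests/s}
\]
\[
t_{\text{saturation}} = \frac{1000 - 800}{100} = 2 \ \text{s},
\qquad
t_{\text{gap}} = 10 - 2 = 8 \ \text{s}
\]
```

During those 8 seconds the existing pod is saturated, which is where the service degradation comes from.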
Possible solutions
When a workload experiences brief spikes in CPU utilization (or any other scaling metrics), you might expect that HPA will immediately spin up an additional pod. However, if the spikes are short enough, this will not happen.
To understand why, consider that:
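HPA acts on metrics that are averaged over a collection window, and the length of that window is set by the metrics server's --metric-resolution flag.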
For example, assume HPA is set to scale when CPU utilization exceeds 80%. If CPU utilization suddenly spikes to 90%, but this occurs for only 2 seconds out of a 30 second metric resolution window, and in the rest of the 30-second period utilization is 20%, the average utilization is:
(2 * 90% + 28 * 20%) / 30 ≈ 24.7%
When HPA polls for the CPU utilization metric, it will observe a value of roughly 25%, which is not even close to the scaling threshold of 80%. This means HPA will not scale, even though in reality the workload experienced high load.
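To get a feel for how long a spike must last before it becomes visible, the same averaging can be solved for the spike duration t. This sketch assumes the values from the example: a 30-second window, 90% utilization during the spike, 20% otherwise, and an 80% threshold:

```latex
\[
\frac{90\,t + 20\,(30 - t)}{30} \ge 80
\;\Longrightarrow\;
70\,t \ge 1800
\;\Longrightarrow\;
t \ge 25.7 \ \text{s}
\]
```

Under these assumptions, the spike would have to fill almost the entire window before the averaged metric crosses the threshold.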
Related content: Read our guide to Readiness Probes
In some cases, HPA might scale an application so much that it could consume almost all the resources in the cluster. You could set up HPA in combination with Cluster Autoscaler to automatically add more nodes to the cluster. However, this might sometimes get out of hand.
Consider these scenarios:
In these, and many similar scenarios, it is better not to scale the application beyond a certain limit. However, HPA does not know this and will continue to scale the application even when this does not make business sense.
The best solution is to limit the number of replicas that can be created by HPA. You can define this in the spec.maxReplicas field of the HPA configuration:
# autoscaling/v2 is the stable API; autoscaling/v2beta2 is no longer served as of Kubernetes 1.26
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  minReplicas: 2
  maxReplicas: 10
In this configuration, maxReplicas is set to 10. Calculate the maximum expected load of your application and set a realistic maximum scale, with some buffer for surprise peaks in traffic.
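One rough way to arrive at such a number is sketched below. All inputs are hypothetical: a peak of 6,000 requests per second, pods that each handle about 800 requests per second at the target utilization, and a 25% buffer:

```latex
\[
\text{pods at peak} = \left\lceil \frac{6000}{800} \right\rceil = \lceil 7.5 \rceil = 8
\]
\[
\text{maxReplicas} = \lceil 8 \times 1.25 \rceil = 10
\]
```

The result should also be checked against what the cluster, and any downstream dependencies such as databases, can actually sustain.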
Kubernetes troubleshooting relies on the ability to quickly contextualize the problem with what’s happening in the rest of the cluster. More often than not, you will be conducting your investigation during fires in production. The major challenge is correlating service-level incidents with other events happening in the underlying infrastructure.
Komodor can help with its ‘Node Status’ view, built to pinpoint correlations between service or deployment issues and changes in the underlying node infrastructure. With this view you can rapidly:
Beyond node error remediations, Komodor can help troubleshoot a variety of Kubernetes errors and issues. As the leading Continuous Kubernetes Reliability Platform, Komodor is designed to democratize K8s expertise across the organization and enable engineering teams to leverage its full value.
Komodor’s platform empowers developers to confidently monitor and troubleshoot their workloads while allowing cluster operators to enforce standardization and optimize performance. Specifically when working in a hybrid environment, Komodor reduces the complexity by providing a unified view of all your services and clusters.
By leveraging Komodor, companies of all sizes significantly improve reliability, productivity, and velocity. Or, to put it simply – Komodor helps you spend less time and resources on managing Kubernetes, and more time on innovating at scale.
If you are interested in checking out Komodor, use this link to sign up for a Free Trial.