The Kubernetes Controller Manager (also called kube-controller-manager) is a daemon that acts as a continuous control loop in a Kubernetes cluster. The controller monitors the current state of the cluster via calls made to the API Server, and changes the current state to match the desired state described in the cluster’s declarative configuration.
The Controller Manager does not directly modify resources in the Kubernetes cluster. Instead, it manages multiple controllers responsible for specific activities—including replication controllers, endpoint controllers, namespace controllers, and service account controllers.
This is part of our series of articles about Kubernetes troubleshooting.
The Kubernetes control plane includes a core component called kube-controller-manager. This component runs multiple controllers that maintain the desired state of the cluster. These controllers are packaged together in the kube-controller-manager daemon.
Kubernetes resources are defined by a manifest file written in YAML. When the manifest is deployed, an object is created that aims to reach the desired state within the cluster. From that point, the appropriate controller watches the object and updates the cluster’s existing state to match the desired state.
The controller adjusts resources in the cluster based on the existing nodes, the resources they have available, the currently running workloads, and the policies defined for their behavior.
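The reconciliation idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual kube-controller-manager code: the `ReplicaState` class and the `reconcile` and `run_until_synced` functions are hypothetical names, and the "cluster" is just an in-memory counter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaState:
    """Hypothetical stand-in for a workload object tracked by a controller."""
    desired_replicas: int
    current_replicas: int


def reconcile(state: ReplicaState) -> str:
    """Run one pass of the control loop, moving one step toward the desired state."""
    if state.current_replicas < state.desired_replicas:
        # A real controller would ask the API Server to create a pod here.
        state.current_replicas += 1
        return "scale-up"
    if state.current_replicas > state.desired_replicas:
        # A real controller would ask the API Server to delete a pod here.
        state.current_replicas -= 1
        return "scale-down"
    return "in-sync"


def run_until_synced(state: ReplicaState, max_passes: int = 100) -> List[str]:
    """Repeat reconcile() until the current state matches the desired state."""
    actions: List[str] = []
    for _ in range(max_passes):
        action = reconcile(state)
        actions.append(action)
        if action == "in-sync":
            break
    return actions
```

Calling `run_until_synced(ReplicaState(desired_replicas=3, current_replicas=1))` performs two scale-up passes and then reports that the state is in sync. A real controller never stops looping, because the desired or current state can change at any time.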
The controller has two main components: Informer (or Shared Informer) and Workqueue. It is useful to understand these implementation details in case you want to write your own custom controller, and to better understand the inner workings of the Kubernetes reconciliation mechanism.
All Kubernetes controllers need to monitor the state of an object, and send instructions to bring it closer to the desired state. In order to get information about the object, the controller needs to query the API Server. However, it is inefficient for the controller to continuously poll the API Server. Controllers have two strategies to avoid too many calls to the API Server:
This logic is packaged by a component called an Informer. Modern Kubernetes controllers more commonly use a SharedInformer, explained below.
A regular Informer creates a local cache of resources for use by a single controller. However, Kubernetes has many controllers, and some of them might oversee the same resources. This means the same resource might be handled by multiple controllers.
SharedInformer creates a single cache that can be shared by several controllers. Cached resources are not copied, which reduces memory overhead. Additionally, each SharedInformer creates only one watch on the upstream API Server, regardless of how many downstream consumers read its events, which reduces the load on the API Server.
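To make the sharing concrete, here is a small Python sketch of the pattern. It is an illustration under simplifying assumptions, not the client-go implementation: the `SharedInformerSketch` class, its method names, and the `upstream_watches` counter are all hypothetical.

```python
from typing import Callable, Dict, List

Handler = Callable[[str, dict], None]


class SharedInformerSketch:
    """Illustrative only: one cache and one upstream watch shared by many handlers."""

    def __init__(self) -> None:
        self.cache: Dict[str, dict] = {}    # single local store, keyed by "namespace/name"
        self.handlers: List[Handler] = []   # one entry per registered controller
        self.upstream_watches = 0           # counts watches opened against the API Server

    def add_event_handler(self, handler: Handler) -> None:
        # Registering another controller does not open another watch.
        self.handlers.append(handler)

    def start(self) -> None:
        # A single watch serves every registered handler.
        self.upstream_watches += 1

    def on_upstream_event(self, key: str, obj: dict) -> None:
        # Update the shared cache once, then pass the same object to all handlers.
        self.cache[key] = obj
        for handler in self.handlers:
            handler(key, obj)
```

In this sketch, two controllers that register handlers on the same informer receive the very same cached object, and only one watch is opened no matter how many handlers are registered.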
SharedInformer provides several convenient capabilities for controllers:
A SharedInformer relies on an external queuing system, called the Kubernetes Workqueue, because it cannot track each controller’s activity. Workqueue supports several types of queues, including:
Every time an object or resource changes, the resource event handler adds a key for the object, made up of its namespace and name, to the Workqueue. A controller worker then picks up the key from the queue and processes the change.
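The key-based queuing described above can be sketched as follows. This is a simplified, single-threaded illustration of a deduplicating queue, assuming keys of the form namespace/name; the `WorkqueueSketch` class and `object_key` helper are hypothetical names and do not reproduce the client-go API.

```python
from collections import deque
from typing import Deque, Optional, Set


def object_key(namespace: str, name: str) -> str:
    """Build the key that an event handler would add to the queue."""
    return f"{namespace}/{name}"


class WorkqueueSketch:
    """Illustrative queue that collapses duplicate keys."""

    def __init__(self) -> None:
        self._queue: Deque[str] = deque()
        self._dirty: Set[str] = set()        # keys waiting to be processed
        self._processing: Set[str] = set()   # keys a worker is currently handling

    def add(self, key: str) -> None:
        if key in self._dirty:
            return                           # already pending, so skip the duplicate
        self._dirty.add(key)
        if key not in self._processing:
            self._queue.append(key)

    def get(self) -> Optional[str]:
        if not self._queue:
            return None
        key = self._queue.popleft()
        self._dirty.discard(key)
        self._processing.add(key)
        return key

    def done(self, key: str) -> None:
        self._processing.discard(key)
        if key in self._dirty:               # changed again while being processed
            self._queue.append(key)

    def __len__(self) -> int:
        return len(self._queue)
```

Because only keys are queued, several rapid changes to the same object result in a single unit of work, and the worker reads the latest version of the object from the cache when it processes the key.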
Itiel Shwartz
Co-Founder & CTO
In my experience, here are tips that can help you better manage the Kubernetes Controller Manager:
Use monitoring tools to track the health and performance of controllers.
Adjust controller parameters based on workload requirements and performance data.
Extend Kubernetes functionality with custom controllers using Operator Framework.
Enable leader election to ensure high availability of controllers.
Enable detailed logging for controllers to facilitate troubleshooting.
By default, the kube-controller-manager exposes its metrics in a format that Prometheus can scrape. These metrics cover the requests made to the API and the workqueues. For example, run the following command from a pod with network access to the master nodes to show the monitoring metrics:
curl http://localhost:10252/metrics
The command returns output like this, displaying the metrics:
Because the endpoint already serves the metrics in a format Prometheus understands, a scraper can collect them directly from the endpoint without any additional processing.
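As a rough illustration of how simple this text format is to consume, the following Python sketch parses a few metric lines into a dictionary. The `SAMPLE` text is made up for the example and is not real output from a cluster, `parse_metrics` is a hypothetical helper, and the sketch assumes each sample line ends with its value and carries no trailing timestamp.

```python
from typing import Dict

# Made-up sample in the Prometheus text format, for illustration only.
SAMPLE = """\
# HELP workqueue_depth Current depth of workqueue
# TYPE workqueue_depth gauge
workqueue_depth{name="deployment"} 0
workqueue_depth{name="node"} 2
process_cpu_seconds_total 12.5
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each series (metric name plus labels) to its sample value."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples
```

In practice Prometheus does this parsing itself; the sketch only shows why no extra calculation is needed on the client side.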
To configure a Prometheus job and scrape the API’s endpoint:
1. Add the following job to the Prometheus scrape configuration so that it scrapes the endpoint:
- job_name: 'monitoring kubernetes-controller'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: demo_kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: demo_kubernetes_pod_name
2. On the master node, edit the manifest at /etc/kubernetes/manifests/kube-controller-manager.manifest and add the following under annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "10252"
Prometheus can return different metrics and run calculations on them. However, monitoring the kube-controller-manager means focusing on a few crucial metrics.
Number of kube-controller-manager Instances
This metric provides an overview of the health of the kube-controller-manager. Its value is expected to equal the number of master (control plane) nodes running the kube-controller-manager.
To find the number of kube-controller-manager instances:
Use the following PromQL query that returns a single stat graph:
sum(up{k8s_app="kube-controller-manager"})
The query counts the kube-controller-manager targets that Prometheus found and could scrape successfully. Alternatively, if you have low-level access to the nodes, you can check the process directly.
Workqueue metrics show whether the workqueue is facing obstacles or has trouble processing certain items. You can aggregate the values across all controllers, but the queues of individual controllers, such as the node controller or the AWS controller, also expose their own metrics.
Workqueue latency is the time the kube-controller-manager takes to act on the changes needed to maintain the desired state of the cluster. The workqueue_queue_duration_seconds histogram records how long an item waits in the queue before a worker picks it up.
To find the workqueue latency:
Use the following query to represent the workqueue latency through quantiles:
histogram_quantile(0.99, sum(rate(workqueue_queue_duration_seconds_bucket{k8s_app="kube-controller-manager"}[5m])) by (instance, name, le))
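To clarify what a quantile query over histogram buckets computes, here is a Python sketch of the general approach: find the bucket that contains the requested rank, then interpolate linearly inside it. It is a simplified approximation of the idea, not Prometheus source code, and it assumes cumulative bucket counts, at least two buckets, and a last bucket with an infinite upper bound.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Assumes 0 < q <= 1, at least two buckets, and a final +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan                   # no observations recorded
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    prev_count = buckets[index - 1][1] if index > 0 else 0.0
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)
```

The result is an estimate whose precision depends on the bucket boundaries, which is why a 99th percentile can never be reported more precisely than the width of the bucket it falls into.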
Workqueue rate is the number of required actions the kube-controller-manager takes per unit of time. An unusually high rate can indicate a problem on some of the cluster's nodes.
To find the workqueue rate:
Use the following query to retrieve the rate of the workqueue:
sum(rate(workqueue_adds_total{k8s_app="kube-controller-manager"}[5m])) by (instance, name)
Workqueue depth is the number of actions waiting in the queue for the kube-controller-manager to perform. Ideally, this value stays low.
To find the workqueue depth:
Use the following query to retrieve the depth of the workqueue:
sum(workqueue_depth{k8s_app="kube-controller-manager"}) by (instance, name)
API request metrics help verify that the connection with the API is functional. They also show how requests are being handled, so you can confirm that the API is responding to them adequately.
To find the API’s latency in resolving requests:
Use the following query to retrieve the API’s latency through quantiles:
histogram_quantile(0.99, sum(rate(rest_client_request_latency_seconds_bucket{k8s_app="kube-controller-manager"}[5m])) by (url, le))
To find the rate of requests to the API and any errors:
Use the following queries to retrieve the rate of requests sent to the API, grouped by response code class, and to see whether any errors occur:
sum(rate(rest_client_requests_total{k8s_app="kube-controller-manager",code=~"2.."}[5m]))
sum(rate(rest_client_requests_total{k8s_app="kube-controller-manager",code=~"3.."}[5m]))
sum(rate(rest_client_requests_total{k8s_app="kube-controller-manager",code=~"4.."}[5m]))
sum(rate(rest_client_requests_total{k8s_app="kube-controller-manager",code=~"5.."}[5m]))
CPU usage shows how much CPU time the kube-controller-manager is consuming.
To find the kube-controller-manager’s CPU usage:
rate(process_cpu_seconds_total{k8s_app="kube-controller-manager"}[5m])
Memory usage shows how much memory the kube-controller-manager is using.
To find the kube-controller-manager’s memory usage:
process_resident_memory_bytes{k8s_app="kube-controller-manager"}
The troubleshooting process in Kubernetes is complex and, without the right tools, can be stressful, ineffective and time-consuming. Some best practices can help minimize the chances of things breaking down, but eventually something will go wrong—simply because it can.
This is the reason why we created Komodor, a tool that helps dev and ops teams stop wasting their precious time looking for needles in (hay)stacks every time things go wrong.
Acting as a single source of truth (SSOT) for all of your k8s troubleshooting needs, Komodor offers:
If you are interested in checking out Komodor, use this link to sign up for a Free Trial.