Kubernetes troubleshooting is the process of identifying, diagnosing, and resolving issues in Kubernetes clusters, nodes, pods, or containers.
More broadly defined, Kubernetes troubleshooting also includes effective ongoing management of faults and taking measures to prevent issues in Kubernetes components.
Kubernetes troubleshooting can be very complex. This article will focus on:
CreateContainerConfigError
ImagePullBackOff
CrashLoopBackOff
Kubernetes Node Not Ready
This is part of an extensive series of guides about Kubernetes.
There are three aspects to effective troubleshooting in a Kubernetes cluster: understanding the problem, managing and remediating the problem, and preventing the problem from recurring.
In a Kubernetes environment, it can be very difficult to understand what happened and determine the root cause of the problem. This typically involves:
To achieve the above, teams typically use the following technologies:
In a microservices architecture, it is common for each component to be developed and managed by a separate team. Because production incidents often involve multiple components, collaboration is essential to remediate problems fast.
Once the issue is understood, there are three approaches to remediating it:
Successful teams make prevention their top priority. Over time, this will reduce the time invested in identifying and troubleshooting new issues. Preventing production issues in Kubernetes involves:
To achieve the above, teams commonly use the following technologies:
Itiel Shwartz
Co-Founder & CTO
In my experience, here are tips that can help you better troubleshoot Kubernetes:
Organize your clusters by namespaces to isolate different environments (e.g., dev, staging, prod) for more targeted troubleshooting.
Turn on Kubernetes audit logging to capture detailed events for security and debugging purposes.
Set resource quotas to prevent a single application from exhausting cluster resources, making resource contention issues easier to identify (see the sketch after these tips).
Implement automated health checks and alerts for node conditions such as disk pressure, memory pressure, and network availability.
Use operators to automate complex application lifecycles and maintain stateful applications more reliably.
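For example, here is a minimal sketch combining the namespace and resource quota tips above; the namespace name and quota values are illustrative placeholders, not values taken from this article:

# Create an isolated namespace and cap the total resources it can consume.
$ kubectl create namespace staging
$ kubectl create quota staging-quota -n staging --hard=requests.cpu=4,requests.memory=8Gi,limits.cpu=8,limits.memory=16Gi,pods=20
# Confirm the limits and current usage:
$ kubectl describe quota staging-quota -n staging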
Kubernetes is a complex system, and troubleshooting issues that occur somewhere in a Kubernetes cluster is just as complicated.
Even in a small, local Kubernetes cluster, it can be difficult to diagnose and resolve issues, because an issue can represent a problem in an individual container, in one or more pods, in a controller, a control plane component, or more than one of these.
In a large-scale production environment, these issues are exacerbated, due to the low level of visibility and a large number of moving parts. Teams must use multiple tools to gather the data required for troubleshooting and may have to use additional tools to diagnose issues they detect and resolve them.
To make matters worse, Kubernetes is often used to build microservices applications, in which each microservice is developed by a separate team. In other cases, there are DevOps and application development teams collaborating on the same Kubernetes cluster. This creates a lack of clarity about division of responsibility – if there is a problem with a pod, is that a DevOps problem, or something to be resolved by the relevant application team?
In short – Kubernetes troubleshooting can quickly become a mess, waste major resources and impact users and application functionality – unless teams closely coordinate and have the right tools available.
If you are experiencing one of these common Kubernetes errors, here’s a quick guide to identifying and resolving the problem:
This error is usually the result of a missing Secret or ConfigMap. Secrets are Kubernetes objects used to store sensitive information like database credentials. ConfigMaps store data as key-value pairs, and are typically used to hold configuration information used by multiple pods.
Run kubectl get pods.
Check the output to see if the pod’s status is CreateContainerConfigError:
$ kubectl get pods
NAME                 READY   STATUS                       RESTARTS   AGE
pod-missing-config   0/1     CreateContainerConfigError   0          1m23s
To get more information about the issue, run kubectl describe [name] and look for a message indicating which ConfigMap is missing:
$ kubectl describe pod pod-missing-config
Warning  Failed  34s (x6 over 1m45s)  kubelet  Error: configmap "configmap-3" not found
Now run this command to see if the ConfigMap exists in the cluster:
$ kubectl get configmap configmap-3
If the result is null, the ConfigMap is missing, and you need to create it. See the Kubernetes documentation to learn how to create a ConfigMap with the name requested by your pod.
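If the ConfigMap is indeed missing, you can create it yourself. A minimal sketch, assuming the pod only needs a ConfigMap named configmap-3 (from the example above) to exist; the key and value are placeholders, so if your pod references specific keys, use those instead:

$ kubectl create configmap configmap-3 --from-literal=some-key=some-value
# Alternatively, create it from an existing configuration file:
$ kubectl create configmap configmap-3 --from-file=./app-config.properties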
Make sure the ConfigMap is available by running kubectl get configmap [name] again. If you want to view the content of the ConfigMap in YAML format, add the flag -o yaml.
Once you have verified the ConfigMap exists, run kubectl get pods again, and verify the pod is in Running status:
$ kubectl get pods
NAME                 READY   STATUS    RESTARTS   AGE
pod-missing-config   0/1     Running   0          1m23s
This status means that a pod could not run because it attempted to pull a container image from a registry, and failed. The pod refuses to start because it cannot create one or more containers defined in its manifest.
Run the command kubectl get pods.
Check the output to see if the pod status is ImagePullBackOff or ErrImagePull:
$ kubectl get pods
NAME      READY   STATUS             RESTARTS   AGE
mypod-1   0/1     ImagePullBackOff   0          58s
Run the kubectl describe pod [name] command for the problematic pod.
The output of this command will indicate the root cause of the issue – for example, a misspelled image name or tag, an image that does not exist in the registry, or missing credentials for a private registry.
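A quick way to check the image reference outside of Kubernetes is to pull it manually from a machine that has access to the registry; the image name below is a placeholder:

$ docker pull myregistry.example.com/myapp:1.2.3
# A "manifest unknown" or "not found" error points to a wrong image name or tag;
# an "unauthorized" error points to missing or incorrect registry credentials
# (configure an imagePullSecret in the pod spec).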
This status means the containers in a pod are repeatedly crashing and being restarted. Common causes include an error in the application itself, a lack of sufficient resources on the node, or a failure to mount the requested volumes.
Run the command kubectl get pods.
Check the output to see if the pod status is CrashLoopBackOff:
$ kubectl get pods
NAME      READY   STATUS             RESTARTS   AGE
mypod-1   0/1     CrashLoopBackOff   0          58s
Run the kubectl describe pod [name] command for the problematic pod.
The output will help you identify the cause of the issue, such as an error in the application running in the container, missing configuration or dependencies, or a failing liveness probe.
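To see why the container keeps crashing, it usually helps to inspect its last terminated state and the logs of the previous run; a short sketch using the pod name from the example above:

$ kubectl get pod mypod-1 -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
# Shows the exit code and reason (for example Error or OOMKilled) of the last crash.
$ kubectl logs mypod-1 --previous
# Prints the logs written by the previous, crashed container instance.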
When a worker node shuts down or crashes, all stateful pods that reside on it become unavailable, and the node status appears as NotReady.
If a node has a NotReady status for over five minutes (by default), Kubernetes changes the status of pods scheduled on it to Unknown, and attempts to schedule them on another node, where they show the status ContainerCreating.
Run the command kubectl get nodes.
Check the output to see if the node status is NotReady:
NAME       STATUS     AGE   VERSION
mynode-1   NotReady   1h    v1.2.0
To check if pods scheduled on the failed node are being moved to other nodes, run the command kubectl get pods.
Check the output to see if a pod appears twice, on two different nodes, as follows:
NAME      READY   STATUS              RESTARTS   AGE   IP       NODE
mypod-1   1/1     Unknown             0          10m   [IP]     mynode-1
mypod-1   0/1     ContainerCreating   0          15s   [none]   mynode-2
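It is also worth checking why the node went NotReady in the first place; a small sketch using the node name from the example above:

$ kubectl describe node mynode-1
# Inspect the Conditions section (Ready, MemoryPressure, DiskPressure, PIDPressure,
# NetworkUnavailable) and the Events at the bottom to see whether the kubelet stopped
# reporting or the node ran out of a resource.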
If the failed node is able to recover or is rebooted by the user, the issue will resolve itself: once the failed node recovers and rejoins the cluster, the pod in Unknown status is removed, and its volumes are detached so the replacement pod can start normally on the new node.
If you have no time to wait, or the node does not recover, you’ll need to help Kubernetes reschedule the stateful pods on another, working node. There are two ways to achieve this:
Remove the failed node from the cluster, so its pods can be rescheduled, using kubectl delete node [name]
Directly force-delete the stuck pods, using kubectl delete pods [pod_name] --grace-period=0 --force -n [namespace]
Learn more about Node Not Ready issues in Kubernetes.
If you’re experiencing an issue with a Kubernetes pod, and you couldn’t find and quickly resolve the error in the section above, here is how to dig a bit deeper. The first step to diagnosing pod issues is running kubectl describe pod [name].
Here is example output of the describe pod command, provided in the Kubernetes documentation:
Name:           nginx-deployment-1006230814-6winp
Namespace:      default
Node:           kubernetes-node-wul5/10.240.0.9
Start Time:     Thu, 24 Mar 2016 01:39:49 +0000
Labels:         app=nginx,pod-template-hash=1006230814
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"nginx-deployment-1956810328","uid":"14e607e7-8ba1-11e7-b5cb-fa16" ...
Status:         Running
IP:             10.244.0.6
Controllers:    ReplicaSet/nginx-deployment-1006230814
Containers:
  nginx:
    Container ID:   docker://90315cc9f513c724e9957a4788d3e625a078de84750f244a40f97ae355eb1149
    Image:          nginx
    Image ID:       docker://6f62f48c4e55d700cf3eb1b5e33fa051802986b77b874cc351cce539e5163707
    Port:           80/TCP
    QoS Tier:
      cpu:      Guaranteed
      memory:   Guaranteed
    Limits:
      cpu:      500m
      memory:   128Mi
    Requests:
      memory:   128Mi
      cpu:      500m
    State:      Running
      Started:  Thu, 24 Mar 2016 01:39:51 +0000
    Ready:      True
    Restart Count:  0
    Environment:    [none]
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-5kdvl (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          True
  PodScheduled   True
Volumes:
  default-token-4bcbi:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-4bcbi
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  [none]
Tolerations:     [none]
Events:
  FirstSeen  LastSeen  Count  From                            SubobjectPath           Type    Reason     Message
  ---------  --------  -----  ----                            -------------           ------  ------     -------
  54s        54s       1      {default-scheduler }                                    Normal  Scheduled  Successfully assigned nginx-deployment-1006230814-6winp to kubernetes-node-wul5
  54s        54s       1      {kubelet kubernetes-node-wul5}  spec.containers{nginx}  Normal  Pulling    pulling image "nginx"
  53s        53s       1      {kubelet kubernetes-node-wul5}  spec.containers{nginx}  Normal  Pulled     Successfully pulled image "nginx"
  53s        53s       1      {kubelet kubernetes-node-wul5}  spec.containers{nginx}  Normal  Created    Created container with docker id 90315cc9f513
  53s        53s       1      {kubelet kubernetes-node-wul5}  spec.containers{nginx}  Normal  Started    Started container with docker id 90315cc9f513
The most important sections in the describe pod output are Name, Status, Containers, Containers:State, Volumes, and Events.
Continue debugging based on the pod state.
If a pod’s status is Pending for a while, it could mean that it cannot be scheduled onto a node. Look at the describe pod output, in the Events section. Try to identify messages that indicate why the pod could not be scheduled. For example:
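An illustrative example of such an event (the exact wording varies between Kubernetes versions and scheduler configurations, so treat this as a sketch rather than output captured for this article):

Type     Reason            Age   From               Message
----     ------            ----  ----               -------
Warning  FailedScheduling  25s   default-scheduler  0/3 nodes are available: 3 Insufficient cpu.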
If a pod’s status is Waiting, this means it is scheduled on a node, but unable to run. Look at the describe pod output, in the ‘Events’ section, and try to identify reasons the pod is not able to run.
Most often, this will be due to an error when fetching the image. If so, check that the image name is spelled correctly in the pod manifest, that the image has been pushed to the registry, and that the node is able to pull it (for example, by running docker pull for the image manually).
If a pod is not running as expected, there can be two common causes: an error in the pod manifest, or a mismatch between your local pod manifest and the manifest on the API server.
It is common to introduce errors into a pod description, for example by nesting sections incorrectly or mistyping a field name.
Try deleting the pod and recreating it with kubectl apply --validate -f mypod1.yaml
This command will give you an error like the following if you misspelled a field name in the pod manifest, for example if you wrote continers instead of containers:
46757 schema.go:126] unknown field: continers
46757 schema.go:129] this may be a false alarm, see https://github.com/kubernetes/kubernetes/issues/5786
pods/mypod1
It can happen that the pod manifest, as recorded by the Kubernetes API Server, is not the same as your local manifest—hence the unexpected behavior.
Run this command to retrieve the pod manifest from the API server and save it as a local YAML file:
kubectl get pods/[pod-name] -o yaml > apiserver-[pod-name].yaml
You will now have a local file called apiserver-[pod-name].yaml. Open it and compare it with your local YAML. There are three possible cases:
The manifest on the API server is identical to your local YAML, in which case the problem is not in the manifest.
The API server version has additional lines that do not appear in your local YAML. This is expected, because the API server adds default values to the pod spec.
Lines in your local YAML do not appear in the API server version. This indicates a problem with your pod spec, for example that the latest version was never applied.
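A quick way to spot such differences, assuming your original manifest is the mypod1.yaml file used earlier:

$ diff mypod1.yaml apiserver-[pod-name].yaml
# Lines prefixed with '>' exist only in the API server copy (usually harmless defaults
# added by Kubernetes); lines prefixed with '<' exist only in your local file and may
# mean the manifest was never applied.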
If you weren’t able to diagnose your pod issue using the methods above, there are several additional methods to perform deeper debugging of your pod:
You can retrieve logs for a malfunctioning container using this command:
kubectl logs [pod-name] -c [container-name]
If the container has crashed, you can use the --previous flag to retrieve its crash log, like so:
kubectl logs --previous [pod-name] -c [container-name]
Many container images contain debugging utilities; this is generally true for images built from Linux and Windows operating system base images. This allows you to run commands in a shell within the malfunctioning container, as follows:
kubectl exec [pod-name] -c [container-name] -- [your-shell-commands]
There are several cases in which you cannot use the kubectl exec command: for example, if the container has already crashed, or if its image is a minimal or distroless image that does not include a shell or any debugging utilities.
The solution, supported in Kubernetes v1.18 and later, is to run an “ephemeral container”. This is a temporary container that runs alongside your production container in the same pod and can share its process namespace, allowing you to run shell commands as if you were running them in the real container, even after it crashes.
Create an ephemeral container using kubectl debug -it [pod-name] --image=[image-name] --target=[container-name]
The --target flag is important because it places the ephemeral container in the process namespace of the container you are debugging, so you can see that container’s processes.
After running the debug command, kubectl will show a message with your ephemeral container name—take note of this name so you can work with the container:
Defaulting debug container name to debugger-8xzrl
You can now run kubectl exec on your new ephemeral container, and use it to debug your production container.
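For instance, a minimal sketch using the names from the example output above (debugger-8xzrl is whatever container name kubectl reported in your case):

$ kubectl exec -it [pod-name] -c debugger-8xzrl -- sh
# Opens a shell in the ephemeral container. Thanks to the shared process namespace you
# can see the target container's processes with ps, and reach its filesystem through
# /proc/[pid]/root.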
If none of these approaches work, you can create a special pod on the node, running in the host namespace with host privileges. This method is not recommended in production environments for security reasons.
Run a special debug pod on your node using kubectl debug node/[node-name] -it --image=[image-name]
After running the debug command, kubectl will show a message with your new debugging pod—take note of this name so you can work with it:
Creating debugging pod node-debugger-mynode-pdx84 with container debugger on node [node-name]
Note that the new pod runs a container in the host IPC, Network, and PID namespaces, and the node’s root filesystem is mounted at /host inside the container.
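From inside this debugging pod you can inspect the node directly; a short sketch, assuming the node writes kubelet logs to /var/log (systemd-based nodes keep them in the journal instead):

chroot /host                       # work as if you were logged in to the node itself
cat /host/var/log/kubelet.log      # or, without chroot, read node files through the /host mount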
When finished with the debugging pod, delete it using kubectl delete pod [debug-pod-name]
The first step to troubleshooting cluster issues is to get basic information on the Kubernetes worker nodes and the Services running on the cluster.
To see a list of worker nodes and their status, run kubectl get nodes --show-labels. The output will be something like this:
NAME      STATUS   ROLES    AGE   VERSION   LABELS
worker0   Ready    [none]   1d    v1.13.0   ...,kubernetes.io/hostname=worker0
worker1   Ready    [none]   1d    v1.13.0   ...,kubernetes.io/hostname=worker1
worker2   Ready    [none]   1d    v1.13.0   ...,kubernetes.io/hostname=worker2
To get information about Services running on the cluster, run:
kubectl cluster-info
The output will be something like this:
Kubernetes master is running at https://104.197.5.247
elasticsearch-logging is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/elasticsearch-logging/proxy
kibana-logging is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/kibana-logging/proxy
kube-dns is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/kube-dns/proxy
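If you need much more detail for offline analysis, kubectl can also dump cluster state to files; a brief sketch (the output directory is an arbitrary choice):

$ kubectl cluster-info dump --output-directory=/tmp/cluster-state
# Writes object state and pod logs (the current namespace plus kube-system by default;
# add --all-namespaces for everything) into the directory instead of printing to stdout.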
To diagnose deeper issues with nodes on your cluster, you will need access to logs on the nodes. The following list explains where to find the logs.
On the control plane (master) node:
/var/log/kube-apiserver.log – API Server, responsible for serving the API
/var/log/kube-scheduler.log – Scheduler, responsible for making scheduling decisions
/var/log/kube-controller-manager.log – Controller Manager, which runs the core controllers
On worker nodes:
/var/log/kubelet.log – kubelet, responsible for running containers on the node
/var/log/kube-proxy.log – kube-proxy, responsible for Service load balancing
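On nodes where these components run as systemd services, the logs may be in the systemd journal rather than in /var/log files; in that case you can read them with journalctl, for example:

# Run on the node itself, or from a node debugging pod after chroot /host:
journalctl -u kubelet --since "1 hour ago"
journalctl -u kube-proxy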
Let’s look at several common cluster failure scenarios, their impact, and how they can typically be resolved. This is not a complete guide to cluster troubleshooting, but can help you resolve the most common issues.
The troubleshooting process in Kubernetes is complex and, without the right tools, can be stressful, ineffective and time-consuming. Some best practices can help minimize the chances of things breaking down, but eventually, something will go wrong – simply because it can.
This is the reason why we created Komodor, a tool that helps dev and ops teams stop wasting their precious time looking for needles in (hay)stacks every time things go wrong.
Acting as a single source of truth (SSOT) for all of your k8s troubleshooting needs, Komodor offers:
If you are interested in checking out Komodor, use this link to sign up for a Free Trial.
There’s a lot more to learn about Kubernetes troubleshooting. Check out some of the most common errors, their causes, and how to fix them.
CrashLoopBackOff appears when a pod is constantly crashing in an endless loop in Kubernetes. Understand and learn how to quickly fix the CrashLoopBackOff error (diagnosis and resolution).
Read more: How to Fix CrashLoopBackOff Kubernetes Error
ImagePullBackOff / ErrImagePull error means that a pod cannot pull an image from a container registry. In this article, we walk through the steps you should take to troubleshoot the error.
Read more: How to Fix ErrImagePull and ImagePullBackoff
Kubernetes errors such as CreateContainerConfigError and CreateContainerError occur when a container is created in a pod and fails to enter the Running state. Learn more about these errors and how to fix them quickly.
Read more: How to Fix CreateContainerError & CreateContainerConfigError
Node Not Ready error indicates a machine in a K8s cluster that cannot run pods. This error is frequently caused by a lack of resources on the node, an issue with the kubelet, or a kube-proxy error.
Read more: How to Fix Kubernetes ‘Node Not Ready’ Error
OOMKilled (exit code 137) occurs when K8s pods are killed because they use more memory than their limits. OOM stands for “Out Of Memory”; the Linux kernel’s OOM Killer keeps track of how much memory each process uses and terminates processes when the system runs out of memory.
Read more: How to Fix OOMKilled Kubernetes Error (Exit Code 137)
This article discusses how to set up a reliable health check process and why health checks are essential for K8s troubleshooting. We go through the different types of health checks including kubelet, liveness, readiness probes, and more.
Read more: Kubernetes Health Checks: Everything You Need to Know
Together with our content partners, we have authored in-depth guides on several other topics that can also be useful as you explore the world of Kubernetes.
Read our guide to Kubernetes resource quota
Read our guide to change intelligence
Authored by Komodor
Authored by Granulate