An elite DevOps team from Komodor takes on the Klustered challenge; can they fix a maliciously broken Kubernetes cluster using only the Komodor platform? Let’s find out!
Watch Komodor’s co-founding CTO, Itiel Shwartz, and two engineers, Guy Menahem and Nir Shtein, leverage the Continuous Kubernetes Reliability Platform they’ve built to showcase how fast, effortless, and even fun troubleshooting can be!
Below is a lightly edited transcript of the video:
0:00 So, starting from left to right, starting with you, Guy, could you please say hello, introduce yourself, and share a little bit more if you wish?
0:07 Yeah, hey everyone, I’m Guy, I’m a solution architect at Komodor, I’ve been here for the last two years, very excited to join Klustered.
0:12 Cool. Everyone, my name is Nir Shtein, I’m a software engineer, I came a week after Guy.
0:25 And hey everyone, my name is Itiel Shwartz, I’m the CTO of Komodor. I watch Klustered, happy to be here.
0:30 All right, thank you all very much. So this is a special edition of Klustered, and I have broken the cluster personally, myself, with a handful of breaks, to hopefully
0:42 reveal and show us the power of Komodor as a tool for understanding and debugging problems within your cluster.
0:47 So with that being said, I will now let Guy share his screen and we will begin the debugging process. Best of luck, team Komodor.
0:54 Thank you. Let’s start. Cool, there you go.
1:00 So we got our cluster with Komodor installed. I will consider this cluster fixed if we can port-forward and visit the Drupal website. That is your mission.
1:13 Okay, sounds easy. Easy, right?
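If you want to follow along from a terminal rather than the Komodor UI, the "mission" boils down to a port-forward against the Drupal service. A minimal sketch, assuming a release and namespace both named `drupal` (neither is shown in the video, so adjust to your cluster):

```bash
# Find the Drupal service and forward a local port to it
# (the "drupal" release/namespace names are assumptions).
kubectl get svc -n drupal
kubectl port-forward -n drupal svc/drupal 8080:80

# Then open http://localhost:8080 in a browser.
```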
1:18 You know, the reason is it’s scheduling, yeah.
1:24 So we can go into the service in Komodor and see the timeline, right, like how did the service change over time. You can see that it’s currently broken, and you can try to understand why it is broken.
1:37 So why is it broken? I see there is a deploy at 15:25. Open it.
1:43 Do you mean... oh, to zoom in? Okay, there you go. I think it’s a great story of what happened.
1:49 Yeah, and we can see there was a change in GitHub. Yeah, there was, like, an "awesome update".
1:57 Was it an awesome update? We will see.
2:03 And here David removed some environment variables, which I think are crucial, and changed the claim name, which is even worse.
2:14 Cool, so let’s take a look at the problem, let’s see if they are connected to one another. Oh, so we can definitely see that the volume, the PVC, is not found. That’s definitely the problem with the pod. Let’s try to do the rollback, right? Yeah, let’s click rollback.
2:29 You don’t have permission to do it, because it’s your cluster, David.
2:35 It’s also a GitOps cluster, so the rollback wouldn’t actually work, it would be overwritten a few minutes later.
2:41 However, you’ve essentially discovered the second problem first. There is something more current wrong with this deployment.
2:48 Yes, we will have to fix this too. So, you mentioned it at the start, somebody said scheduling, right? Let’s look at the pods, Guy.
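The "PVC not found" symptom the team spots in the service timeline can also be confirmed from the CLI. A rough sketch, with the namespace and label selector assumed (they are not visible in the video):

```bash
# Inspect the failing pod and its events; a missing PersistentVolumeClaim
# usually surfaces as a "persistentvolumeclaim ... not found" event.
kubectl get pods -n drupal
kubectl describe pod -n drupal -l app=drupal | tail -n 20

# List PVCs to check whether the claim name referenced by the deployment actually exists.
kubectl get pvc -n drupal
```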
3:02 The pods, yeah, let’s look at them.
3:08 Pending. Other than the previous thing... scheduling... not ready. Yeah, let’s go back. We have information in here, by the way, on the first scheduling event. It would be much easier to see from here.
3:22 Ah, I think that one node is not available.
3:28 No, maybe let’s check the pods list first. Yes, let’s check the resources.
3:35 Nodes. And we see that there is a node that is not ready and has scheduling disabled.
3:41 Maybe let’s try to uncordon it? It’s cordoned, so yeah. I don’t think we will have it. Maybe let’s go to the terminal and try to fix it from there. You can add yourself as a user.
3:56 Try to add my... my person?
4:01 Okay, so we’re doing a switch just because of security permissions, and because David created the cluster and the account. Basically, we need to add another team member to the Komodor account. So David, if you can invite Nir, that would be great.
4:23 Yes. So let’s go to the nodes, let’s take the action and uncordon it.
4:37 And also let’s do the rollback, everything at the same time.
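Uncordoning from the Komodor node view is equivalent to the standard kubectl flow. A quick sketch, with the node name as a placeholder:

```bash
# A cordoned node shows SchedulingDisabled in its status column.
kubectl get nodes

# Uncordon it so new pods can be scheduled onto it again
# (replace <node-name> with the node shown above).
kubectl uncordon <node-name>
```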
4:44 So now Nir is rolling back the service. Oh, maybe the first deploy was not in Komodor, that’s why it didn’t show that.
4:58 Try it. Yeah? No?
5:04 Sorry. Yeah, we can also take the changes from GitHub, and we can also take the changes from Komodor, right?
5:10 Yeah, from GitHub. I think the service was deleted and then reinitiated, right? It’s generation one. It basically means that David played with the service, apparently, then he deleted the deployment, then he recreated the deployment with the failed configuration.
5:32 And that is the reason the rollback didn’t come up: for us it’s a new kind of service, it has a different unique ID, and this is the first generation of the new deployment. So we need to roll out this workload.
5:45 Yeah. Are the nodes okay now? Let’s check.
5:53 Yeah, they are okay. Do you want to check it on your screen? No, no, I’m asking. No, it’s still not ready. Why is that?
6:01 The container network is not ready.
6:16 It looks more like a k3s issue, maybe. Yeah, we need the CNI plugin to be ready. Can we check maybe on the service what is configured for the network plugin? Maybe let’s take a look in the YAML.
6:46 The network is unavailable. Okay, so what’s the reason? CNI is not initialized.
6:58 It’s k3s, right? No, these are bare-metal kubeadm clusters.
7:04 Bare metal?
7:16 Um, maybe those are the things... This is a 48-core, 64 GB RAM bare-metal machine.
7:24 Okay, okay. So you can have some fun with it, right?
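The "container network is not ready" / "CNI is not initialized" reason lives in the node's conditions and events. One way to surface it from the CLI, with the node name as a placeholder:

```bash
# The NotReady / NetworkUnavailable conditions and their reasons are part of the node status.
kubectl describe node <node-name> | grep -A 8 "Conditions:"

# The kubelet typically records the underlying message as well, e.g.
# "container runtime network not ready: ... cni plugin not initialized".
kubectl get events --all-namespaces --field-selector involvedObject.kind=Node
```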
7:32 Okay, so let’s recap where we are right now. Using Komodor, we explored the broken service and identified two bugs. One is that my "awesome update" in Git, which you were able to visualize and see right away, potentially broke the PVC claim name, which we’re going to come back to, I would assume.
7:45 I also highlighted that the cluster couldn’t schedule your pod, and you went to the node dashboard, identified that the node was cordoned, and were able to uncordon it directly from Komodor, moving us past the scheduling problem.
8:05 However, we now have the node being not ready because of a potential issue with the CNI networking plugin.
8:12 Yeah, yeah. We can see that there are, I don’t know, like four different plugins installed, CSI plugins that are installed...
8:24 We’re looking for CNI, not CSI. Sorry, sorry, sorry.
8:29 Maybe should I describe the node? What, sorry?
8:35 Describe it. It looks like we have the Cilium operator installed in this cluster. Yeah, it might be with the operator. There are maybe the CRDs of Cilium, there is the operator, maybe the Helm chart.
8:52 Yeah, it’s using Helm. Oh, there’s a failed deploy in here. Failed deploy, yeah.
8:59 So we can see there is "agent: true", the agent is not ready as well, minimum replicas unavailable. Yeah, but it’s just the operator itself.
9:19 On the deployment, let’s take a look.
9:24 Deployment, version one.
9:30 There is a spec affinity, a match label for the Cilium operator. It’s the pod template that doesn’t match the deployment, do you think? Is the relevant part in here, maybe?
9:42 Oh, it’s funny, it’s running and ready, but the node is not ready.
9:55 It’s always fun watching people fix a broken cluster.
10:02 Maybe look at the Helm dashboard. In the Helm dashboard we can see the current Cilium release, we can see quite a lot on this. What annotation does this... the... no, I think the "not found"... yeah, I just found it, exactly.
10:24 Maybe let’s check if the cluster role and cluster role binding exist in the cluster. Do you mean those resources which don’t exist? I think we need to create something, I’m not sure.
10:36 Maybe let’s check the log of the operator which is running. No, I think it’s the annotation, right? It doesn’t find the annotation on the node, that’s why it doesn’t install it. It’s running on the node.
10:54 So this may be a little bit harder to debug, because I think I found a bug in Komodor. But try comparing the values from release three to release two.
10:59 Okay, of Cilium, yeah.
11:13 Okay, so there are changes, but they don’t actually show up here.
11:20 We have only the third version, we don’t have the second. No, we do have it, we do, of the operator.
11:34 Ah, it only shows changes, it doesn’t show anything. Go to revision two and then compare with revision three.
11:53 Yeah, I don’t know why it’s not showing the change for me. It shows the change... great. Then go to the manifest and compare with version two. Here, when you do it here, here are the changes: you deleted the service account and all of those.
12:14 Guy, I will need to do the... I don’t have permission to... I just performed it.
12:19 Well, maybe it’s a permission thing.
12:25 Yeah, I think the watcher doesn’t have permission for that, maybe. Possible. Yeah, let’s see. Also here, it doesn’t have secret access.
12:32 Let’s also do a rollback to the... we can’t. We can’t roll back our own agent, we need access to the cluster.
12:47 So we will do a rollback to our agent, and then we’ll roll back the CNI. Okay.
12:53 Sorry, we should share my screen. I’ll stop here. Yes, so we found out that we are missing permissions inside Komodor, and it was installed without the possibility of a rollback.
13:27 Okay, that’s it. Okay? Yeah, that’s it.
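The comparison and rollback the team attempts from the Helm dashboard map to plain Helm commands. A sketch, assuming the release is named `cilium` in `kube-system` and that revisions 2 and 3 are the ones in question (both assumptions based on the video):

```bash
# See which revisions exist and which one failed.
helm history cilium -n kube-system

# Compare the values of two revisions to spot what changed.
diff <(helm get values cilium -n kube-system --revision 2) \
     <(helm get values cilium -n kube-system --revision 3)

# Roll the release back to the last known-good revision.
helm rollback cilium 2 -n kube-system
```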
13:55 Okay, cool, cool, cool. Now let’s go back and check if the node is ready now.
14:02 Yeah, it is ready. Okay, and now let’s check out our...
14:07 So before we continue: the upgrade to the Komodor agent that I did in Komodor, because it turned on the dashboard, but I see that it removed the secret access, which is probably why the values didn’t show.
14:19 Yeah, that’s the reason. Okay, I just wanted to make sure I understood what happened there. Okay, cool. And so now the node is ready, let’s go back to Services.
14:32 The only thing remaining is the version of the... okay, so we have a working node and you fixed the deploy, nice work.
14:39 Yeah, now we need to roll it back. We can’t roll it back, because... so we need to edit it, let’s edit it.
14:49 Yeah, I think I need to show it, because remember this is a GitOps pipeline, so you might want me to just push a fix, if you can tell me what you want that fix to be.
14:54 So, revert it. Yeah, let’s just check out the latest...
15:02 Oh, I don’t know how to fix it. I mean, I just did "awesome updates". You don’t need to tell me how to fix it, just revert your bad code.
15:18 Let’s just do a git checkout to this revision. If nothing else changed in between, then that’s probably the easiest solution. Check out with the ref before the change.
15:30 Are you doing the... I have pushed an update to Git.
15:36 I’m sharing, I’m sharing, I’m sharing. Do you have a pipeline that knows how to deploy it automatically? Yes, Flux CD is running in the cluster, it will detect this change and push it out. We can speed up the process, and I will do so now just so it’s a bit quicker.
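"Speeding up the process" on a Flux-managed cluster is usually just a forced reconciliation. A sketch, assuming the default `flux-system` object names (not confirmed in the video):

```bash
# Ask the source-controller to fetch the latest commit immediately,
# then force the kustomization to apply it without waiting for the sync interval.
flux reconcile source git flux-system
flux reconcile kustomization flux-system --with-source
```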
15:53 Yeah, so what we can see, on Nir’s screen there, is that the PVC changed, and we’ve got some environment variables which might be missing. And what David changed, yeah, it’s only the PVC, so maybe we still miss those.
16:15 Yeah, so let’s maybe wait for the rollout to happen. Yeah, we should see it in Komodor once it happens. You can take a look at the workloads to see if it’s still... yeah, but it’s the previous one, it wasn’t yet. So we’re looking for the new one.
16:41 So, I did push the update, however our GitOps pipeline is broken due to the fourth break in the cluster, so good luck.
16:47 So there’s another break. Maybe I’ll go right to... yeah, let’s check Argo... is there Argo? Flux, sorry. Source controller, notification controller, all of them look healthy.
17:09 What do we check? Sorry. Yeah, but maybe it’s misconfigured or something like that. Seems like Flux is working fine. Let’s check maybe the logs of one of the controllers or some other service, the source controller.
17:27 The log messages look good. Maybe... is it updated by the source controller? I think there is still a problem, one of the pods is unhealthy in the source controller.
17:46 Yeah, the Cilium operator is pending scheduling because it didn’t match pod affinity rules. If you go to the workload and click on the operator...
18:03 Okay, that’s just because when you did the rollback I set the replicas to one, because we are a single-node cluster, so you can ignore that pending pod.
18:11 No, take the first one. No, he’s saying it’s not the affinity. No, no, it’s like you said, in the logs of the source controller.
18:19 Yeah, it was there... look, there’s a message "stored artifact for revision", let me go back... and then "garbage collected one artifact".
18:33 Why did it garbage collect it? And then a lot of changes... but why did it garbage collect one artifact? Maybe it’s related to that, I don’t know.
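Digging through the Flux controllers' logs, as the team does here, looks roughly like this from the CLI (the `flux-system` namespace is the default and an assumption; yours may differ):

```bash
# Check the health of the Flux controllers themselves.
kubectl get pods -n flux-system

# The source-controller logs show "stored artifact for revision ..." and
# garbage-collection messages; the kustomize-controller logs show apply failures.
kubectl logs -n flux-system deploy/source-controller --tail=50
kubectl logs -n flux-system deploy/kustomize-controller --tail=50
```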
18:46 Yeah, this is the change, this is what you mean, right? Yeah, and then one afterward: "remove typo in PVC name". Yeah, this is the commit.
19:10 Question: but what does it mean? Let’s see if we got any warnings in here, or maybe...
19:24 So what happened is, at one point it found out that there was a change, but for some reason it garbage collected it. We need to change something in Flux.
19:44 Yeah, let’s check the configuration, maybe it’s something about this configuration change.
19:50 Yeah, by the way, in the kustomize controller it always fails; the Flux CD name is changed from system to... and what is the name in the log?
20:07 Saw that? Yeah, yeah. Okay, so your rollback for Cilium actually fixed this problem, but there’s a 10-minute sync time on the kustomization, so I’ve just encouraged it to run again.
20:21 So we don’t need to do anything, as long as this kustomization runs? No, it’s still failing, the networking in the cluster is not working. Yeah, I don’t know if your rollback for Cilium fixed the problem.
20:34 I think the rollback of Cilium didn’t. No, if you look at the logs of the kustomize controller there are really bad logs there, and it says that it failed on an HTTP call to a webhook. Let me just show that so everyone can see.
20:50 Yeah, your kustomization failed after some seconds... it has the cnpg service... what is its name?
21:07 The cnpg thing is, I think, the network... what is this service, the cnpg? Yeah, there is one thing here. I’m looking at the logs of the cnpg. Is it a pod? There is a pod, but the latest message is "periodic TLS certificate maintenance", which I don’t really know...
21:51 It doesn’t seem related to the relevant service, basically.
21:58 Yeah. So let me give you context on that Cilium break, right, because you did a rollback but you didn’t really identify what the problem was and what changed, and I don’t want us to debug something that you can’t have visibility into right now because of that secret values thing.
22:11 So in the Cilium Helm chart, what I did was disable the agent, which is definitely rolled back, because we can see the agent is now deployed next to the operator. However, I also disabled the eBPF kube-proxy replacement, and you may notice there’s no kube-proxy in this cluster.
22:26 So in the interest of not debugging something that we’re not entirely sure has been fixed or not, I’m going to redeploy Cilium right now and assume the rollback hopefully fixed it properly. And if we still have an issue, then I’m debugging with you, because I’m not really sure what the problem will be after that.
22:46 Let’s... no, maybe it’s worse. It’s not that.
22:55 Okay, so my update for Cilium has triggered a redeploy of Cilium, so the config map definitely changed, so we may be moving into a better situation.
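For context on what David describes, the state of the Cilium agent DaemonSet and the (intentionally absent) kube-proxy can be checked directly. A sketch, assuming Cilium is installed with the standard chart object names in `kube-system`:

```bash
# The Cilium agent runs as a DaemonSet next to the operator; with the agent
# disabled in the Helm values it simply would not exist.
kubectl get daemonset -n kube-system cilium
kubectl get deployment -n kube-system cilium-operator

# With eBPF kube-proxy replacement enabled there is deliberately no kube-proxy DaemonSet.
kubectl get daemonset -n kube-system kube-proxy
```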
23:19 Yeah, maybe delete the latest Cilium operator. Oh, who can delete things?
23:25 Delete the Cilium operator. Okay, the previous one, yeah.
23:38 The operator... wait a sec, the one that is pending? No, not that one, the other one. What will happen?
23:43 Yeah, so go to the Cilium operator... are you sure? Yeah, I’m going to delete the old Cilium operator.
23:56 Oh, that’s a bold move, I like it. Yeah, we’re not playing around here, you know.
24:02 So now the new version is running, and should... or we won’t have anyone there running, right?
24:14 It seems like it passed scheduling. Hey, that worked! Did it work? Yeah, yeah. Well, we had no doubts about it.
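The "bold move" here is just deleting the stale operator pod so its Deployment recreates it with the fixed configuration. A CLI sketch with an assumed pod name placeholder (the label may vary by chart version):

```bash
# List the cilium-operator pods, then delete the stale one;
# the Deployment's ReplicaSet will spin up a replacement.
kubectl get pods -n kube-system -l name=cilium-operator
kubectl delete pod -n kube-system <old-cilium-operator-pod>
```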
24:32 Seems like the new Cilium operator doesn’t... I think that’s okay, because it has two replicas, but now it’s a new one that is running. Great, so now I can scale it to one.
24:49 To keep issues away, you know. I’m scaling the Cilium one to one.
24:56 No, I think now it’s okay. Now let’s read the logs of the Flux thingy there, the kustomize one, I think, right? Or the source one.
25:08 I think it’s just syncing. It is, yeah, you see, it’s syncing. Well, let’s see it.
25:15 When in doubt, deleting Cilium operators fixes everything.
25:21 Oh, now it’s healthy, look at that.
25:27 Okay, let me share my screen and I’ll test the website for you. Right, one moment.
25:38 Yeah, and you understand, all you get is Drupal working. That’s like the best scenario.
25:44 Drupal is running. We have a problem with our database configuration, but maybe we don’t need it. So in the interest of testing we can port-forward.
26:09 Okay, so it’s almost working. Let’s see if we can actually open it in a browser.
26:22 Don’t be too happy.
26:30 You try to save it, now it’s going to try to use the database.
26:35 So this shouldn’t actually be needed, but the init script is unable to run for the same reason that this command will fail.
26:42 Oh no, our Drupal instance is unable to communicate with the Postgres database. Back over to you, and this is the last break.
26:54 Maybe the environment variables, right? It’s going to time out. It cannot... Drupal cannot speak to Postgres. There we go: temporary failure in DNS resolution.
27:05 Yeah, back to you. Last break. There we go, so it cannot resolve the database.
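The "temporary failure in DNS resolution" error can be reproduced from inside the cluster with a throwaway pod. A quick sketch; the image, service, and namespace names are assumptions:

```bash
# Try to resolve the Postgres service name from inside the cluster;
# if DNS traffic is blocked by a NetworkPolicy, this will time out.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup <postgres-service>.<namespace>.svc.cluster.local
```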
27:12 Okay, so let’s check the events of the... everything... NetworkPolicy. Maybe... you did a lot of network policy changes. Indeed. Why did you do it?
27:31 In the events you can see the policy changes.
27:42 And let’s see the network policy change.
27:54 So we saw that there are a lot of network policy changes, and it looks like someone changed the "Untitled policy". Yeah, there was a policy that prevents us from executing requests in the cluster. There is a policy of type ingress, so let’s try to take an action.
28:15 And, I mean, what I love about Komodor here is the event log as a gold mine of information, and you can see this network policy was created in the last 24 hours. It’s obviously well intended, but, you know, mistakes are easy to make in Kubernetes. Very easy.
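To see the offending policy from the CLI, you can list NetworkPolicies and look for a recently created one whose rules block DNS. A sketch; the policy name and namespace are placeholders:

```bash
# List all NetworkPolicies and dump the suspicious one.
kubectl get networkpolicy -A
kubectl get networkpolicy <policy-name> -n <namespace> -o yaml

# Look at spec.policyTypes and the ingress/egress rules: a policy that selects
# the DNS pods (or the app pods) without allowing UDP/TCP port 53 traffic
# will break name resolution for Drupal.
```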
28:34 All right, if you can stop sharing your screen, I will give the application another spin. I think we should be sitting pretty now, because I still have my port-forward running.
28:48 If we remove the install script... yeah, we’re holding the view... and if we make sure... okay, it completed 16 seconds ago, the database is now running.
29:03 Oh, I shouldn’t have to do this, but we’ll run through it anyway. That’s it. Woo!
29:16 Well done, you fixed all the breaks on the cluster, and Drupal is now working as intended.
29:29 So, a small recap and then I’ll let you get back to your day. That was a whole lot of fun for me. I actually found it really difficult to break the developer-facing, consumer API of Kubernetes in a way that Komodor couldn’t show right up front what the problem was. With the Git integration, the diffs, the Helm charts, the node information, even revealing all the labels and annotations, everything was just there in front of me, and I think that’s just super powerful for people who have to operate Kubernetes.
29:55 So I’ll thank you all for your work, it made it harder to break, but I hope you enjoyed each of the breaks that were presented to you. And, yeah, any final remarks from anyone?
30:06 No, it was super fun.