It’s 3 AM. You’re the developer on call and you get woken up by an alert. It’s the third time you’re on call this month, and even though you’ve been able to manage all the alerts that have come your way so far, it’s still really stressful. All you want is to deal with it quickly and get back to sleep. This time, the alert is for an SQS issue. It’s not your expertise, but you think you can handle it. You open Git in one tab, Datadog in another, and Slack in a third, trying to figure out who released what, and when. Time passes and you realize you have zero context for what might have caused the issue. Stressed, frustrated, and annoyed, you wake up the next senior person on call. More time passes, and the two of you still can’t figure it out.
You ultimately realize you need to wake up the one person in the organization who knows everything and anything about the system. In about 10 minutes, he solves the problem. Three hours have passed since the alert. Tired and embarrassed, you go back to sleep.
I’m sure you’re familiar with this kind of story. I am. It’s a painful, common experience for many developers.
On-call duty is moving from ops/SRE teams to developers. Most developers, and even SREs, don’t have the right context or tools to troubleshoot. Yet they’re expected to provide 24/7 coverage for systems and respond quickly and effectively in real time, even though it’s not within their expertise and not what they were hired to do.
Today’s systems are distributed, complex, and changing rapidly. The information needed to operate them effectively is scattered across different tools and teams that all need to coordinate harmoniously, and that affect each other when they don’t. With the number of alerts tripling over the last three years, troubleshooting has become a chaotic process that wastes precious time and requires a deep understanding of a system’s dependencies, activities, and metrics – an understanding that only a few, highly paid people in any given organization have.
Troubleshooting becomes mission impossible, especially when 85% of incidents can be traced to system changes, according to a recent Gartner report.
When it comes to solving real issues, you need context across the entire system.
The SRE who ultimately solved the problem described above wasn’t a better developer than the others who spent hours working on it – he was simply able to check the usual suspects quickly, as he had seen the issue before. He was one of the few people in the organization who was familiar with the system’s structure.
It’s possible for a developer to be as effective as someone who has context over the system – companies can provide better training and devote more energy to supporting their employees on call. Some companies do go down this route and focus on improving the process side of being on call. With enough training, you will get good outcomes. However, it requires a tremendous amount of resources and time – for developers and for the company.
Other companies choose to invest resources in on-call tools. They develop enhanced dashboards, sophisticated alerts, detailed playbooks, and lengthy post-mortem procedures. Identifying and implementing appropriate tools can be a full-time job, as it takes constant work to keep on-call monitoring under control. But it’s not so simple – there are many black holes in a system, and it can be difficult to ascertain correlations between different tools. Since there are typically very few people with complete knowledge and an accurate overview of how a system is built, it’s hard to set up the monitoring tools effectively. An alert in one service can actually originate in another service, and it can be difficult to trace an alert back to its origin. Implementing the appropriate tools requires a comprehensive, detailed overview of the entire system. In reality, many companies invest in neither the training nor the tools needed to solve the on-call problem.
So, what would be helpful?
The first step is to admit we have a problem. The tech sector needs to acknowledge that the current way of dealing with on-call is not working. It’s ineffective, broken, and causes a lot of anxiety and stress for developers asked to take on the burden of making sure everything is OK. With nearly 35% of DevOps engineers leaving their jobs due to burnout, this is a real problem.
When it comes to alerts, the expectation is that in the middle of the night you solve the problem as fast as possible, and that the next day you make sure it never happens again. This impacts developers, managers, and the entire development team. We know the current way of troubleshooting is not working, and that there’s a better way to handle the problem.
That’s why we started Komodor. Stay tuned for more news from us.