How to Ensure Continuous Kubernetes Reliability in 2024?

Marino Wijay
Solutions Engineer at Komodor & CNCF Ambassador

Hi everyone, and welcome to our monthly webinar. Today we’re going to be discussing continuous Kubernetes reliability, what this concept means, and how to achieve it using the right tooling and methodologies. To teach us all of that, we have the one and only Marino Wijay, a Solutions Architect at Komodor, Co-organizer of KubeHuddle, founder of EmpathyOps, CNCF Ambassador, and an awesome human being and dear friend.

I’m Udi, this is Nikki, and before Marino takes us on this reliability adventure, just a quick note: if anyone has questions throughout the session, please leave them in the chat, and we’ll address them either mid-session or at the end. We’ll also run a few quick polls along the way; they’re super easy, and Marino will walk you through them, but be prepared. Finally, we will have a hands-on demo of Komodor, and you are all welcome to follow along if you want to start using Komodor to optimize reliability.

Without further ado, the man of the hour himself, Marino, take it away.

Thank you, Nikki and Udi. I really appreciate you both setting this up. It takes a lot of work to run any kind of event—coordination, making people aware, and just being very continuous about it. Massive appreciation for you both, round of applause for them if you’re in the audience. Thank you, thank you, thank you.

Well, hello everyone, thank you for coming out to this webinar. My name is Marino Wijay. I am a Solutions Architect here at Komodor, and we’re going to talk a little bit about continuous reliability. I’m going to go ahead and share my screen. As you are all just listening in, feel free to share some of the challenges that you’ve been facing when it comes to reliability in your own environments. If you have any questions about Kubernetes or the tools you’ve been using, feel free to ask, and we’ll be happy to answer them.

Again, my name is Marino Wijay, part of the Komodor team here, and an organizer for KubeHuddle Toronto. Also, a CNCF Ambassador, and I’ve been part of the CNCF ecosystem for quite a while. I’ve contributed to various projects and have been involved with Kubernetes from way back. I love working with the community as much as I possibly can.

Let’s talk a little bit about reliability. To set the context and agenda here, a conversation I had a while back brought up the idea that load balancing seemed to solve all reliability challenges, or at least a good chunk of them. That’s partially true, depending on how you look at it. With load balancers, we have a system that pools resources together and distributes incoming requests across them, so that no single entity is responsible for providing the response.

The problem with this model is that it doesn’t address all aspects of reliability. When I posted something on social media about this, I got a response from a good friend, Dan Finneran from Isovalent (now part of Cisco), claiming that load balancing is the way. It’s funny because he created kube-vip, a technology used in many environments to provide a highly available Kubernetes control plane and load balancer services.

Before we dig into it, I want to ask a few questions and do a quick survey. I want you all to participate by heading over to Slido: scan that QR code there, and I’ll look at the responses. The first question is, what does reliability mean to you? Feel free to scan that QR code and add your words. Think of the first thing that comes to mind regarding reliability and incidents. Just throw it up there, and I’ll give us a couple of minutes to see some words.

I’ll even participate too. Let’s scan that QR code there if you can see me. My word is uptime. I’m going to add another response: Disaster Recovery. Okay, let’s see. Feel free to add your words, don’t be shy. Add more words, it’s totally fine.

Alright, on to the next one. Well, we have one more. I’ll show it later on. I love that uptime is the key word. We want to make sure our systems are always available and that we can be confident they will provide that uptime. There are many considerations when it comes to achieving that level of reliability, so let’s dig into it. The list I’m about to show you is not exhaustive; it’s just some of the things I’ve seen over the years that add to the reliability metric.

Let’s start with right-sizing our workloads. Many years ago, back in my Ops days, developers wanted to deploy and test applications, so they would request a system—a server, Windows, Linux, whatever. In 2008, it was Windows. It wasn’t easy to hand over the right-sized system to a developer because I’d have to procure, ship, image, and set it up. The problem was they wouldn’t always use the server fully, leading to idle capacity.

Today, with Kubernetes, we deploy clusters, pick namespaces, and deploy workloads, but we’re not truly right-sizing them. We need to revisit that and think about right-sizing to avoid wasting resources.
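As a rough illustration of what right-sizing looks like in a manifest, here is a minimal sketch assuming a hypothetical deployment; the request and limit values are placeholders that would come from your own observed usage, not recommendations:

```yaml
# Hypothetical Deployment snippet: requests reflect typical observed usage,
# limits cap worst-case consumption so one workload can't starve the node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api            # example name, not from the webinar
spec:
  replicas: 2
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: checkout-api
          image: example.registry.local/checkout-api:1.4.2
          resources:
            requests:
              cpu: 250m         # what the scheduler reserves for the pod
              memory: 256Mi
            limits:
              cpu: 500m         # throttled above this
              memory: 512Mi     # OOM-killed above this
```

Recommendations for those numbers can come from your monitoring stack, or from something like the Vertical Pod Autoscaler running in recommendation-only mode.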

Policies and guardrails are another consideration. It’s not just about creating a firewall policy. For example, disabling root access to your nodes across the cluster can prevent malicious workloads from causing a denial-of-service attack. Role-based access control is another way to put guardrails in place, by limiting access to only what is necessary.
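To make guardrails a bit more concrete, here is a minimal sketch showing two common ones, a workload-level policy that refuses to run as root and a namespaced read-only role; the names and namespace are hypothetical, not from the talk:

```yaml
# Guardrail 1: refuse to run this container as root.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true          # kubelet rejects the pod if the image runs as UID 0
  containers:
    - name: demo-app
      image: example.registry.local/demo-app:1.0.0
      securityContext:
        allowPrivilegeEscalation: false
---
# Guardrail 2: least-privilege RBAC - read-only access to pods in one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: team-a
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
```

Admission controls such as Pod Security Admission, or policy engines like Kyverno and OPA Gatekeeper, can enforce rules like these cluster-wide.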

Reducing service latency is crucial. Back in 2008, we didn’t have fast circuits. 100 Mbps to your office was a luxury. Today, we have tools like WAN optimizers and high-speed circuits to reduce latency. Right-sizing workloads and tools like service mesh can help reduce service latency in Kubernetes.

Reducing Mean Time to Recovery (MTTR) involves minimizing tooling and login fatigue. Consolidating platforms and reducing the number of tools to log into can significantly help.

Being proactive with maintenance is essential. We need to maintain a compatibility matrix, manage version drift, and ensure systems are in lockstep to prevent issues during upgrades.

DNS is a critical component of reliability. It’s tied to your network, infrastructure, and load balancers. DNS failures can cause significant issues, so it must be managed carefully.

Retry resiliency and service invocation are important. Applications need to handle failures and retries. Service meshes like Istio and API gateways can help manage this at the service level.
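As one example of pushing retry logic down to the mesh layer instead of into every application, an Istio VirtualService can declare a retry policy; this is a sketch with a hypothetical service host and timing values:

```yaml
# Hypothetical Istio VirtualService: retry failed calls to the orders service
# instead of forcing every client to implement its own retry loop.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders
spec:
  hosts:
    - orders.default.svc.cluster.local
  http:
    - route:
        - destination:
            host: orders.default.svc.cluster.local
      timeout: 5s
      retries:
        attempts: 3              # total retry attempts per request
        perTryTimeout: 2s        # give up on a single attempt after 2s
        retryOn: 5xx,connect-failure,reset
```

API gateways typically expose similar retry and timeout settings for traffic entering the cluster.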

Consolidating platforms is key. We need a platform that brings together the Kubernetes dashboards we rely on and adds reliability capabilities on top. A single platform makes it simpler to pinpoint issues quickly and efficiently.

Let’s move to another survey. How many reliability incidents have you had in the last year? It could be related to Kubernetes or something else. Hit that QR code and share your experiences.

Let’s count down: 10, 9, 8, 7, 6, 5, 4, 3, 2, 1. About 70% of you have had one to five incidents in the last year. I don’t believe you. Let’s chat afterward, feel free to reach out and we can discuss more.

Many reliability issues are tied to DNS. Some of the biggest outages across the global internet have been DNS-related.

I’ve got some more slides to show and then a brief demo, followed by Q&A. Let’s talk about where we’ve gone with Kubernetes. We’ve gone beyond deploying simple clusters. Kubernetes is used at scale across data centers, the edge, and the cloud. There are so many moving pieces, and we need to accommodate for them.

Let’s pull up another survey. After an outage occurs, what do you feel? What’s going through your mind? Add your feelings. For me, it’s defeated. Sometimes I feel defeated when something happens. It makes you think about what could have been done to prevent it and how to be more proactive.

To handle this, we need a platform that ties all these pieces together, correlates data, and helps us manage reliability. The Komodor platform does just that. It helps with troubleshooting, ensures reliability, and simplifies the Kubernetes experience.

Let’s jump into the demo. You’ve onboarded your clusters into Komodor, and now you see all these services. The workspace feature helps you scope down services, so teams focus on their specific areas. You can see issues, optimization scores, and more. The events timeline lets you go back in time to see what happened during incidents.

As a developer, you can see logs, pod details, and service dependencies. Komodor makes it easier to troubleshoot and solve issues, reducing the cognitive load on operational teams. This empowerment leads to fewer tickets and more focus on innovation.

We also need to think about upgrades. When a Kubernetes version reaches end-of-life, we must upgrade and update any deprecated APIs and resources along with it. Komodor helps you stay proactive and maintain reliability.
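As a small example of what updating those APIs looks like in practice (the job name and image here are hypothetical): CronJobs written against batch/v1beta1, which was removed in Kubernetes 1.25, have to be re-applied under batch/v1:

```yaml
# Before (removed in Kubernetes 1.25):
#   apiVersion: batch/v1beta1
# After (stable since 1.21):
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report          # hypothetical job name
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: example.registry.local/report:2.0.0
```

Open-source scanners such as Pluto can help find manifests that still reference deprecated or removed APIs before an upgrade.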

In summary, continuous Kubernetes reliability involves right-sizing workloads, monitoring them, managing version drift, and consolidating platforms. Komodor provides the tools and platform to achieve this, enabling both developers and operators.

Back to the presentation, let’s check for any questions. Does the free tier of Komodor include all the mentioned features? Yes, the free tier includes many of them. Visit komodor.com, sign up for free, and start exploring. You get a lot of functionality to test and improve your Kubernetes reliability.

Another question, do you recommend using open-source projects in combination with Komodor? Absolutely, there’s no one-size-fits-all. Open-source projects like ArgoCD, Prometheus, and others complement Komodor. Use what’s best for your environment and needs.

That’s all the time we have today. Thank you all for joining. Thank you, Udi, Nikki, and the Komodor team. If you have more questions or want to learn more, visit komodor.com. Have a great day, and keep your systems reliable.