In our previous post [Part I: 4 Best Practices to Migrate to Kubernetes], we focused on tips for making the transition and migration to Kubernetes a smoother, less painful process. In this post, we'd now like to provide some tips from the operational trenches for future-proofing your Kubernetes operation after making the move. Kubernetes, as a software-driven system, has many benefits for engineers and DevOps teams to take advantage of. However, as with all code, there will always be issues and things will always break - so you want to make the process of getting to the root cause as quick and easy as possible. By applying good practices from the design phase, through pre-deployment, and even in your supporting tools and systems, you can achieve faster recovery.

Deployment is Easy - Management is Hard

One of the primary goals when it comes to deploying code or applications is to make management and maintenance easy for engineers. Pressing deploy is the easy part; managing applications in production over the long term is the hard part, and even more so with a complex system like Kubernetes. This part will focus on five quick wins you should make a point of implementing to make your Kubernetes system easier to troubleshoot.

1. Maintaining Good YAML Hygiene (AKA Your K8s Deployment Manifest)

When working with YAML files, there is a lot of metadata you can include to help Kubernetes help you when troubleshooting. This includes setting the right labels and annotations; environment variables, secrets, and config maps that point to the proper objects and volumes; and liveness and readiness probes. These help K8s know when your app is healthy and ready to accept traffic, or otherwise alert you when there is an issue. Other important data includes cloud-specific configuration (e.g. node taints, tolerations, readiness gates) depending on the cloud provider you choose.
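To make this concrete, here is a minimal sketch of a Deployment manifest that applies these points. The service name `checkout`, the image, ports, probe paths, and the `checkout-secrets` / `checkout-config` objects are all hypothetical placeholders, not a prescription:

```yaml
# Illustrative Deployment manifest - names, images, and paths are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
  labels:
    app.kubernetes.io/name: checkout
    app.kubernetes.io/version: "1.4.2"
    environment: production
  annotations:
    team: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: checkout
  template:
    metadata:
      labels:
        app.kubernetes.io/name: checkout
        app.kubernetes.io/version: "1.4.2"
        environment: production
    spec:
      containers:
        - name: checkout
          image: example.registry.io/checkout:1.4.2
          ports:
            - containerPort: 8080
          env:
            - name: DB_PASSWORD            # pulled from a Secret, not hard-coded
              valueFrom:
                secretKeyRef:
                  name: checkout-secrets
                  key: db-password
          envFrom:
            - configMapRef:
                name: checkout-config      # non-sensitive settings live in a ConfigMap
          livenessProbe:                   # tells K8s when to restart the container
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:                  # tells K8s when the container can accept traffic
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
```

The same name, version, and environment labels also pay off later when you aggregate logs and metrics.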
2. Stateless Apps FTW!

With Kubernetes, a good practice is to build your applications to be stateless from the start. Explaining stateless applications requires a deeper dive (read more on stateful vs. stateless applications), but from an operational perspective it means that when you restart your application, it does not require any external information or data to run the same way it did before the restart, which makes it a lot easier to manage. Kubernetes was initially designed for stateless applications, so applications that rely on external state for initialization and startup have a much harder time adapting to Kubernetes operations; enabling similar capabilities (gracefully restarting after a crash, for example) requires much more engineering expertise, and many times cannot be done at all. Building stateless apps removes a lot of that risk and also gives you the benefits of elasticity: statelessness makes it possible to scale up or down simply by declaring it in your deployment configuration. It also helps with troubleshooting - when K8s detects a problem, it will keep restarting your container until it recovers, letting you resume operations as usual, seamlessly. This type of approach is much more prone to breakage with stateful applications.

3. Logging, but Specifically for Kubernetes

Logging is always an integral part of any application, but it becomes far more important with the complexity that K8s brings with it. Because K8s is distributed and can scale to a multitude of containers, clusters, and machines, aggregating and collecting logs is a daunting operation. You will be doing yourself a favor in future troubleshooting by being proactive with your logging and making sure to tag and label your logs properly. Good practices include adding the proper service name (rather than relying on volatile pod names), the version, and cluster environment information. There are also many K8s-specific tags that describe the environment the application is currently running in; these help with troubleshooting and understanding where an issue originated. For example, if you have an application with 500 replicas and an error occurs in one of them, figuring out which container it came from is nearly impossible if you haven't properly tagged and labeled your containers and logs.

4. Separation of Environments

There are plenty of best practices for managing environments in Kubernetes. The obvious and simplest is to create an environment for each stage of development: development, QA, staging, production. Another option is to separate environments logically using a dedicated Kubernetes resource called namespaces, which segregate environments by name, isolating their objects while still sharing the same underlying infrastructure. Because these environments run on the same shared resources, such as nodes, if one environment uses too much CPU or memory, your production environment has less to work with. One way to work around this is with node taints and tolerations, which let the K8s scheduler decide, according to your configuration, which nodes a workload should live on. This allows you to decide which environment runs on which machines, based on the resources it normally requires (see the sketch below). For someone just getting started with Kubernetes, separating the clusters entirely is likely the better practice. While it is more expensive, it delivers greater safety overall, in addition to being easier to launch and manage.
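As a rough sketch of the namespace-plus-taints approach: the `staging` namespace, node name, and the `env=staging` label and taint below are illustrative assumptions, not required values.

```yaml
# Hypothetical setup: a dedicated namespace for staging workloads.
apiVersion: v1
kind: Namespace
metadata:
  name: staging
---
# A Deployment pinned to nodes reserved for staging, assuming the nodes were
# prepared with, for example:
#   kubectl label nodes staging-node-1 env=staging
#   kubectl taint nodes staging-node-1 env=staging:NoSchedule
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
  namespace: staging
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: checkout
  template:
    metadata:
      labels:
        app.kubernetes.io/name: checkout
    spec:
      nodeSelector:
        env: staging              # only schedule onto nodes labeled for staging
      tolerations:
        - key: "env"
          operator: "Equal"
          value: "staging"
          effect: "NoSchedule"    # tolerate the taint that keeps other workloads off these nodes
      containers:
        - name: checkout
          image: example.registry.io/checkout:1.4.2
```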
5. Invest in Proper Monitoring

There are a lot of different ways to monitor applications on Kubernetes. You should first decide whether you want to use one of the many open source projects available or a commercial monitoring solution; each comes with its own benefits and challenges. There are several open source projects you can integrate with, the most popular being Prometheus with Grafana dashboards. While these open source projects provide a lot of flexibility and customizability, their learning curve is high, and it's not easy to configure them properly from the start so that you receive correct, actionable information about your cluster. The cost may be not having accurate information in real time when there is an issue, or even missing critical failures. Commercial options like Datadog and New Relic provide extensive documentation and plenty of out-of-the-box features for getting started rapidly. As an example, the moment you deploy the Datadog agent on your cluster, you get an integration with your cloud provider and are greeted with a default dashboard showing useful metrics about your cluster, along with relevant data about your cloud provider as well. Getting open source tooling to do the same is absolutely possible, but it takes more effort and familiarity with configuring that tooling.

Once you've chosen the type of monitoring solution that meets your business needs, you'll want to monitor three main aspects of K8s:

- Resources: CPU / memory usage
- Container status: up / down / errors / probe data / restart count
- Application metrics (APM): per-request or per-action metrics / latency / stack traces / custom metadata

While the first two bullets provide important information about your cluster, which is critical to have, the third - APM - gives you business-critical information about your application, which is the most important of all. Kubernetes is a distributed platform that scales enormously, so as you go from a couple of servers to tens and then hundreds, running hundreds or thousands of applications, finding the root cause of an issue manually is going to make you cry - it becomes hard and eventually impossible. You'll need monitoring tools to detect problems, alert on them, and understand the business logic of your applications.

Keep Calm and Manage Your Kubernetes

For those who have taken the Kubernetes plunge, we hope these operations and management tips come in handy when maintaining your systems over the long term. By optimizing your applications, processes, and tools for this complex system - from the config, to the code and architecture, through logging and monitoring - you'll be much better equipped to handle failure and troubleshoot when incidents occur. This is also the mission Komodor set for itself when building its dev tool: to automate and codify these best practices into its platform and enable developers and ops teams alike by pre-baking troubleshooting context and actionable data into deployments - in advance, not after the fact when every second matters. I hope these good practices help you bypass some of the pains I had to learn the hard way with every issue and failure - and you're also welcome to take the happy path and try Komodor as a quick way to get started.