Kubernetes Prometheus: How It Works and 4 Critical Best Practices

What Is Prometheus? 

Prometheus is a set of open-source tools for monitoring and alerting in containerized and microservices-based environments. It is most commonly used with Kubernetes but also supports other cloud-native environments. With Kubernetes Prometheus monitoring, you can configure live notification feeds and run flexible monitoring queries. It provides visibility into containerized applications, APIs, and workloads, which are typically difficult to observe given their complex, distributed nature.

Prometheus can also help you implement security for cloud-native applications by detecting unusual traffic and behaviors that may indicate a threat or escalate into a cyberattack.

This is part of a series of articles about Kubernetes monitoring.

Monitoring Kubernetes Clusters with Prometheus 

Prometheus uses a pull-based model to retrieve information. It sends HTTP scrape requests to its targets at regular intervals, based on the scrape configuration defined in its deployment. It then parses the responses and stores the resulting samples as time series alongside the relevant Kubernetes metadata. Prometheus stores this data in its own purpose-built time-series database, allowing it to handle large volumes of data. This design allows you to simultaneously monitor thousands of virtual machines from the same server.
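To make the scrape step more concrete, here is a minimal sketch of how a scrape response in the Prometheus text format could be parsed into samples. It is an illustration only, not how Prometheus itself is implemented: the function name `parse_scrape` and the example metrics are assumptions, and the parser skips edge cases such as timestamps and escaped characters in label values.

```python
import re

# Simplified patterns for one sample line: name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_scrape(body):
    """Turn a scrape response body into a list of (name, labels, value) tuples."""
    samples = []
    for line in body.splitlines():
        line = line.strip()
        # Blank lines and "# HELP" / "# TYPE" comment lines carry no samples.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Hypothetical response body from a scraped target.
EXAMPLE_BODY = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="500"} 3
process_open_fds 12
"""

for name, labels, value in parse_scrape(EXAMPLE_BODY):
    print(name, labels, value)
```

Each parsed sample corresponds to one point in one time series, identified by the metric name plus its label set.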

Exporters

Before Prometheus can collect data, the data must be correctly formatted and exposed over HTTP. Prometheus can retrieve data directly from an application instrumented with a client library, or through an exporter. An exporter is a software component that runs alongside an application you cannot instrument directly, such as third-party software. Exporters accept the HTTP requests that Prometheus sends, translate the application’s data into a Prometheus-supported format, and return the relevant data to the central server.
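As a rough illustration of what an exporter does, the sketch below translates statistics from an application into the Prometheus text format and serves them on `/metrics` using only the Python standard library. The function `read_app_stats`, the metric names prefixed with `demo_`, and port 9200 are all hypothetical; a real exporter would normally be built on an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def read_app_stats():
    """Hypothetical stand-in for stats read from an application
    that cannot be instrumented directly."""
    return {"queue_depth": 7, "jobs_done": 1520}


def render_metrics(stats):
    """Translate the raw stats into the Prometheus text exposition format."""
    lines = [
        "# HELP demo_queue_depth Jobs currently waiting in the queue.",
        "# TYPE demo_queue_depth gauge",
        f"demo_queue_depth {stats['queue_depth']}",
        "# HELP demo_jobs_done_total Jobs completed since start.",
        "# TYPE demo_jobs_done_total counter",
        f"demo_jobs_done_total {stats['jobs_done']}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes /metrics by default; reject everything else.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(read_app_stats()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9200):
    """Start the exporter; the port number is an arbitrary choice for this sketch."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` would start the endpoint, and the stats are read fresh on every scrape so that Prometheus always receives current values.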

Service Discovery

With exporters attached, each application can return data to Prometheus, but you still need to tell Prometheus where to find the data. Prometheus uses service discovery to identify targets for scraping data. 

A Kubernetes cluster should already have labels and annotations, making it easier to keep track of each element’s status and any changes. The Kubernetes API enables Prometheus to discover data targets, including: 

  • Nodes
  • Endpoints
  • Services
  • Pods
  • Ingress
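Once targets are discovered, you usually keep only the ones that should be scraped. The sketch below mimics that filtering step for pods, assuming the widely used (but not built-in) `prometheus.io/scrape`, `prometheus.io/port`, and `prometheus.io/path` annotation convention. The pod dictionaries, the `select_targets` function, and the default port are hypothetical; in a real deployment this logic lives in the Prometheus scrape configuration rather than in application code.

```python
def select_targets(pods, default_port="8080", default_path="/metrics"):
    """Return scrape URLs for pods that opt in via annotations.

    The defaults are arbitrary choices for this sketch.
    """
    targets = []
    for pod in pods:
        annotations = pod.get("annotations", {})
        # Keep only pods that explicitly ask to be scraped.
        if annotations.get("prometheus.io/scrape") != "true":
            continue
        port = annotations.get("prometheus.io/port", default_port)
        path = annotations.get("prometheus.io/path", default_path)
        targets.append(f"http://{pod['ip']}:{port}{path}")
    return targets


# Hypothetical pods as they might be returned by the Kubernetes API.
discovered_pods = [
    {"name": "api-1", "ip": "10.0.0.5",
     "annotations": {"prometheus.io/scrape": "true", "prometheus.io/port": "9102"}},
    {"name": "batch-1", "ip": "10.0.0.6", "annotations": {}},
    {"name": "web-1", "ip": "10.0.0.7",
     "annotations": {"prometheus.io/scrape": "true", "prometheus.io/path": "/stats"}},
]

for url in select_targets(discovered_pods):
    print(url)
```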

Metrics

Prometheus retrieves machine and application metrics separately. To expose machine-level information such as CPU, memory, bandwidth, and disk metrics, you need to run a node exporter on each node. You also need to expose container-level metrics related to cgroups. The easiest option for this is cAdvisor, which is built into the kubelet on every Kubernetes node. 

After Prometheus has collected the data, you can use PromQL to query it. The results of PromQL queries can be visualized in a graphical interface like Grafana, and PromQL-based alerting rules can trigger alerts that are routed through Alertmanager.


Tips from the expert

Itiel Shwartz

Co-Founder & CTO

Itiel is the CTO and co-founder of Komodor. He’s a big believer in dev empowerment and moving fast, and has worked at eBay, Forter and Rookout (as the founding engineer). Itiel is a backend and infra developer turned “DevOps”, and an avid public speaker who loves talking about things such as cloud infrastructure, Kubernetes, Python, observability, and R&D culture.

In my experience, here are tips that can help you maximize the effectiveness of Prometheus for monitoring Kubernetes:

Leverage Prometheus Federation

Use Prometheus federation to scale your monitoring setup by aggregating data from multiple Prometheus instances. This allows for centralized monitoring of large environments and ensures that data collection remains efficient and manageable.

Optimize Data Retention Policies

Set appropriate data retention policies to balance the need for historical data with storage costs. Use remote storage solutions for long-term retention if necessary, to offload the main Prometheus server and improve performance.

Implement High Availability (HA) Setup

Deploy Prometheus in a high-availability configuration to ensure continuous monitoring even during server failures. This typically involves running multiple Prometheus instances with the same configuration, each scraping the same targets independently, and placing a load balancer in front of them to distribute queries.

Use Recording Rules for Pre-Computed Metrics

Define recording rules in Prometheus to pre-compute frequently queried metrics. This reduces query load and speeds up dashboard rendering, especially for complex or resource-intensive queries.

Monitor Prometheus Itself

Set up monitoring for your Prometheus servers to track their health and performance. Use alerting rules to notify you of any issues such as high CPU usage, large query execution times, or storage bottlenecks.

Kubernetes Prometheus Pros and Cons 

Among the advantages of Prometheus are:

  • Kubernetes integration: Prometheus is tightly integrated with Kubernetes because Kubernetes components expose their metrics in the Prometheus format. Both Kubernetes and Prometheus are CNCF projects, making Prometheus the obvious choice for Kubernetes monitoring.
  • Open-source: Prometheus is free to use and backed by an active open-source community, which makes it easier to operate and to find support. 
  • User friendly: The API and query language are easy to use.
  • Range of metrics: Prometheus has a large number of libraries and exporters to collect various application and machine metrics, including community-built exporters that extend Prometheus’ monitoring coverage. 
  • Simplified data collection: Prometheus uses a standardized pull-based approach to collect time-series data. 

Among the disadvantages of Prometheus are:

  • Limited data model: Prometheus uses a time-series data collection model by default, which could be constraining and result in missing context. It is only suitable for monitoring pure telemetry.
  • Non-granular data: The pull-based data collection model relies on exporters to summarize and provide data, which might be limited in terms of granularity. Prometheus scrapes data periodically, so you might miss short-lived changes that occur between scrapes.
  • Compatibility: Prometheus works well for Kubernetes but it is not always compatible with multi-generational or legacy infrastructure. For example, the pull-based approach requires poking holes in a traditional network’s firewalls.
  • Ephemeral infrastructure: Prometheus uses service discovery libraries to keep updated with the monitored platform (i.e., Kubernetes), but this can introduce some lag due to the ephemerality of resources and the intervals between metric scrapes.
  • Lack of encryption: By default, data collection in Prometheus is unencrypted and unauthenticated, meaning that any entity with access to your network can also see the telemetry, including metrics and labels, unless you configure TLS and authentication. 

4 Best Practices for Monitoring Kubernetes with Prometheus 

Here are some best practices to make the most of Prometheus to monitor Kubernetes. 

1. Restrict the Use of Labels

Labels allow you to specify the data and context for your metrics. However, each unique combination of label values creates a separate time series, which takes up resources such as CPU, RAM, bandwidth, and disk space. While insignificant at a small scale, the resources consumed can build up for large Kubernetes projects, driving up costs. 

As a general guideline, keep the cardinality of each metric (the number of distinct label value combinations) below 10, and allow only a handful of metrics to exceed that. Most metrics should need no labels at all. If some of your metrics have very high cardinality, you might benefit from using a dedicated analysis tool instead. 
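To see why label use adds up, it helps to work through the arithmetic. In the worst case, the number of time series for a single metric is the product of the number of distinct values of each label. The sketch below computes that upper bound; the label names and value counts are made up for illustration.

```python
from math import prod


def max_series_count(label_values):
    """Worst-case number of time series for one metric.

    label_values maps each label name to the list of values it can take.
    A metric without labels is exactly one series (the product of nothing is 1).
    """
    return prod(len(values) for values in label_values.values())


# Hypothetical labels on an HTTP request counter.
modest = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "code": ["200", "400", "404", "500", "503"],
}
print(max_series_count(modest))

# Adding one unbounded label, such as a user ID, multiplies everything.
risky = dict(modest, user_id=[str(i) for i in range(10_000)])
print(max_series_count(risky))
```

In this example the modest label set yields at most 20 series, while adding the user ID label raises the bound to 200,000 for the same metric, which is why unbounded values such as user IDs are poor label candidates.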

2. Take Care When Using Timestamps 

A common mistake is to export the time elapsed since an event occurred. When tracking the timing of events, you should instead export the Unix timestamp that marks when each event happened. This approach removes the need for update logic that keeps an elapsed-time value current, and it minimizes errors if that logic stalls. You can then determine the time elapsed since the event at query time by calculating: 

time() - my_timestamp_metric 
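The following sketch illustrates why exporting the timestamp is more robust. The application records the event time once, and the elapsed time is derived whenever a query runs, so nothing has to keep updating the value. The `TimestampGauge` class is a hypothetical stand-in for a gauge from a client library, and the fixed Unix times are arbitrary.

```python
import time


class TimestampGauge:
    """Hypothetical stand-in for a gauge that stores a Unix timestamp."""

    def __init__(self):
        self.value = 0.0

    def set_to_current_time(self, now=None):
        # Record when the event happened; 'now' can be injected for testing.
        self.value = time.time() if now is None else now


def seconds_since(event_timestamp, now):
    """Equivalent of the PromQL expression time() - my_timestamp_metric."""
    return now - event_timestamp


my_timestamp_metric = TimestampGauge()
my_timestamp_metric.set_to_current_time(now=1_700_000_000.0)  # the event happens once

# Evaluated later at two different query times, with no update from the application.
print(seconds_since(my_timestamp_metric.value, now=1_700_000_030.0))
print(seconds_since(my_timestamp_metric.value, now=1_700_000_090.0))
```

If the application hangs after the event, the derived elapsed time keeps growing, which is exactly the signal you want an alert to pick up.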


3. Limit the Metrics Called in Inner Loops

You should limit the number of metrics updated in performance-critical or frequently called code (e.g., code that runs over 100k times per second). Incrementing a counter in a Java application usually takes 12-17ns, depending on contention, and this cost can cause performance issues when compounded. By limiting the number of metrics updated in inner loops (and using fewer labels), you can prevent such issues. 

When you do need labels, you can cache the result of the label lookup to minimize its impact. It is also important to be careful when using time and duration metrics because these measurements may require syscalls.
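The label-caching advice can be sketched as follows. The `Counter` and `Child` classes below are simplified, hypothetical stand-ins that only imitate the general shape of a client library counter, where resolving a label set involves a dictionary lookup. The point is that the lookup happens once, outside the hot loop.

```python
class Child:
    """One time series: a counter value for a single label combination."""

    __slots__ = ("value",)

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        self.value += amount


class Counter:
    """Hypothetical labeled counter; labels() performs the costly lookup."""

    def __init__(self):
        self._children = {}

    def labels(self, **labels):
        key = tuple(sorted(labels.items()))
        child = self._children.get(key)
        if child is None:
            child = self._children[key] = Child()
        return child


def process(items, counter):
    # Resolve the label set once, then reuse the cached child in the loop.
    handled = counter.labels(stage="parse")
    for _ in items:
        handled.inc()


items_handled = Counter()
process(range(1000), items_handled)
print(items_handled.labels(stage="parse").value)
```

Calling `counter.labels(stage="parse").inc()` inside the loop would produce the same count but repeat the key construction and lookup on every iteration.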

4. Understand the Metrics Available

There are four key types of metric to use in Prometheus, and it is important to know when to use each type of metric to ensure the most accurate, complete insights: 

  • Counters: These only ever increase, and reset to zero when the process restarts. Counters are useful for counting events, such as requests served or errors encountered. 
  • Gauges: These can go up and down, making them useful for providing point-in-time values like temperature, requests in progress, or memory usage.
  • Histograms: These sample observations and count them in configurable buckets, while also providing a count and a sum of all the values observed. A histogram is useful when you need to aggregate data across instances.
  • Summaries: These work in a similar way to histograms, but they calculate quantiles on the client side over sliding time windows. A summary can be useful when you need an accurate quantile for metrics whose range of values, and therefore suitable buckets, is unknown. Because summary quantiles cannot be aggregated across instances, summaries are the least commonly used metric type.
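To illustrate the difference between the simpler types and a histogram, here is a minimal sketch of how a histogram could track observations in cumulative buckets together with a sum and a count. The `Histogram` class and the bucket boundaries are assumptions for this example, not the implementation used by any client library.

```python
import bisect
import math


class Histogram:
    """Hypothetical histogram with upper-bound ('le') buckets."""

    def __init__(self, bounds):
        # Every histogram gets a final +Inf bucket that catches all values.
        self.bounds = sorted(bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Find the first bucket whose upper bound is >= the observed value.
        index = bisect.bisect_left(self.bounds, value)
        self.bucket_counts[index] += 1
        self.sum += value
        self.count += 1

    def cumulative_buckets(self):
        """Return (upper_bound, cumulative_count) pairs, as exposed to Prometheus."""
        total = 0
        result = []
        for bound, bucket_count in zip(self.bounds, self.bucket_counts):
            total += bucket_count
            result.append((bound, total))
        return result


latency = Histogram(bounds=[0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.4, 2.0):
    latency.observe(seconds)

for bound, cumulative in latency.cumulative_buckets():
    print(f"le={bound}: {cumulative}")
print("count:", latency.count)
```

Because the buckets are plain counts, histograms from several instances can be combined by adding the counts bucket by bucket, which is what makes them suitable for aggregation.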

Managing Kubernetes Prometheus with Komodor

Prometheus has many advantages, and the biggest are that it’s free, relatively easy to understand, and provides good Kubernetes observability into your stack. On the other hand, you have to invest time in manually managing Prometheus as well as money in hosting it. You’ll also have to spend considerable engineering resources to scale Prometheus beyond a certain range. As we all know, with scale comes complexity, which makes it more likely that things will break down.

This is where Komodor comes in as a native K8s platform, helping you monitor your entire K8s stack, identify issues, uncover their root cause and understand the necessary action to troubleshoot efficiently and independently. To learn more about how Komodor can make it easier to empower you and your teams to troubleshoot K8s, sign up for our free trial.
