This is one part of my coverage of KubeCon Austin 2017. Other articles include:

An overview of KubeCon + CloudNativeCon

Docker without Docker at Red Hat

Demystifying Container Runtimes

Monitoring with Prometheus 2.0 (this article)

Changes in Prometheus 2.0

The cost of hosting in the cloud

Monitoring with Prometheus and Grafana
Alerting and high availability
Issues and limitations

Prometheus is a monitoring tool built from scratch by SoundCloud in 2012. It works by pulling metrics from monitored services and storing them in a time series database (TSDB). It has a powerful query language to inspect that database, create alerts, and plot basic graphs. Those graphs can then be used to detect anomalies or trends for (possibly automated) resource provisioning. Prometheus also has extensive service discovery features and supports high availability configurations. That's what the brochure says, anyway; let's see how it works in the hands of an old grumpy system administrator. I'll be drawing comparisons with Munin and Nagios frequently because those are the tools I have used for over a decade in monitoring Unix clusters.

Monitoring with Prometheus and Grafana

What distinguishes Prometheus from other solutions is the relative simplicity of its design: for one, metrics are exposed over HTTP using a special URL (/metrics) and a simple text format. Here are, as an example, some network metrics for a test machine:

    $ curl -s http://curie:9100/metrics | grep 'node_network_.*_bytes'
    # HELP node_network_receive_bytes Network device statistic receive_bytes.
    # TYPE node_network_receive_bytes gauge
    node_network_receive_bytes{device="eth0"} 2.720630123e+09
    # HELP node_network_transmit_bytes Network device statistic transmit_bytes.
    # TYPE node_network_transmit_bytes gauge
    node_network_transmit_bytes{device="eth0"} 4.03286677e+08

In the above example, the metrics are named node_network_receive_bytes and node_network_transmit_bytes. They have a single label/value pair (device=eth0) attached to them, along with the value of the metrics themselves. These are just two of the couple of hundred metrics (usage of CPU, memory, disk, temperature, and so on) exposed by the "node exporter", a basic stats collector running on monitored hosts. Metrics can be counters (e.g. per-interface packet counts), gauges (e.g. temperature or fan sensors), or histograms. The latter allow, for example, 95th percentile analysis, something that has been missing from Munin forever and is essential to billing networking customers. Another popular use for histograms is maintaining an Apdex score, to make sure that N requests are answered in X time. The various metric types are carefully analyzed before being stored to correctly handle conditions like overflows (which occur surprisingly often on gigabit network interfaces) or resets (when a device restarts).
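To show how little is needed to read that text format, here is a minimal Python sketch of a parser. It is a simplification, not the official parser: it ignores the HELP and TYPE comments, does not unescape label values, and skips any line it cannot read. The names `parse_metrics`, `SAMPLE_RE`, and `LABEL_RE` are made up for this illustration.

```python
import re

# Matches lines such as: node_network_receive_bytes{device="eth0"} 2.720630123e+09
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# Matches one key="value" pair inside the braces (escapes are kept as-is).
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>(?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        labels = {
            m.group("key"): m.group("val")
            for m in LABEL_RE.finditer(match.group("labels") or "")
        }
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```

Feeding it the curl output above should yield one tuple per sample line, with the device label available as a regular dictionary.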

Those metrics are fetched from "targets", which are simply HTTP endpoints added to the Prometheus configuration file. Targets can also be added automatically through various discovery mechanisms: DNS, where a single A or SRV record lists all the hosts to monitor, or Kubernetes or cloud-provider APIs that list all containers or virtual machines to monitor. Discovery works in real time, so it will correctly pick up changes in DNS, for example. It can also add metadata (e.g. IP address found or server state), which is useful for dynamic environments such as Kubernetes or container orchestration in general.
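The DNS case can be pictured with a few lines of Python. This is only a sketch of the idea, not how Prometheus implements it (Prometheus does this natively through its `dns_sd_configs` setting): resolve one name, and treat every address behind it as a target. The function name `discover_targets` and the injectable `resolve` parameter are assumptions made for this example, the latter so the function can be tried without network access.

```python
import socket


def discover_targets(name, port, resolve=socket.getaddrinfo):
    """Return sorted "address:port" targets for every IPv4 address behind a name."""
    # Ask only for IPv4 stream sockets to keep the output simple.
    results = resolve(name, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each result is (family, type, proto, canonname, sockaddr); the address
    # is the first element of sockaddr. A set removes duplicate answers.
    addresses = {sockaddr[0] for _family, _type, _proto, _canon, sockaddr in results}
    return ["%s:%d" % (address, port) for address in sorted(addresses)]
```

Calling `discover_targets("nodes.example.com", 9100)` (a hypothetical record) would return one entry per A record; running it again after a DNS change would pick up the new list.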

Once collected, metrics can be queried through the web interface, using a custom language called PromQL. For example, a query showing the average bandwidth over the last minute for interface eth0 would look like:

    rate(node_network_receive_bytes{device="eth0"}[1m])

Notice the "device" label, which we use to restrict the search to a single interface. This query can also be plotted as a simple graph in the web interface.
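For readers wondering what rate() does with the raw counter values, the following Python sketch approximates it under simplifying assumptions: it takes the (timestamp, value) samples that fall inside the window, adds up the increases, and treats any drop in value as a counter reset. The real function also extrapolates to the edges of the window, which this sketch (with its made-up name `simple_rate`) does not attempt.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset: the counter is assumed
    to have restarted from zero, so the new value counts as the increase.
    """
    if len(samples) < 2:
        return None  # not enough data points in the window
    increase = 0.0
    previous = samples[0][1]
    for _timestamp, value in samples[1:]:
        if value < previous:
            increase += value  # reset: count from zero
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Five samples taken 15 seconds apart, growing by 1500 bytes each time.
window = [(0, 1000), (15, 2500), (30, 4000), (45, 5500), (60, 7000)]
print(simple_rate(window))  # prints: 100.0
```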

What is interesting here is not really the node exporter metrics themselves, as those are fairly standard in any monitoring solution. But in Prometheus, any (web) application can easily expose its own internal metrics to the monitoring server through regular HTTP, whereas other systems would require special plugins, on both the monitoring server and the application side. Note that Munin follows a similar pattern, but uses its own text protocol on top of TCP, which means it is harder to implement for web apps and diagnose with a web browser.
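As a rough idea of what "exposing metrics through regular HTTP" means for an application, here is a sketch using only the Python standard library. Real applications would more likely use one of the official client libraries; everything named here (`myapp_requests_total`, `REQUEST_COUNT`, the `serve()` helper, port 8000) is hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application state; a real application would update this itself.
REQUEST_COUNT = {"GET": 0, "POST": 0}


def render_metrics(counts):
    """Render request counters in the text format shown earlier."""
    lines = [
        "# HELP myapp_requests_total Requests handled, by method.",
        "# TYPE myapp_requests_total counter",
    ]
    for method, count in sorted(counts.items()):
        lines.append('myapp_requests_total{method="%s"} %d' % (method, count))
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the conventional /metrics path is served.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNT).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted; the port is an arbitrary choice."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

After calling `serve()`, the endpoint can be inspected with curl or a web browser just like the node exporter above, and added to Prometheus as another target.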

However, coming from the world of Munin, where all sorts of graphs just magically appear out of the box, this first experience can be a bit of a disappointment: everything is built by hand and ephemeral. While there are ways to add custom graphs to the Prometheus web interface using Go-based console templates, most Prometheus deployments use Grafana to render the results using custom-built dashboards. This gives much better results, and allows graphing multiple machines separately, using the Node Exporter Server Metrics dashboard.

All this work took roughly an hour of configuration, which is pretty good for a first try. Things get tougher when extending those basic metrics: because of the system's modularity, it is difficult to add new metrics to existing dashboards. For example, web or mail servers are not monitored by the node exporter. So monitoring a web server involves installing an Apache-specific exporter that needs to be added to the Prometheus configuration. But it won't show up automatically in the above dashboard, because that's a "node exporter" dashboard, not an Apache dashboard. So you need a separate dashboard for that. This is all work that's done automatically in Munin without any hand-holding.

Even then, Apache is a relatively easy one; monitoring some arbitrary server not supported by a custom exporter will require installing a program like mtail, which parses the server's logfiles to expose some metrics to Prometheus. There doesn't seem to be a way to write quick "run this command to count files" plugins that would allow administrators to write quick hacks. One option is to write a new exporter using the client libraries, which seems to be a rather large undertaking for non-programmers. Another is the node exporter's textfile option, which reads arbitrary metrics from plain text files in a directory. It's not as direct as running a shell command, but may be good enough for some use cases. Besides, there are a large number of exporters already available, including ones that can tap into existing Nagios and Munin servers to allow for a smooth transition.
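To give an idea of what the textfile approach looks like for the "count files" case, here is a small Python sketch meant to be run from cron. It assumes the node exporter was started with its textfile directory option pointing at `textfile_dir` (the exact flag spelling depends on the node exporter version) and that it picks up files ending in `.prom`. The metric name `myhost_files_count` and the function name are invented for the example, and directory names containing quotes or backslashes would need escaping that is not done here.

```python
import os
import tempfile


def write_file_count(watched_dir, textfile_dir, metric="myhost_files_count"):
    """Count entries in watched_dir and publish the result as a .prom file."""
    count = len(os.listdir(watched_dir))
    body = (
        "# HELP %s Number of entries in a watched directory.\n"
        "# TYPE %s gauge\n"
        '%s{directory="%s"} %d\n' % (metric, metric, metric, watched_dir, count)
    )
    # Write to a temporary file first, then rename it into place, so the
    # node exporter never reads a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=textfile_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    # mkstemp creates private files; make the result readable by the exporter.
    os.chmod(tmp_path, 0o644)
    target = os.path.join(textfile_dir, metric + ".prom")
    os.replace(tmp_path, target)
    return target
```

The value only changes when the script runs, so the cron interval sets the effective resolution of the metric, whatever the Prometheus scrape interval is.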

Unfortunately, those exporters will only give you metrics, not graphs. To graph metrics from a third-party Postfix exporter, a graph must be created by hand in Grafana, with a magic PromQL formula. This may involve too much clicking around in a web browser for grumpy old administrators. There are tools like Grafanalib to programmatically create dashboards, but those also involve a lot of boilerplate. When building a custom application, however, creating graphs may actually be a fun and distracting task that some may enjoy. The Grafana/Prometheus design is certainly enticing and enables powerful abstractions that are not readily available with other monitoring systems.

Alerting and high availability

So far, we've worked only with a single server, and have only done graphing. But Prometheus also supports sending alarms when things go bad. After working over a decade as a system administrator, I have mixed feelings about "paging" or "alerting" as it's called in Prometheus. Regardless of how well the system is tweaked, I have come to believe it is basically impossible to design a system that will respect workers and not torture on-call personnel through sleep deprivation. It seems it's a feature people want regardless, especially in the enterprise, so let's look at how it works here.

In Prometheus, you design alerting rules using PromQL. For example, to warn operators when a network interface is close to saturation, we could set the following rule:

    alert: HighBandwidthUsage
    expr: rate(node_network_transmit_bytes{device="eth0"}[1m]) > 0.95*1e+09
    for: 5m
    labels:
      severity: critical
    annotations:
      description: 'Unusually high bandwidth on interface {{ $labels.device }}'
      summary: 'High bandwidth on {{ $labels.instance }}'

Those rules are evaluated regularly, and the alerts they trigger are sent to an Alertmanager daemon that can receive alerts from multiple Prometheus servers. The Alertmanager then deduplicates multiple alerts, regroups them (so a single notification is sent even if multiple alerts are received), and sends the actual notifications through various services like email, PagerDuty, Slack or an arbitrary webhook.
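The deduplication and grouping step can be illustrated with a short Python sketch. This is a conceptual model only, not the Alertmanager's actual implementation: alerts with identical label sets are considered the same alert, and the survivors are bucketed by a list of labels, loosely mirroring the `group_by` setting of the Alertmanager configuration. The function name `group_alerts` and the alert dictionaries are assumptions for this example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname",)):
    """Deduplicate alerts by label set, then bucket them by the group_by labels."""
    unique = {}
    for alert in alerts:
        # Two alerts with identical labels are the same alert, even when
        # reported by different Prometheus servers.
        unique[frozenset(alert["labels"].items())] = alert
    groups = defaultdict(list)
    for alert in unique.values():
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


incoming = [
    {"labels": {"alertname": "HighBandwidthUsage", "instance": "a"}},
    {"labels": {"alertname": "HighBandwidthUsage", "instance": "a"}},  # duplicate
    {"labels": {"alertname": "HighBandwidthUsage", "instance": "b"}},
]
for key, members in group_alerts(incoming).items():
    print(key, len(members))  # prints: ('HighBandwidthUsage',) 2
```

Each group would then result in a single notification, rather than one page per affected interface or host.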

The Alertmanager has a "gossip protocol" to enable multiple instances to coordinate notifications. This design allows you to run multiple Prometheus servers in a federation model, all simultaneously collecting metrics, and sending alerts to redundant Alertmanager instances to create a highly available monitoring system. Those who have struggled with such setups in Nagios will surely appreciate the simplicity of this design.

The downside is that Prometheus doesn't ship a set of default alerts and exporters do not define default alerting thresholds that could be used to create rules automatically. The Prometheus documentation also lacks examples that the community could use, so alerting is harder to deploy than in classic monitoring systems.

Issues and limitations

Prometheus is already well-established: Cloudflare, Canonical and (of course) SoundCloud are all (still) using it in production. It is a common monitoring tool used in Kubernetes deployments because of its discovery features. Prometheus is, however, not a silver bullet and may not be the best tool for all workloads.

In particular, Prometheus is not designed for long-term storage. By default, it keeps samples for only two weeks, which seems rather small to old system administrators who are used to RRDtool databases that efficiently store samples for years. As a comparison, my test Prometheus instance is taking up as much space for five days of samples as Munin, which has samples for the last year. Of course, Munin only collects metrics every five minutes while Prometheus samples all targets every 15 seconds by default. Even so, this difference in sizes shows that Prometheus's disk requirements are much larger than traditional RRDtool implementations because it lacks native down-sampling facilities. Therefore, retaining samples for more than a year (which is a Munin limitation I was hoping to overcome) will be difficult without some serious hacking to selectively purge samples or adding extra disk space.
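The sampling intervals alone explain a good part of that gap, as a bit of arithmetic shows. The sketch below only counts samples per time series, using the two intervals mentioned above; it says nothing about bytes on disk, which depend on compression and on how many series each tool collects.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(interval_seconds):
    """Number of samples one time series produces per day at a given interval."""
    return SECONDS_PER_DAY // interval_seconds


prometheus = samples_per_day(15)  # the 15-second interval cited above
munin = samples_per_day(5 * 60)   # Munin's five-minute polling
ratio = prometheus // munin
print(prometheus, munin, ratio)  # prints: 5760 288 20
```

In other words, before any down-sampling happens on the RRDtool side, each Prometheus series already carries twenty times as many samples per day as its Munin counterpart.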

The project documentation recognizes this and suggests using alternatives:

"Prometheus's local storage is limited in its scalability and durability. Instead of trying to solve long-term storage in Prometheus itself, Prometheus has a set of interfaces that allow integrating with remote long-term storage systems."

Prometheus in itself delivers good performance: a single instance can support over 100,000 samples per second. When a single server is not enough, servers can federate to cover different parts of the infrastructure. And when that is not enough, sharding is possible. In general, performance is dependent on avoiding variable data in labels, which keeps the cardinality of the dataset under control, but the dataset size will grow with time regardless. So long-term storage is not Prometheus's strongest suit. But starting with 2.0, Prometheus can finally write to (and read from) external storage engines that can be more efficient than Prometheus. InfluxDB, for example, can be used as a backend and supports time-based down-sampling that makes long-term storage manageable. This deployment, however, is not for the faint of heart.
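The warning about variable data in labels is easier to appreciate with numbers. In the worst case, each distinct combination of label values becomes its own time series, so the series count for one metric is bounded by the product of the number of values per label. The following sketch computes that upper bound; the label names and counts are made up for illustration.

```python
from math import prod


def series_count(label_values):
    """Upper bound on time series for one metric: product of values per label."""
    return prod(len(values) for values in label_values.values())


# Two interfaces on ten hosts: a perfectly manageable number of series.
labels = {"device": ["eth0", "eth1"], "instance": range(10)}
print(series_count(labels))  # prints: 20

# Adding a hypothetical per-user label multiplies everything by its cardinality.
labels["user_id"] = range(1000)
print(series_count(labels))  # prints: 20000
```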

Also, security freaks can't help but notice that all this is happening over a clear-text HTTP protocol. Indeed, that is by design: "Prometheus and its components do not provide any server-side authentication, authorisation, or encryption. If you require this, it is recommended to use a reverse proxy." The issue is punted to a layer above, which is fine for the web interface: it is, after all, just a few Prometheus instances that need to be protected. But for monitoring endpoints, this is potentially hundreds of services that are available publicly without any protection. It would be nice to have at least IP-level blocking in the node exporter, although this could also be accomplished through a simple firewall rule.

There is a large empty space for Prometheus dashboards and alert templates. Whereas tools like Munin or Nagios had years to come up with lots of plugins and alerts, and to converge on best practices like "70% disk usage is a warning but 90% is critical", those things all need to be configured manually in Prometheus. Prometheus should aim at shipping standard sets of dashboards and alerts for built-in metrics, but the project currently lacks the time to implement those.

The Grafana list of Prometheus dashboards shows one aspect of the problem: there are many different dashboards, sometimes multiple ones for the same task, and it's unclear which one is the best. There is therefore space for a curated list of dashboards and a definite need for expanding those to feature more extensive coverage.

As a replacement for traditional monitoring tools, Prometheus may not be quite there yet, but it will get there and I would certainly advise administrators to keep an eye on the project. Besides, Munin and Nagios feature-parity is just a requirement from an old grumpy system administrator. For hip young application developers smoking weird stuff in containers, Prometheus is the bomb. Just take for example how GitLab started integrating Prometheus, not only to monitor GitLab.com itself, but also to monitor the continuous-integration and deployment workflow. By integrating monitoring into development workflows, developers are immediately made aware of the performance impacts of proposed changes. Performance regressions can therefore be identified quickly, which is a powerful tool for any application.

Whereas system administrators may want to wait a bit before converting existing monitoring systems to Prometheus, application developers should certainly consider deploying Prometheus to instrument their applications; it will serve them well.

This article first appeared in the Linux Weekly News.

Created 2018-01-16 19:00. Edited 2020-03-25 22:31.