This is one part of my coverage of KubeCon Austin 2017.

  1. Changes in Prometheus 2.0
    1. What changed
    2. The migration path
    3. Remaining limitations and future

2017 was a big year for the Prometheus project, as it published its 2.0 release in November. The new release ships numerous bug fixes, new features and, notably, a new storage engine that brings major performance improvements. This comes at the cost of incompatible changes to the storage and configuration-file formats. An overview of Prometheus and its new release was presented to the Kubernetes community in a talk held during KubeCon + CloudNativeCon. This article covers what changed in this new release and what is brewing next in the Prometheus community; it is a companion to an earlier article, which provided a general introduction to monitoring with Prometheus.

What changed

Orchestration systems like Kubernetes regularly replace entire fleets of containers for deployments, which means rapid changes in parameters (or "labels" in Prometheus-talk) like hostnames or IP addresses. This was creating significant performance problems in Prometheus 1.0, which wasn't designed for such changes. To correct this, Prometheus ships a new storage engine that was specifically designed to handle continuously changing labels. This was tested by monitoring a Kubernetes cluster where 50% of the pods would be swapped every 10 minutes; the new design was proven to be much more effective. The new engine boasts a hundred-fold I/O performance improvement, a three-fold improvement in CPU, five-fold in memory usage, and increased space efficiency. This impacts container deployments, but it also means improvements for any other configuration. Anecdotally, there was no noticeable extra load on the servers where I deployed Prometheus, at least nothing that the previous monitoring tool (Munin) could detect.

Prometheus 2.0 also brings new features like snapshot backups. The project has a longstanding design wart regarding data volatility: backups are deemed to be unnecessary in Prometheus because metrics data is considered disposable. According to Goutham Veeramanchaneni, one of the presenters at KubeCon, "this approach apparently doesn't work for the enterprise". Backups were possible in 1.x, but they involved using filesystem snapshots and stopping the server to get a consistent view of the on-disk storage. This implied downtime, which was unacceptable for certain production deployments. Thanks again to the new storage engine, Prometheus can now perform fast and consistent backups, triggered through the web API.
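As a rough sketch of what triggering such a backup from a script might look like, the Python snippet below builds the HTTP request without sending it. The endpoint path is the one documented for the admin API in later 2.x releases, where that API also has to be enabled explicitly on the server; the path used by 2.0 itself may differ, so both the path and the localhost:9090 address should be read as assumptions.

```python
import urllib.request

# Assumption: admin API path as documented for later 2.x releases;
# the 2.0 release itself may expose the snapshot call elsewhere.
SNAPSHOT_PATH = "/api/v1/admin/tsdb/snapshot"


def build_snapshot_request(base_url="http://localhost:9090"):
    """Build (but do not send) a POST request asking for a TSDB snapshot."""
    url = base_url.rstrip("/") + SNAPSHOT_PATH
    return urllib.request.Request(url, data=b"", method="POST")


if __name__ == "__main__":
    req = build_snapshot_request()
    print(req.get_method(), req.full_url)
    # To actually trigger the snapshot, one would call:
    #   urllib.request.urlopen(req)
```

Keeping the request construction separate from the network call makes it easy to check what would be sent before pointing the script at a production server.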

Another improvement is a fix to the longstanding staleness-handling bug where it would take up to five minutes for Prometheus to notice when a target disappeared. In that case, when polling for new values (or "scraping" as it's called in Prometheus jargon), a failure would make Prometheus reuse the older, stale value, which meant that downtime would go undetected for too long and fail to trigger alerts properly. This would also cause problems with double-counting of some metrics when labels vary in the same measurement.
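The difference can be pictured with a toy model in Python. This is not Prometheus code: it assumes that the old behavior amounts to "use the latest sample seen in the last five minutes" and that the fix works by recording an explicit marker when a series goes away. The STALE marker, the value_at() helper and the sample data are all hypothetical names made up for the illustration.

```python
STALE = object()  # hypothetical marker meaning "this series went away"
LOOKBACK = 300    # five minutes, in seconds


def value_at(samples, t, lookback=LOOKBACK):
    """Return the value visible at time t, or None if the series is absent.

    samples is a list of (timestamp, value) pairs sorted by timestamp;
    a value may be the STALE marker.
    """
    latest = None
    for ts, value in samples:
        if ts <= t:
            latest = (ts, value)
    # No sample yet, or the latest one is older than the lookback window.
    if latest is None or t - latest[0] > lookback:
        return None
    # An explicit marker hides the series immediately.
    if latest[1] is STALE:
        return None
    return latest[1]


# A target scraped every 15 seconds that disappears after t=30.
old_behavior = [(0, 1.0), (15, 1.0), (30, 1.0)]
new_behavior = old_behavior + [(45, STALE)]

if __name__ == "__main__":
    print(value_at(old_behavior, 60))  # the last value is still reused
    print(value_at(new_behavior, 60))  # the series is already gone
```

In the first series the last value stays visible until the five-minute window runs out, which is the delay described above; in the second, the series vanishes as soon as the marker is written.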

Another limitation related to staleness is that Prometheus wouldn't work well with scrape intervals above two minutes (instead of the default 15 seconds). Unfortunately, that is still not fixed in Prometheus 2.0 as the problem is more complicated than originally thought, which means there's still a hard limit to how slowly you can fetch metrics from targets. This, in turn, means that Prometheus is not well suited for devices that cannot support sub-minute refresh rates, which, to be fair, is rather uncommon. For slower devices or statistics, a solution might be the node exporter "textfile support", which we mentioned in the previous article, and the pushgateway daemon, which allows pushing results from the targets instead of having the collector pull samples from targets.
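For the textfile route, a slow job can simply drop its result in a file that the node exporter picks up on the next scrape. The Python sketch below assumes the node exporter has been configured with a textfile directory and reads files ending in .prom from it; the metric name and help text are hypothetical, and only a single gauge is handled.

```python
import os
import tempfile


def write_textfile_metric(directory, name, value, help_text):
    """Atomically write one gauge in the Prometheus text format.

    Returns the path of the written .prom file.
    """
    body = (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )
    final_path = os.path.join(directory, name + ".prom")
    # Write to a temporary file in the same directory, then rename it, so
    # that the exporter never reads a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    # mkstemp creates private files; make the result world-readable so a
    # separate exporter user can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, final_path)
    return final_path


if __name__ == "__main__":
    # Hypothetical example: record when a nightly backup last succeeded.
    demo_dir = tempfile.mkdtemp()
    path = write_textfile_metric(
        demo_dir,
        "backup_last_success_timestamp_seconds",
        1512345678,
        "Time of the last successful backup",
    )
    print(open(path).read())
```

Since the exporter serves whatever is in the file at scrape time, the job itself can run hourly or daily while Prometheus keeps its usual short scrape interval.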

The migration path

One downside of this new release is that the upgrade path from the previous version is bumpy: since the storage format changed, Prometheus 2.0 cannot use the previous 1.x data files directly. In his presentation, Veeramanchaneni justified this change by saying this was consistent with the project's API stability promises: the major release was the time to "break everything we wanted to break". For those who can't afford to discard historical data, a possible workaround is to replicate the older 1.8 server to a new 2.0 replica, as the network protocols are still compatible. The older server can then be decommissioned when the retention window (which defaults to fifteen days) closes. While there is some work in progress to provide a way to convert 1.8 data storage to 2.0, new deployments should probably use the 2.0 release directly to avoid this peculiar migration pain.

Another key point in the migration guide is a change in the rules-file format. While 1.x used a custom file format, 2.0 uses YAML, matching the other Prometheus configuration files. Thankfully the promtool command handles this migration automatically. The new format also introduces rule groups, which improve control over the rules execution order. In 1.x, alerting rules were run sequentially but, in 2.0, the groups are executed sequentially and each group can have its own interval. This fixes the longstanding race conditions between dependent rules that create inconsistent results when rules would reuse the same queries. The problem should be fixed between groups, but rule authors still need to be careful of that limitation within a rule group.
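To give an idea of the shape of the new format, the Python sketch below renders a single rule group as YAML. The field names (groups, name, interval, rules, record, expr) follow the documented 2.0 rules-file layout; the group name, rule name and expression are made up for the example, and the render_rule_group() helper is purely illustrative. Real conversions are better left to promtool.

```python
import json


def render_rule_group(name, interval, rules):
    """Render one rule group in the 2.0 YAML rules-file layout.

    rules is a list of (record_name, expression) pairs. Strings are quoted
    with json.dumps, whose output is also a valid YAML double-quoted scalar.
    """
    lines = [
        "groups:",
        f"- name: {json.dumps(name)}",
        f"  interval: {json.dumps(interval)}",
        "  rules:",
    ]
    for record, expr in rules:
        lines.append(f"  - record: {json.dumps(record)}")
        lines.append(f"    expr: {json.dumps(expr)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical recording rule, evaluated every 30 seconds.
    print(render_rule_group(
        "example",
        "30s",
        [("job:http_requests:rate5m",
          "sum(rate(http_requests_total[5m])) by (job)")],
    ))
```

The per-group interval field is where each group gets its own evaluation interval, as described above.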

Remaining limitations and future

As we saw in the introductory article, Prometheus may not be suitable for all workflows because of its limited default dashboards and alerts, but also because of the lack of data-retention policies. There are, however, discussions about variable per-series retention in Prometheus and native down-sampling support in the storage engine, although this is a feature some developers are not really comfortable with. When asked on IRC, Brian Brazil, one of the lead Prometheus developers, stated that "downsampling is a very hard problem, I don't believe it should be handled in Prometheus".

Besides, it is already possible to selectively delete an old series using the new 2.0 API. But Veeramanchaneni warned that this approach "puts extra pressure on Prometheus and unless you know what you are doing, its likely that you'll end up shooting yourself in the foot". A more common approach to native archival facilities is to use recording rules to aggregate samples and collect the results in a second server with a slower sampling rate and different retention policy. And of course, the new release features external storage engines that can better support archival features. Those solutions are obviously not suitable for smaller deployments, which therefore need to make hard choices about discarding older samples or getting more disk space.
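For the selective-deletion case, a request might be assembled as in the Python sketch below, which again only builds the request and does not send it. The endpoint path and the match[] and end parameters are those documented for the admin API in later 2.x releases and may not match 2.0 exactly; the selector and the server address are hypothetical. Given the warning above, this is something to try on a test server first.

```python
import urllib.parse
import urllib.request

# Assumption: admin API path as documented for later 2.x releases.
DELETE_PATH = "/api/v1/admin/tsdb/delete_series"


def build_delete_request(selector, end_timestamp,
                         base_url="http://localhost:9090"):
    """Build (but do not send) a request deleting samples of matching series.

    selector is a series selector such as '{job="old-batch"}';
    end_timestamp is a Unix timestamp, and only samples up to that time
    are targeted.
    """
    query = urllib.parse.urlencode(
        {"match[]": selector, "end": str(end_timestamp)}
    )
    url = f"{base_url.rstrip('/')}{DELETE_PATH}?{query}"
    return urllib.request.Request(url, data=b"", method="POST")


if __name__ == "__main__":
    req = build_delete_request('{job="old-batch"}', 1512000000)
    print(req.get_method(), req.full_url)
```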

As part of the staleness improvements, Brazil also started working on "isolation" (the "I" in the ACID acronym) so that queries wouldn't see "partial scrapes". This hasn't made the cut for the 2.0 release, and is still work in progress, with some performance impacts (about 5% CPU and 10% RAM). This work would also be useful when heavy contention occurs in certain scenarios where Prometheus gets stuck on locking. Some of the performance impact could therefore be offset under heavy load.

Another performance improvement mentioned during the talk is an eventual query-engine rewrite. The current query engine can sometimes cause excessive loads for certain expensive queries, according to the Prometheus security guide. The goal would be to optimize the current engine so that those expensive queries wouldn't harm performance.

Finally, another issue I discovered is that 32-bit support is limited in Prometheus 2.0. The Debian package maintainers found that the test suite fails on i386, which led Debian to remove the package from the i386 architecture. It is currently unclear if this is a bug in Prometheus: indeed, it is strange that Debian tests actually pass on other 32-bit architectures like armel. Brazil, in the bug report, argued that "Prometheus isn't going to be very useful on a 32bit machine". The position of the project is currently that "'if it runs, it runs' but no guarantees or effort beyond that from our side".

I had the privilege to meet the Prometheus team at the conference in Austin and was happy to see different consultants and organizations working together on the project. It reminded me of my golden days in the Drupal community: different companies cooperating on the same project in a harmonious environment. If Prometheus can keep that spirit together, it will be a welcome change from the drama that affected certain monitoring software. This new Prometheus release could light a bright path for the future of monitoring in the free software world.


This article first appeared in the Linux Weekly News.
