Containers without Docker at Red Hat
This is one part of my coverage of KubeCon Austin 2017. Other articles include:
- An overview of KubeCon + CloudNativeCon
- Containers without Docker at Red Hat (this article)
- Demystifying Container Runtimes
- Monitoring with Prometheus 2.0
- Changes in Prometheus 2.0
- The cost of hosting in the cloud
The Docker (now Moby) project has done a lot to popularize containers in recent years. Along the way, though, it has generated concerns about its concentration of functionality into a single, monolithic system under the control of a single daemon running with root privileges: dockerd. Those concerns were reflected in a talk by Dan Walsh, head of the container team at Red Hat, at KubeCon + CloudNativeCon.
Walsh spoke about the work the container team is doing to replace Docker
with a set of smaller, interoperable components. His rallying cry is "no
big fat daemons" as he finds them to be contrary to the venerated Unix
philosophy.
The quest to modularize Docker
As we saw in an earlier article, the basic set of container operations is not that complicated: you need to pull a container image, create a container from the image, and start it. On top of that, you need to be able to build images and push them to a registry. Most people still use Docker for all of those steps but, as it turns out, Docker isn't the only name in town anymore: an early alternative was rkt, which led to the creation of various standards like CRI (runtime), OCI (image), and CNI (networking) that allow backends like CRI-O or Docker to interoperate with, for example, Kubernetes.
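To make those operations concrete, here is the basic life cycle expressed with the familiar Docker client; the image and container names are just placeholders:

# pull an image from a registry
docker pull fedora
# create a container from the image, then start it
docker create --name demo fedora sleep 300
docker start demo
# on the publishing side: build an image and push it to a registry
docker build -t myimage .
docker push myimage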
These standards led Red Hat to create a set of "core utils" like the CRI-O runtime that implements the parts of the standards that Kubernetes needs. But Red Hat's OpenShift project needs more than what Kubernetes provides. Developers will want to be able to build containers and push them to the registry. Those operations need a whole different bag of tricks.
It turns out that there are multiple tools to build containers right now. Apart from Docker itself, a session from Michael Ducy of Sysdig reviewed eight image builders, and that's probably not all of them. Ducy identified the ideal build tool as one that would create a minimal image in a reproducible way. A minimal image is one where there is no operating system, only the application and its essential dependencies. Ducy identified Distroless, Smith, and Source-to-Image as good tools to build minimal images, which he called "micro-containers".
A reproducible container is one that you can build multiple times and always get the same result. For that, Ducy said you have to use a "declarative" approach (as opposed to "imperative"), which is understandable given that he comes from the Chef configuration-management world. He gave the examples of Ansible Container, Habitat, nixos-container, and Smith (yes, again) as being good approaches, provided you were familiar with their domain-specific languages. He added that Habitat ships its own supervisor in its containers, which may be superfluous if you already have an external one, like systemd, Docker, or Kubernetes. To complete the list, we should mention the new BuildKit from Docker and Buildah, which is part of Red Hat's Project Atomic.
Building containers with Buildah
Buildah's name apparently comes from Walsh's colorful Boston accent; the Boston theme permeates the branding of the tool: the logo, for example, is a Boston terrier dog. This project takes a different approach from Ducy's decree: instead of enforcing a declarative configuration-management approach to containers, why not build simple tools that can be used by your favorite configuration-management tool? If you want to use regular command-line commands like cp (instead of Docker's custom COPY directive, for example), you can. But you can also use Ansible or Puppet, OS-specific or language-specific installers like APT or pip, or whatever other system to provision the content of your containers. This is what building a container looks like with regular shell commands, simply using make to install a binary inside the container:
# pull a base image, equivalent to a Dockerfile's FROM command
ctr=$(buildah from fedora)
# mount the working container's root filesystem to work on it
mnt=$(buildah mount $ctr)
cp foo $mnt
make install DESTDIR=$mnt
# then make a snapshot
buildah commit $ctr foo-image
An interesting thing with this approach is that, since you reuse normal build tools from the host environment, you can build really minimal images, because you don't need to install all the dependencies in the image. Usually, when building a container image, the build dependencies of the target application need to be installed within the container; building from source, for example, usually requires a compiler toolchain in the container, because the build is not meant to access the host environment. A lot of containers will also ship basic Unix tools like ps or bash, which are not actually necessary in a micro-container. Developers often forget to (or simply can't) remove some dependencies from the built containers; that common practice creates unnecessary overhead and attack surface.
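Pushed to its logical conclusion, the host can do the whole build and the image can start from scratch (an empty base), producing exactly the kind of micro-container Ducy described. A sketch, with the binary and image names as placeholders:

# build on the host, so the compiler toolchain never enters the image
make myapp
# start from an empty image and copy in only the result
ctr=$(buildah from scratch)
mnt=$(buildah mount $ctr)
cp myapp $mnt/
buildah commit $ctr myapp-micro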
The modular approach of Buildah means you can run at least parts of the build as non-root: the mount command still needs the CAP_SYS_ADMIN capability, but there is an issue open to resolve this. However, Buildah shares the same limitation as Docker in that it can't build containers inside containers. For Docker, you need to run the container in "privileged" mode, which is not possible in certain environments (like GitLab Continuous Integration, for example) and, even when it is possible, the configuration is messy at best.
The manual commit step allows fine-grained control over when to create container snapshots. While in a Dockerfile every line creates a new snapshot, with Buildah commit checkpoints are explicitly chosen, which reduces unnecessary snapshots and saves disk space. This is also useful to isolate sensitive material like private keys or passwords, which sometimes mistakenly end up in public images.
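As a sketch, a deploy key can be present while the build runs and deleted before the commit, so it never lands in any snapshot; the image name my-app is a placeholder:

ctr=$(buildah from fedora)
mnt=$(buildah mount $ctr)
# the key is available while the build steps run...
cp ~/.ssh/deploy_key $mnt/tmp/deploy_key
# ... build steps that need the key would go here ...
# ...and it is removed before the only snapshot is taken
rm $mnt/tmp/deploy_key
buildah commit $ctr my-app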
While Docker builds non-standard, Docker-specific images, Buildah produces standard OCI images, among other output formats. For backward compatibility, it has a command called build-using-dockerfile, or buildah bud, that parses normal Dockerfiles. Buildah has an enter command to inspect images from the inside directly and a run command to start containers on the fly. It does all this work without any "fat daemon" running in the background and uses standard tools like runc.
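A minimal sketch of those compatibility commands, assuming a Dockerfile in the current directory and reusing the working container from the earlier example:

# build an image from an existing Dockerfile, tagging the result
buildah bud -t myimage .
# run a command inside the working container on the fly
buildah run $ctr bash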
Ducy's criticism of Buildah was that it is not declarative, which makes builds less reproducible: when arbitrary shell commands are allowed, anything can happen. For example, a shell script might download arbitrary binaries, without any way of subsequently retracing where those came from, and the effects of shell commands may vary with the environment. In contrast, configuration-management systems like Puppet or Chef are designed to "converge" on a final configuration, which is more reliable, at least in theory: in practice, you can call shell commands from configuration-management systems too. Walsh, however, argued that existing configuration-management tools can be used on top of Buildah; it simply doesn't force users down that path. This fits well with the classic "mechanism, not policy" separation principle of the Unix philosophy.
At this point, Buildah is in beta and Red Hat is working on integrating it into OpenShift. I tested Buildah while writing this article and, apart from some documentation issues, it generally works reliably. It could use some polish in error handling, but it is definitely a great asset to add to your container toolbox.
Replacing the rest of the Docker command-line
Walsh continued his presentation with an overview of another project Red Hat is working on, tentatively called libpod. The name derives from the "pod" concept in Kubernetes, which is a way to group containers inside a host so they can share resources such as namespaces.
Libpod includes the kpod command to inspect and manipulate container storage directly. Walsh explained this can be useful if, for example, dockerd hangs or if a Kubernetes cluster crashes. kpod is basically an independent re-implementation of the docker command-line tool: there is a command to list running containers (kpod ps) or images (kpod images) and, in fact, there is a translation cheat sheet documenting all Docker commands along with their kpod equivalents.
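A few of those equivalents, as a sketch (the container name is a placeholder); like the rest of kpod, they work without any daemon running:

# list running containers and locally stored images
kpod ps
kpod images
# inspect a container's metadata directly from storage
kpod inspect mycontainer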
One of the nice things with the modular approach is that when you run a container with kpod run, the container is started directly as a subprocess of the current shell, instead of as a subprocess of dockerd. In theory, this allows running containers directly from systemd, removing the duplicate work dockerd is doing. It enables things like socket-activated containers, which is not straightforward to do with Docker, or even with Kubernetes, right now. In my experiments, however, I found that containers started with kpod lack some fundamental functionality, namely networking (!), although there is an issue in progress to complete that implementation.
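A quick way to see that difference, assuming kpod run accepts the same basic flags as its docker counterpart (full compatibility being the cheat sheet's stated goal):

# with Docker, the resulting container is parented to dockerd
docker run -d fedora sleep 300
# with kpod, the container is a child of the invoking shell, so an init
# system like systemd could supervise it directly
kpod run -d fedora sleep 300
pstree -p $$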
A final command we haven't covered is push. While the above commands provide a good process for working with local containers, they don't cover remote registries, which allow developers to actively collaborate on application packaging. Registries are also an essential part of a continuous-deployment framework. This is where the skopeo project comes in. Skopeo is another Atomic project that "performs various operations on container images and image repositories", according to the README file. It was originally designed to inspect the contents of container registries without actually downloading the sometimes voluminous images, as docker pull does. Docker refused patches to support inspection, suggesting the creation of a separate tool, which led to Skopeo. After pull, push was the logical next step, and Skopeo can now do a bunch of other things, like copying and converting images between registries without having to store a copy locally. Because this functionality was useful to other projects as well, a lot of the Skopeo code now lives in a reusable library called containers/image. That library is in turn used by Pivotal, Google's container-diff, kpod push, and buildah push.
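For example, Skopeo can read a remote image's metadata or copy an image between registries, all without a local daemon or a local copy; the destination registry below is a placeholder:

# inspect a remote image without pulling it
skopeo inspect docker://docker.io/library/fedora
# copy it straight to another registry
skopeo copy docker://docker.io/library/fedora docker://registry.example.com/fedora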
kpod is not directly tied to Kubernetes, so the name might change in the future, especially since Red Hat legal has not cleared the name yet. (In fact, just as this article was going to "press", the name was changed to podman.)
The team wants to implement more "pod-level" commands, which would allow operations on multiple containers, a bit like what docker-compose does. But at that level, a better tool might be Kompose, which can deploy Compose YAML files into a Kubernetes cluster. Some Docker commands (like swarm) will deliberately never be implemented, as they are best left for Kubernetes itself to handle.
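As a sketch, Kompose can either convert a Compose file into Kubernetes manifests or deploy it directly to the current cluster:

# convert a Compose file into Kubernetes manifests
kompose convert -f docker-compose.yaml
# or deploy the services straight into the cluster
kompose up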
It seems that the effort to modularize Docker that started a few years ago is finally bearing fruit. While kpod is under heavy development at this point and probably should not be used in production, the design of these different tools is certainly interesting; a lot of it is ready for development environments. Right now, the only way to install libpod is to compile it from source, but we should expect packages for your favorite distribution to come out eventually.
This article first appeared in the Linux Weekly News.