Easier container security with entitlements
This article is part of a series on KubeCon Europe 2018.
- Diversity, education, privilege and ethics in technology
- Autoscaling for Kubernetes workloads
- Updates in container isolation
- Securing the container image supply chain
- Easier container security with entitlements (this article)
During KubeCon + CloudNativeCon Europe 2018, Justin Cormack and Nassim Eddequiouaq presented a proposal to simplify the setting of security parameters for containerized applications. Containers depend on a large set of intricate security primitives that can have weird interactions. Because they are so hard to use, people often just turn the whole thing off. The goal of the proposal is to make those controls easier to understand and use; it is partly inspired by mobile apps on iOS and Android platforms, an idea that trickled back into Microsoft and Apple desktops. The time seems ripe to improve the field of container security, which is in desperate need of simpler controls.
The problem with container security
Cormack first stated that container security is too complicated. His slides stated bluntly that "unusable security is not security" and he pleaded for simpler container security mechanisms with clear guarantees for users.
"Container security" is a catchphrase that actually includes all sorts
of measures, some of which we have previously
covered. Cormack presented an
overview of those mechanisms, including capabilities, seccomp, AppArmor,
SELinux, namespaces, control groups — the list goes on. He showed how
docker run --help has a "ridiculously large number of options"; there are around one
hundred on my machine, with about fifteen just for security mechanisms.
He said that "most developers don't know how to actually apply those
mechanisms to make sure their containers are secure". In the best-case
scenario, some people may know what the options are, but in most cases
people don't actually understand each mechanism in detail.
He gave the example of capabilities; there are about forty possible
values that can be provided for the --cap-drop option, each with its
own meaning. He described some capabilities as "understandable", but
said that others end up in overly broad boxes. The kernel's data
structure limits the system to a maximum of 64 capabilities, so a bunch
of functionality was lumped together into CAP_SYS_ADMIN, he said.
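To make the 64-capability limit concrete: the kernel stores each capability set as a bitmask, one bit per capability number. The sketch below is only an illustration. The numbers are the ones defined in the kernel's linux/capability.h, but the table is a small excerpt and the helper names (cap_mask, drop, has_cap) are made up for this example.

```python
# Illustrative excerpt of capability numbers from <linux/capability.h>.
CAPS = {
    "CAP_CHOWN": 0,
    "CAP_NET_BIND_SERVICE": 10,
    "CAP_NET_ADMIN": 12,
    "CAP_NET_RAW": 13,
    "CAP_SYS_ADMIN": 21,
}

MAX_CAPS = 64  # one bit per capability in a 64-bit mask


def cap_mask(names):
    """Build a bitmask with one bit set per named capability."""
    mask = 0
    for name in names:
        bit = CAPS[name]
        if bit >= MAX_CAPS:
            raise ValueError(f"{name} does not fit in {MAX_CAPS} bits")
        mask |= 1 << bit
    return mask


def drop(mask, names):
    """Clear the bits for the named capabilities, as --cap-drop requests."""
    return mask & ~cap_mask(names)


def has_cap(mask, name):
    """Report whether the named capability is still present in the mask."""
    return bool(mask & (1 << CAPS[name]))


full = cap_mask(CAPS)  # every capability in the excerpt above
confined = drop(full, ["CAP_SYS_ADMIN", "CAP_NET_ADMIN"])
print(f"{confined:016x}")
```

With only 64 bits to hand out, adding a narrowly scoped capability for every privileged operation is not an option, which is one reason so much ended up behind a single bit.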
Cormack also talked about namespaces and seccomp. While there are fewer namespaces than capabilities, he said that "it's very unclear for a general user what their security properties are". For example, "some combinations of capabilities and namespaces will let you escape from a container, and other ones don't". He also described seccomp as a "long JSON file", as that's the way Kubernetes configures it. He said those files could "usefully be even more complicated", even though they are already "very difficult to write".
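For a sense of what such a file looks like, the sketch below generates a tiny deny-by-default profile in the JSON layout that Docker accepts through --security-opt seccomp=. The three-syscall allow list is an assumption chosen for brevity, and make_profile is a hypothetical helper. A usable profile has to list hundreds of system calls, which is the usability problem being described.

```python
import json


def make_profile(allowed):
    """Return a deny-by-default seccomp profile allowing only `allowed`."""
    return {
        # Any system call not listed below fails with an errno.
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {"names": sorted(allowed), "action": "SCMP_ACT_ALLOW"},
        ],
    }


# Assumed allow list, far too small for any real program.
profile = make_profile({"read", "write", "exit_group"})
print(json.dumps(profile, indent=2))
```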
Cormack stopped his enumeration there, but the same applies to the other
mechanisms. He said that while developers could sit down and write those
policies for their application by hand, it's a real mess and makes their
heads explode. So instead developers run their containers in
--privileged mode. It works, but it disables all the nice security
mechanisms that the container abstraction provides. This is why
"containers do not contain", as Dan Walsh famously quipped.
Introducing entitlements
There must be a better way. Eddequiouaq proposed this simple idea: "provide something humans can actually understand without diving into code or possibly even without reading documentation". The solution proposed by the Docker security team is "entitlements": the ability for users to choose simple permissions on the command line. Eddequiouaq said that application users and developers alike don't need to understand the low-level security mechanisms or how they interact within the kernel; "people don't care about that, they want to make sure their app is secure."
Entitlements divide resources into meaningful domains like "network",
"security", or "host resources" (like devices). Behind the scenes,
Docker translates those into whatever security mechanisms are
available. This implies that the actual mechanism deployed will vary
between runtimes, depending on the implementation. For example, a
"confined" network access might mean a seccomp filter blocking all
networking-related system calls except socket(AF_UNIX|AF_LOCAL), along
with dropping network-related capabilities. AppArmor would deny network
access on some platforms while SELinux would do similar enforcement on others.
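The translation step can be pictured as a lookup from an entitlement name to a bundle of low-level settings. The sketch below is hypothetical: "network.confined" and "network.admin" follow the naming used in the talk, but "host.devices.none", the table contents, and the Mechanisms type are assumptions made for illustration, not the Docker implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Mechanisms:
    """Low-level settings a runtime would apply (hypothetical structure)."""
    cap_drop: set = field(default_factory=set)
    seccomp_deny: set = field(default_factory=set)

    def merge(self, other):
        """Combine two sets of restrictions; restrictions only accumulate."""
        return Mechanisms(self.cap_drop | other.cap_drop,
                          self.seccomp_deny | other.seccomp_deny)


# Assumed mapping for illustration; a real runtime would pick whatever
# its platform offers (seccomp, AppArmor, SELinux, and so on).
TABLE = {
    "network.confined": Mechanisms(
        cap_drop={"CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_NET_BIND_SERVICE"},
        seccomp_deny={"connect", "bind", "listen", "accept"},
    ),
    "network.admin": Mechanisms(),  # nothing network-related is removed
    "host.devices.none": Mechanisms(cap_drop={"CAP_MKNOD", "CAP_SYS_RAWIO"}),
}


def translate(entitlements):
    """Combine the restrictions implied by each requested entitlement."""
    result = Mechanisms()
    for name in entitlements:
        if name not in TABLE:
            raise KeyError(f"unknown entitlement: {name}")
        result = result.merge(TABLE[name])
    return result


applied = translate(["network.confined", "host.devices.none"])
print(sorted(applied.cap_drop))
```

Because the caller only names entitlements, a different runtime could swap in its own table without the image changing.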
Eddequiouaq said the complexity of implementing those mechanisms is the
responsibility of platform developers. Image developers can ship
entitlement lists along with container images created with a regular
docker build, and sign the whole bundle with docker trust. Because
entitlements do not specify explicit low-level mechanisms, the resulting
image is portable to different runtimes without change. Such portability
helps Kubernetes on non-Linux platforms do its job.
Entitlements shift the responsibility for configuring sandboxing
environments to image developers, but also empower them to deliver
security mechanisms directly to end users. Developers are the ones with
the best knowledge about what their applications should or should not be
doing. Image end-users, in turn, benefit from verifiable security
properties delivered by the bundles and the expertise of image
developers when they docker pull and run those images.
Eddequiouaq gave a demo of the community's nemesis: Docker inside Docker
(DinD). He picked that use case because it requires a lot of privileges,
which usually means using the dreaded --privileged flag. With the
entitlements patch, he was able to run DinD with network.admin,
security.admin, and host.devices.admin, which looks like
--privileged, but actually means some protections are still in place.
According to Eddequiouaq, "everything works and we didn't have to
disable all the seccomp and AppArmor profiles". He also gave a demo of
how to build an image and demonstrated how docker inspect shows the
entitlements bundled inside the image. With such an image, docker run
starts a DinD image without any special flags. That requires a way to
trust the content publisher because suddenly images can elevate their
own privileges without the caller specifying anything on the Docker
command line.
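One way to picture that trust requirement: entitlements bundled in an image are honored only when the image's signer is on a list the operator trusts, and otherwise only what was passed on the command line applies. This is a hypothetical sketch. The Image type, its signer field, and the TRUSTED_SIGNERS set are all assumptions, and no real signature verification is performed.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Hypothetical operator configuration: publishers whose bundles are honored.
TRUSTED_SIGNERS = {"example-publisher"}


@dataclass(frozen=True)
class Image:
    name: str
    signer: Optional[str]        # None stands for an unsigned image
    entitlements: FrozenSet[str]


def effective_entitlements(image, cli_entitlements=frozenset()):
    """Grant bundled entitlements only when the signer is trusted."""
    granted = set(cli_entitlements)
    if image.signer in TRUSTED_SIGNERS:
        granted |= image.entitlements
    return granted


dind = Image(
    name="dind",
    signer="example-publisher",
    entitlements=frozenset(
        {"network.admin", "security.admin", "host.devices.admin"}
    ),
)
unsigned = Image(name="dind", signer=None, entitlements=dind.entitlements)

print(sorted(effective_entitlements(dind)))
print(sorted(effective_entitlements(unsigned)))
```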
Goals and future
The specification aims to provide the best user experience possible, so
that people actually start using the security mechanisms provided by the
platforms instead of opting out of security configurations when they get
a "permission denied" error. Eddequiouaq said that Docker eventually
wants to "ditch the --privileged flag because it is really a bad
habit". Instead, applications should run with the least privileges they
need. He said that "this is not the case; currently, everyone works with
defaults that work with 95% of the applications out there." Those Docker
defaults, he said, provide a "way too big attack surface".
Eddequiouaq opened the door for developers to define custom entitlements
because "it's hard to come up with a set that will cover all needs". One
way the team thought of dealing with that uncertainty is to have
versions of the specification, but it is unclear how that would work in
practice. Would the version be in the entitlement labels (e.g.
network-v1.admin), or out of band?
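If the version did live in the label, a consumer would need to split it back out. The sketch below parses labels shaped like the network-v1.admin example; the exact grammar (an optional "-vN" suffix on the domain, with the level after the last dot) is an assumption, since the specification had not settled on one.

```python
import re
from typing import NamedTuple, Optional


class Entitlement(NamedTuple):
    domain: str
    version: Optional[int]  # None when the label carries no version
    level: str


def parse(label):
    """Split 'domain[-vN].level' into its parts (assumed grammar)."""
    head, sep, level = label.rpartition(".")
    if not sep or not head or not level:
        raise ValueError(f"malformed entitlement label: {label!r}")
    match = re.fullmatch(r"(.+?)-v(\d+)", head)
    if match:
        return Entitlement(match.group(1), int(match.group(2)), level)
    return Entitlement(head, None, level)


print(parse("network-v1.admin"))
print(parse("host.devices.admin"))
```

An out-of-band version would keep labels stable at the cost of a second field that every consumer has to read and check.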
Another feature proposed is the control of API access and service-to-service communication in the security profile. This is something that's actually available on phones, where an app can only talk with a specific set of services. But that is also relevant to containers in Kubernetes clusters as administrators often need to restrict network access with more granularity than the "open/filter/close" options. An example of such policy could allow the "web" container to talk with the "database" container, although it might be difficult to specify such high-level policies in practice.
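A policy of that kind can be sketched as a default-deny allow list keyed by the calling service. The service names come from the example above; the POLICY table and the may_connect function are hypothetical and do not correspond to any Docker or Kubernetes API.

```python
# Hypothetical policy: each source service lists the services it may reach.
POLICY = {
    "web": {"database"},
    "database": set(),
}


def may_connect(source, destination):
    """Default deny: only explicitly listed pairs are allowed."""
    return destination in POLICY.get(source, set())


print(may_connect("web", "database"))
print(may_connect("database", "web"))
```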
While entitlements are now implemented in Docker as a proof of concept,
Kubernetes has the same usability issues as Docker so the ultimate goal
is to get entitlements working in Kubernetes runtimes directly. Indeed,
its PodSecurityPolicy maps (almost) one-to-one with the Docker
security flags. But as we have previously reported, another challenge in
Kubernetes security is that the security models of Kubernetes and Docker
are not exactly identical.
Eddequiouaq said that entitlements could help share best security
policies for a pod in Kubernetes. He proposed that such configuration
would happen through the SecurityContext object.
Another way would be an admission controller that would avoid conflicts
between the entitlements in the image and existing SecurityContext
profiles already configured in the cluster. There are two possible
approaches in that case: the rules from the entitlements could expand
the existing configuration or restrict it where the existing
configuration becomes a default. The problem here is that the pod's
SecurityContext already provides a widely deployed way to configure
security mechanisms, even if it's not portable or easy to share, so the
proposal shouldn't break existing configurations. There is work in
progress in Docker to allow inheriting entitlements within a Dockerfile.
Eddequiouaq proposed that Kubernetes should implement a simple mechanism
to inherit entitlements from images in the admission controller.
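The two approaches can be sketched as set operations on a list of allowed capabilities: "expand" takes the union of what the cluster already allows and what the image asks for, while "restrict" treats the cluster configuration as a ceiling and keeps only the intersection. Modeling a security context as a set of capability names is a simplification made for this example, and the function names are hypothetical.

```python
def expand(cluster_allowed, image_requested):
    """Image entitlements add to whatever the cluster already allows."""
    return set(cluster_allowed) | set(image_requested)


def restrict(cluster_allowed, image_requested):
    """Cluster configuration is the ceiling; the image can only narrow it."""
    return set(cluster_allowed) & set(image_requested)


# Assumed inputs for illustration.
cluster = {"CAP_CHOWN", "CAP_NET_BIND_SERVICE"}
image = {"CAP_NET_BIND_SERVICE", "CAP_NET_ADMIN"}

print(sorted(expand(cluster, image)))
print(sorted(restrict(cluster, image)))
```

Under "expand" an image can gain CAP_NET_ADMIN that the cluster never granted, which is why that mode depends on trusting the image publisher; "restrict" cannot widen an existing configuration.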
The Docker security team wants to create a "widely adopted standard" supported by Docker swarm, Kubernetes, or any container scheduler. But it's still unclear how deep into the Kubernetes stack entitlements belong. In the team's current implementation, Docker translates entitlements into the security mechanisms right before calling its runtime (containerd), but it might be possible to push the entitlements concept straight into the runtime itself, as it knows best how the platform operates.
Some readers might also notice fundamental similarities between this and
other mechanisms such as OpenBSD's pledge(), which made me wonder if
entitlements belong in user space in the first place. Cormack observed
that seccomp was such a "pain to work with to do complicated policies".
He said that having eBPF seccomp filters would make it easier to deal
with conflicts between policies and also mentioned the work done on the
Checmate and Landlock security modules as
interesting avenues to explore. It seems that none of those kernel
mechanisms are ready for prime time, at least not to the point that
Docker can use them in production. Eddequiouaq said that the proposal
was open to changes and discussion so this is all work in progress at
this stage. The next steps are to make a proposal to the Kubernetes
community before working on an actual implementation outside of Docker.
I found the core idea of protecting users from all the complicated stuff in container security interesting. It is a recurring theme in container security; we've previously discussed proposals to add container identifiers in the kernel directly, for example. Everyone knows security is sensitive and important in Kubernetes, yet doing it correctly is hard. This is a recipe for disaster, which has struck in high-profile cases recently. Hopefully having such easier and cleaner mechanisms will help users, developers, and administrators alike.
A YouTube video and slides [PDF] of the talk are available.
This article first appeared in the Linux Weekly News.