A report from Netconf: Day 1
This article is part of a larger series about NetConf/NetDev 2.1.
- A report from Netconf: Day 1
- A report from Netconf: Day 2
- New approaches to network fast paths
- The rise of Linux-based networking hardware
As is becoming traditional, twice a year the kernel networking community meets in a two-stage conference: an invite-only, informal, two-day plenary session called Netconf, held in Toronto this year, and a more conventional one-track conference open to the public called Netdev. I was invited to cover both conferences this year, given that Netdev was in Montreal (my hometown), and was happy to meet the crew of developers that maintain the network stack of the Linux kernel.
This article covers the first day of the conference, which consisted of around 25 Linux developers meeting under the direction of David Miller, the kernel's networking subsystem maintainer. Netconf has no formal sessions; although some people presented slides, interruptions are frequent (indeed, encouraged) and the focus is on hashing out issues that are blocked on the mailing list and getting suggestions, ideas, solutions, and feedback from peers.
Removing ndo_select_queue()
One of the first discussions that elicited a significant debate was the ndo_select_queue() function, a key component of the transmit path that determines which queue is used when sending packets on a network interface (see netdev_pick_tx() and friends). The general question was whether the use of ndo_select_queue() in drivers is a good idea. Alexander Duyck explained that Intel people were considering using ndo_select_queue() for receive/transmit queue matching. Intel drivers do not currently use the hook provided by the Linux kernel, and it turns out no one is happy with ndo_select_queue(): the heuristics it uses don't really please anyone. The consensus (including from Duyck himself) seemed to be that it should just not be used anymore, or at least not used for that specific purpose.
The discussion turned toward the wireless network stack, which uses it extensively, but for other purposes. Johannes Berg explained that the wireless stack uses ndo_select_queue() for traffic classification, for example to get voice traffic through even if the best-effort queue is backed up. The wireless stack could stop using it by doing flow control completely inside the wireless stack, which already uses the fq_codel queue-management mechanism for other purposes, so porting away from ndo_select_queue() seems possible there.
The problem then becomes how to update all the drivers to change that behavior, which would be a lot of work. Still, it seems people are moving away from a generic ndo_select_queue() interface to stack-specific or even driver-specific (in the case of Intel) queue-management interfaces.
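For readers unfamiliar with the hook, here is a minimal sketch of what a driver-side ndo_select_queue() implementation looks like. It assumes the 4.x-era prototype (the signature has varied across kernel versions), and the driver name and queue policy are purely hypothetical:

```c
#include <linux/netdevice.h>
#include <linux/pkt_sched.h>

/* Hypothetical policy: reserve queue 0 for control traffic and let
 * the stack's default hashing spread everything else.  The prototype
 * matches 4.x-era kernels; it has changed in later releases. */
static u16 foo_select_queue(struct net_device *dev, struct sk_buff *skb,
			    void *accel_priv, select_queue_fallback_t fallback)
{
	if (skb->priority == TC_PRIO_CONTROL)
		return 0;

	/* Defer to the generic netdev_pick_tx() logic. */
	return fallback(dev, skb);
}

static const struct net_device_ops foo_netdev_ops = {
	.ndo_select_queue	= foo_select_queue,
	/* ... other operations elided ... */
};
```

The fallback callback is what makes the hook optional in practice: a driver that cannot classify a packet simply defers to the stack's default hashing.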
refcount_t followup
There was a followup discussion on the integration of the refcount_t type into the network stack, which we covered recently. This type is meant to be an in-kernel defense against exploits based on overflowing or underflowing an object's reference count.
The consensus seems to be that having refcount_t used for debugging is acceptable, but it cannot be enabled by default. An issue that was identified is that the networking developers are fairly sure that introducing refcount_t would have a severe impact on performance, but they do not have benchmarks to prove it, something Miller identified as a problem that needs to be worked on. Miller then expressed some openness to the idea of having it as a kernel configuration option.
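For context, here is a minimal sketch of the conversion pattern under discussion, with a hypothetical foo object standing in for a real networking structure:

```c
#include <linux/refcount.h>
#include <linux/slab.h>

/* Hypothetical object showing the conversion pattern: the reference
 * count is a refcount_t instead of a raw atomic_t, so increments
 * saturate (with a WARN) rather than silently wrapping around. */
struct foo {
	refcount_t refcnt;
	/* ... payload ... */
};

static struct foo *foo_alloc(void)
{
	struct foo *f = kzalloc(sizeof(*f), GFP_KERNEL);

	if (f)
		refcount_set(&f->refcnt, 1);	/* one reference: the caller */
	return f;
}

static void foo_get(struct foo *f)
{
	refcount_inc(&f->refcnt);	/* saturates instead of overflowing */
}

static void foo_put(struct foo *f)
{
	if (refcount_dec_and_test(&f->refcnt))
		kfree(f);		/* last reference gone */
}
```

The extra checking inside refcount_inc() and friends is exactly where the feared performance cost on hot paths would come from.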
A similar discussion happened, on the second day, regarding the KASan memory error detector, which was covered when it was introduced in 2014. Eric Dumazet warned that there could be a lot of issues that cannot be detected by KASan because of the way the network stack often bypasses regular memory-allocation routines for performance reasons. He also noted that this can sometimes mean the stack may go over the regular 10% memory limit (the tcp_mem parameter, described in the tcp(7) man page) for certain operations, especially when rebuilding out-of-order packets with lots of parallel TCP connections.
Therefore it was proposed that these special memory-recycling tricks could be optionally disabled, at run time or compile time, to allow proper memory tracking. Dumazet argued this was a situation similar to refcount_t in that we need a way to disable the high-performance shortcuts in order to make the network stack easier to debug with KASan.
The problem with optional parameters is that they are often disabled in production or even by default, which, in turn, means that critical bugs cannot actually be found because the code paths are not tested. When I asked Dumazet about this, he explained that Google performs integration testing of new kernels before putting them in production, and those toggles could be enabled there to find and fix those bugs. But he agreed that certain code paths are then not tested until the code gets deployed in production.
So it seems the status quo remains: security folks want to improve the reliability of the kernel, but the network folks can't afford the performance cost. Yet it was clear in the discussions that the team cares about security issues and wants those issues to be fixed; the impact of some of the solutions is just too big.
Lightweight wireless management packet access
Berg explained that some users need high-performance access to certain management frames in the wireless stack and wondered how best to expose those to user space. The wireless stack already allows users to clone a network interface in "monitor" mode, but this has a big performance cost, as the radiotap header needs to be constructed from scratch and the packet header needs to be copied. As wireless improves and bandwidth rises to gigabit levels, this can become a significant bottleneck for packet sniffers or reporting software that need to know precisely what's going on over the air, outside of regular access-point client operation.
It seems the proper way to do this is with an eBPF program. As Miller summarized it, just add another API call that allows loading a BPF program into the kernel; those users can then use a BPF filtering point to get the statistics they need. This will require an extra hook in the wireless stack, but that seems to be the path that will be taken to implement this feature.
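The new hook did not exist at the time of the discussion, so there is nothing concrete to show there. For contrast, here is a rough sketch of the monitor-mode approach whose per-packet cost was being criticized: a raw packet socket bound to a monitor-mode interface with a classic BPF filter attached, which means every frame gets a freshly built radiotap header and a copy to user space. The accept-everything filter below is a placeholder for a real management-frame match.

```c
#include <arpa/inet.h>
#include <linux/filter.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a raw packet socket bound to a monitor-mode interface and
 * attach a classic BPF filter.  The single "return 262144" instruction
 * accepts every frame up to 256KB; a real sniffer would match the
 * 802.11 frame-control field to keep only management frames. */
static int open_mgmt_sniffer(const char *ifname)
{
	struct sock_filter code[] = {
		{ BPF_RET | BPF_K, 0, 0, 262144 },	/* accept all */
	};
	struct sock_fprog prog = { .len = 1, .filter = code };
	struct sockaddr_ll sll = {
		.sll_family   = AF_PACKET,
		.sll_protocol = htons(ETH_P_ALL),
		.sll_ifindex  = if_nametoindex(ifname),
	};
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

	if (fd < 0)
		return -1;
	if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0 ||
	    setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
		       &prog, sizeof(prog)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Everything the filter does not discard is still copied out of the kernel, which is the overhead an in-kernel eBPF hook would avoid.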
VLAN 0 inconsistencies
Hannes Frederic Sowa brought up the seemingly innocuous question of "how do we handle VLAN 0?" In theory, VLAN 0 means "no VLAN". But the Linux kernel currently handles this differently depending on whether the VLAN module is loaded and whether a VLAN 0 interface was created. Sometimes the VLAN tag is stripped, sometimes not.
It turns out that the semantics were accidentally changed the last time this code was touched: the behavior was originally correct but is now broken. Sowa therefore got the go-ahead to fix it and make the behavior consistent again.
Loopy fun
Then came the turn of Jamal Hadi Salim, the maintainer of the kernel's traffic-control (tc) subsystem. The first issue he brought up is a problem in the tc REDIRECT action that can create infinite loops within the kernel. The problem can be easily alleviated when loops are created on the same interface: checks can be added that just drop packets coming from the same device and rate-limit logging to avoid a denial-of-service (DoS) condition.
The more serious problem occurs when a packet is forwarded from (say) interface eth0 to eth1, which then promptly redirects it from eth1 back to eth0. Obviously, this kind of problem can only be created by a user with root access so, at first glance, those issues don't seem that serious: admins can shoot themselves in the foot, so what?
But things become a little more serious when you consider the container case, where an untrusted user has root access inside a container that is supposed to be constrained by resource limits. Such a loop could allow this user to mount an effective DoS attack against a whole group of containers running on the same machine. Even worse, the endless loop could turn into a deadlock in certain scenarios, as the kernel could try to transmit the packet on the same device it originated from and block, progressively filling the queues and eventually completely breaking network access. Florian Westphal argued that a container can already create DoS conditions, for example by doing a ping flood.
According to Salim, this whole problem was created when two bits used for tracking such packets were reclaimed from the skb structure used to represent packets in the kernel. Those bits were a simple TTL (time-to-live) field that was incremented on each loop, with the packet dropped once a predetermined limit was reached, breaking infinite loops. Salim asked everyone whether this should be fixed or whether we should just forget about the issue and move on.
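As a rough reconstruction of the scheme Salim described (the field and limit below are hypothetical; the real bits no longer exist in struct sk_buff), the loop-breaking check would have looked something like this:

```c
#include <linux/errno.h>
#include <linux/skbuff.h>

#define TC_REDIRECT_TTL_MAX	3	/* what two bits can count to */

/* Hypothetical reconstruction of the reclaimed loop-breaker: bump a
 * tiny per-packet TTL on every tc redirect and drop the packet once
 * the limit is reached, so a redirect loop dies after a few hops.
 * skb->redirect_ttl does not exist in today's struct sk_buff. */
static int tc_redirect_loop_check(struct sk_buff *skb)
{
	if (skb->redirect_ttl >= TC_REDIRECT_TTL_MAX)
		return -ELOOP;		/* probable loop: drop the packet */

	skb->redirect_ttl++;
	return 0;
}
```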
Miller proposed keeping a one-behind state for the packet, fixing the simplest case (two interfaces). The general case, however, would require a bitmap of all the interfaces to be scanned, which would impose a large overhead. Miller said an attempt to fix this should somehow be made. The root of the problem is that the network maintainers are trying to reduce the size of the skb structure, because it is used in many critical paths of the network stack. Salim's position is that, without the TTL field, there is no way to fix the general case here, and this constitutes a security issue. So either the bits need to be brought back, or we need to live with the inherent DoS threat.
Dumping large statistics sets
Another issue Salim brought up was the question of how to export large statistics sets from the kernel. It turns out that some use cases may end up dumping a lot of data. Salim mentioned a real-world tc use case that calls for reading six million entries. The current netlink-based API provides a way to get only 20 entries at a time, which means it takes forever to dump the state of all those policy actions. Salim has a patch that changes the dump size to be eight times NLMSG_GOODSIZE, which already improves performance by an order of magnitude, although there are issues with checking the user-space buffer size there.
But a more complete solution is needed. What Salim proposed was a way to ask only for the states that changed since the last dump was requested. He has a patch to add a last_access field to the netlink_callback structure used by netlink_dump() to output data; that raised the question of how to actually use that field. Since Salim fetches the data every five seconds, he figured he could just tell the kernel to return all the nodes that changed in that period. But then, if a dump takes more than five seconds to complete, the next dump may be missing states that changed during the extra delay. An alternative mechanism would be for the user-space utility to keep the time stamp it requested and use that as the delta for the next dump.
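To make the idea concrete, here is a sketch of how a netlink dump callback could use such a timestamp to skip unchanged entries. The timestamp plumbing corresponds to Salim's proposed patch rather than anything in mainline, and everything named foo_* is invented for the example:

```c
#include <linux/jiffies.h>
#include <linux/list.h>
#include <linux/netlink.h>
#include <linux/skbuff.h>

/* Hypothetical stats object; only the fields the dump needs. */
struct foo_entry {
	struct list_head list;
	unsigned long last_changed;	/* jiffies of last update */
	/* ... counters ... */
};

static LIST_HEAD(foo_entries);
static int foo_fill_entry(struct sk_buff *skb, struct foo_entry *e);

/* Incremental dump: entries untouched since the caller-supplied
 * timestamp are skipped, so a periodic reader only pays for what
 * actually changed.  cb->args[] is the standard netlink resume state. */
static int foo_dump(struct sk_buff *skb, struct netlink_callback *cb)
{
	unsigned long since = cb->args[0];	/* "changed since" stamp */
	long idx = 0, start = cb->args[1];	/* resume point */
	struct foo_entry *e;

	list_for_each_entry(e, &foo_entries, list) {
		if (idx >= start &&
		    !time_before(e->last_changed, since) &&
		    foo_fill_entry(skb, e) < 0)
			break;		/* skb full: resume here next pass */
		idx++;
	}
	cb->args[1] = idx;
	return skb->len;	/* zero tells netlink the dump is done */
}
```

Note that this only sidesteps, rather than solves, the consistency problem discussed below: entries can still change between passes of the same dump.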
It turns out this is a larger problem than just tc. Dumazet mentioned this was an issue with fq_codel classes: he would even like to be able to dump those statistics more often than every five seconds. Roopa Prabhu mentioned that Cumulus also has similar problems dumping stats from bridges, so clearly a more generic solution is needed here. There is, however, a fundamental problem with dumping large statistics sets from the kernel: those statistics are constantly changing while the dump is created and, unless versioning or locking mechanisms are used (which would slow things down), the data returned is bound to be only an approximation of reality. Salim promised to send a set of RFC patches to further the discussion of this issue but, during the following Netdev conference, Berg published a patch to fix this ten-year-old issue, which brought cheers from the audience.
The author would like to thank the Netconf and Netdev organizers for travel to, and hosting assistance in, Toronto. Many thanks to Berg, Dumazet, Salim, and Sowa for their time taken for a technical review of this article.
Note: this article first appeared in the Linux Weekly News.