A report from Netconf: Day 2
This article is part of a larger series about NetConf/NetDev 2.1.
- A report from Netconf: Day 1
- A report from Netconf: Day 2
- New approaches to network fast paths
- The rise of Linux-based networking hardware
This article covers the second day of the informal Netconf discussions, held on April 4, 2017. Topics discussed this day included the binding of sockets in VRF, identification of eBPF programs, inconsistencies between IPv4 and IPv6, changes to data-center hardware, and more. (See this article for coverage from the first day of discussions.)
How to bind to specific sockets in VRF
One of the first presentations was from David Ahern of Cumulus, who presented a few interesting questions for the audience. His first was the problem of binding sockets to a given interface. Right now, there are four different ways this can be done:
- the old SO_BINDTODEVICE generic socket option (see socket(7))
- the IP_PKTINFO IP-specific socket option (see ip(7)), introduced in Linux 2.2
- the IP_UNICAST_IF flag, introduced in Linux 3.3 for WINE
- the IPv6 scope ID suffix, part of the IPv6 addressing standard
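To give a feel for how differently these four mechanisms are spelled from user space, here is a minimal Python sketch that prepares the argument each one expects. The helper names are invented for this illustration. The numeric fallbacks (25, 8 and 50) are the Linux header values on common architectures and are only used when Python's socket module does not export the constant. The apply_both() helper is not called here; depending on the kernel version, SO_BINDTODEVICE may require extra privileges.

```python
import socket
import struct

# Fallback values are the Linux header values on common architectures;
# they are used only when Python's socket module lacks the constant.
SO_BINDTODEVICE = getattr(socket, "SO_BINDTODEVICE", 25)
IP_PKTINFO = getattr(socket, "IP_PKTINFO", 8)
IP_UNICAST_IF = getattr(socket, "IP_UNICAST_IF", 50)


def bindtodevice_arg(ifname):
    """Option value for SO_BINDTODEVICE: the interface name (NUL-terminated here)."""
    return ifname.encode() + b"\0"


def pktinfo_cmsg(ifindex):
    """Ancillary message for sendmsg(): a struct in_pktinfo naming the interface."""
    # int ipi_ifindex, then two IPv4 addresses left as 0.0.0.0 in this sketch.
    in_pktinfo = struct.pack("=i4s4s", ifindex, bytes(4), bytes(4))
    return (socket.IPPROTO_IP, IP_PKTINFO, in_pktinfo)


def unicast_if_arg(ifindex):
    """Option value for IP_UNICAST_IF: the interface index in network byte order."""
    return struct.pack("!I", ifindex)


def scoped_v6_sockaddr(addr, port, ifindex):
    """IPv6 sockaddr tuple; the scope ID is what a suffix like 'fe80::1%eth0' encodes."""
    return (addr, port, 0, ifindex)


def apply_both(sock, ifname):
    """Set two of the options on one IPv4 socket (hypothetical helper, not run here)."""
    ifindex = socket.if_nametoindex(ifname)
    sock.setsockopt(socket.SOL_SOCKET, SO_BINDTODEVICE, bindtodevice_arg(ifname))
    sock.setsockopt(socket.IPPROTO_IP, IP_UNICAST_IF, unicast_if_arg(ifindex))
```

Each mechanism takes its interface in a different form (a name, a native-endian index inside a structure, a network-order index, and a field of the address itself), which goes some way toward explaining why conflicting settings are easy to create.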
So there are too many ways of doing the same thing, a problem that cannot really be fixed without breaking ABI compatibility. Even worse, the kernel does not report conflicts between those options: a user can set socket flags so that some silently override others, with no checks made and no errors reported. It was agreed that the user should at least get some notification of conflicting settings.
Furthermore, binding sockets to a specific VRF (Virtual Routing and Forwarding) device is not currently possible, so Ahern asked what the best way to do this would be, considering the many options available. A use case example is a UDP multicast socket that could be bound to a specific interface within a VRF.
This is an old problem: Tom Herbert explained that there were previous discussions about making the bind() system call more programmable so that, for example, you could bind() a UDP socket to a discrete list of IP addresses or a subnet. So he identified this issue as a broader problem that should be addressed by making the interfaces more generic.
Ahern explained that it is currently possible to bind sockets to the slave device of a VRF even though that should not be allowed. He also raised the question of how the kernel should tell which socket should be selected for incoming packets. Right now, there is a scoring mechanism for UDP sockets, but that cannot be used directly in this more general case.
David Miller said that there are already different ways of specifying scope: there is the VRF layer and the namespace ("netns") layer. A long time ago, Miller reluctantly accepted the addition of netns keys everywhere, swallowing the performance cost to gain flexibility. He argued that a new key should not be added and instead existing infrastructure should be reused. Herbert argued this was exactly the reason why this should be simplified: "if we don't answer the question, people will keep on trying this". For example, one can use a VRF to limit listening addresses, but it gets complicated if we need a device for every address. It seems the consensus evolved towards using IP_UNICAST_IF, added back in 2012, which is accessible to non-root users. It is currently limited to UDP and RAW sockets, but it could be extended for TCP.
XDP and eBPF program identification
Ahern then turned to the problem of extracting BPF programs from the kernel. He gave the example of a simple cBPF (classic BPF) filter that checks for ARP packets. If the filter is read back from the kernel, the user gets a blob of binary data, which is hard to interpret. There is a kernel verifier that can show C-like output, but that is also difficult to interpret. Ahern then added annotations to his slide that showed what the original program actually does, which was a good demonstration of why such a feature is needed.
Ahern explained that, at least for cBPF, it should be possible to recover the original plaintext, or at least something close to the original program. A first step would be to replace known constants (like 0x806 for ARP). Even with eBPF, it should be possible to improve the output. Alexei Starovoitov, the BPF maintainer, explained that it might make sense to start by returning information about the maps used by an eBPF program. Then more complex data structures could be inspected once we know their type.
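The "replace known constants" step can be sketched in a few lines of Python. This is only an illustration, not the tool discussed at Netconf: the annotate() function and its lookup tables are invented here, and it understands just the three classic BPF opcodes needed for a four-instruction ARP filter of the kind Ahern described. The instruction layout (16-bit code, two 8-bit jump offsets, 32-bit constant) is that of struct sock_filter.

```python
import struct

# Classic BPF opcodes used by this example (values from the Linux BPF headers).
LDH_ABS = 0x28  # BPF_LD | BPF_H | BPF_ABS: load 16 bits at a fixed packet offset
JEQ_K = 0x15    # BPF_JMP | BPF_JEQ | BPF_K: compare the accumulator to a constant
RET_K = 0x06    # BPF_RET | BPF_K: return a constant (bytes of packet to accept)

# Illustrative lookup tables; a real tool would need far more entries.
OFFSETS = {12: "ethertype"}
ETHERTYPES = {0x0800: "ETH_P_IP", 0x0806: "ETH_P_ARP", 0x86DD: "ETH_P_IPV6"}


def annotate(blob):
    """Turn a packed array of struct sock_filter into annotated text lines."""
    lines = []
    last_load = None
    for pc, (code, jt, jf, k) in enumerate(struct.iter_unpack("=HBBI", blob)):
        if code == LDH_ABS:
            last_load = OFFSETS.get(k)
            text = "ldh [%d]" % k
            note = last_load
        elif code == JEQ_K:
            # Jump offsets are relative to the next instruction.
            text = "jeq #0x%x jt %d jf %d" % (k, pc + 1 + jt, pc + 1 + jf)
            note = ETHERTYPES.get(k) if last_load == "ethertype" else None
        elif code == RET_K:
            text = "ret #%d" % k
            note = "accept" if k else "drop"
        else:
            text = "unknown 0x%02x" % code
            note = None
        lines.append("(%03d) %s%s" % (pc, text, "  ; " + note if note else ""))
    return lines


# The binary blob a user would get back for an "is this ARP?" filter.
ARP_FILTER = b"".join(
    struct.pack("=HBBI", *insn)
    for insn in [(0x28, 0, 0, 12), (0x15, 0, 1, 0x0806),
                 (0x06, 0, 0, 0x40000), (0x06, 0, 0, 0)]
)

for line in annotate(ARP_FILTER):
    print(line)
```

Even this naive substitution turns the comparison against 0x806 into a line that names ETH_P_ARP, which is the kind of readability gain Ahern was after. Doing the same for eBPF is harder because the meaning of a constant often depends on map and structure types, hence Starovoitov's suggestion to start with map information.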
The first priority is to get simple debugging tools working but, in the long term, the goal is a full decompiler that can reconstruct instructions into a human-readable program. The question that remains is how to return this data. Ahern explained that right now the bpf() system call copies the data to a different file descriptor, but it could just fill in a buffer. Starovoitov argued for a file descriptor; that would allow the kernel to stream everything through the same descriptor instead of having many attach points. Netlink cannot be used for this because of its asynchronous nature.
A similar issue regarding the way we identify express data path (XDP) programs (which are also written in BPF) was raised by Daniel Borkmann from Covalent. Miller explained that users will want ways to figure out which XDP program was installed, so XDP needs an introspection mechanism. We currently have SHA-1 identifiers that can be internally used to tell which binary is currently loaded but those are not exposed to user space. Starovoitov mentioned it is now just a boolean that shows if a program is loaded or not.
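To illustrate why an exposed identifier would help, here is a small Python sketch of the user-space side: derive a short digest from a program's instruction bytes and look it up in a registry of known builds. Everything here is hypothetical, including the program_tag() and identify() names and the 16-character truncation, and the digest is not claimed to match whatever the kernel computes internally.

```python
import hashlib


def program_tag(insns):
    """Short hex identifier derived from a program's raw instruction bytes."""
    return hashlib.sha1(insns).hexdigest()[:16]


def identify(insns, known_builds):
    """Map instruction bytes to a build name, or None if the program is unknown."""
    return known_builds.get(program_tag(insns))


# Hypothetical registry, filled in at build time by whoever deploys the programs.
drop_all = bytes.fromhex("b700000001000000" "9500000000000000")
known_builds = {program_tag(drop_all): "xdp_drop_all v1"}
```

With such a registry, a monitoring tool that can read back the loaded program (or just its identifier) could report a build name rather than a bare "something is loaded" boolean.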
A use case for this, on top of just trying to figure out which BPF program is loaded, is to actually fetch the source code of a BPF program that was deployed in the field for which the source was lost. It is still uncertain that it will be possible to extract an exact copy that could then be recompiled into the same program. Starovoitov added that he needed this in production to do proper reporting.
IPv4/IPv6 equivalency
The last issue — or set of issues — that Ahern brought up was the question of inconsistencies between IPv4 and IPv6. It turns out that, because both protocols were (naturally) implemented separately, there are inconsistencies in how they are handled in the Linux kernel, which affect, among other things, the VRF framework. The first example he gave was the fact that IPv6 addresses added on the loopback interface generate unreachable routes in the main routing table, yet this doesn't happen with IPv4 addresses. Hannes Frederic Sowa explained this was part of the IPv6 specification: there are stronger restrictions on loopback interfaces in IPv6 than IPv4. Ahern explained that VRF loopback interfaces do not implement these restrictions and wanted to know if this was a problem.
Another issue is that anycast routes are added to the wrong interface. This is apparently not specific to VRF: this was done "just because Java", and has been there from day one. It seems that the Java Virtual Machine builds its own routing table and assumes this behavior, so changing this would break every JVM out there, which is obviously not acceptable.
Finally, Martin KaFai Lau asked if work should be done to merge the IPv4 and IPv6 FIB (forwarding information base) trees. The FIB tree is the data structure that represents routing tables in the Linux kernel. Miller explained that the two trees are not semantically equivalent: while IPv6 does source-address lookup and routing, IPv4 does not. We can't remove the source lookups from IPv6, because "people probably use that". According to Alexander Duyck, adding source tables to IPv4 would degrade performance to the level of IPv6 performance, which was jokingly referred to as an incentive to switch to IPv6.
More seriously, Sowa argued that using the same compressed tree IPv4 uses in IPv6 could make sense. People may want to have source routing in IPv4 as well. Miller argued that the kernel is optimized for 32-bit addresses in IPv4, and conceded that it could be scaled to 64-bit subnets, but 128-bit addresses would be much harder. Sowa suggested that they could be limited to 64 bits, as global routes that are announced over BGP usually have such a limit, and more specific routes are usually at discrete prefixes like /65, /127 (for interconnect links) or /128 (for point-to-point links). He expressed concerns over the reliability of such an implementation so, at this point, it is unlikely that the data structures could be merged. What is more likely is that the code path could be merged and simplified, while keeping the data structures separate.
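The semantic difference Miller described can be made concrete with a toy lookup in Python. This is a sketch under simplifying assumptions: it scans a flat list instead of walking a tree, the lookup() function and the route tuples are invented for the example, and the tie-breaking rule (longest destination prefix first, then longest source prefix) is an assumption rather than a description of the kernel's behavior.

```python
import ipaddress


def lookup(routes, dst, src=None):
    """Return the next hop of the best matching route, or None.

    Each route is (dst_prefix, src_prefix_or_None, nexthop). A route with a
    source prefix only matches packets whose source address falls inside it.
    """
    dst_addr = ipaddress.ip_address(dst)
    src_addr = ipaddress.ip_address(src) if src else None
    best_key, best_hop = None, None
    for dst_prefix, src_prefix, nexthop in routes:
        dst_net = ipaddress.ip_network(dst_prefix)
        if dst_net.version != dst_addr.version or dst_addr not in dst_net:
            continue
        src_len = 0
        if src_prefix is not None:
            src_net = ipaddress.ip_network(src_prefix)
            if (src_addr is None or src_addr.version != src_net.version
                    or src_addr not in src_net):
                continue
            src_len = src_net.prefixlen
        key = (dst_net.prefixlen, src_len)
        if best_key is None or key > best_key:
            best_key, best_hop = key, nexthop
    return best_hop


routes = [
    ("10.0.0.0/8", None, "a"),
    ("10.1.0.0/16", None, "b"),
    ("2001:db8::/32", None, "c"),
    ("2001:db8::/32", "2001:db8:1::/48", "d"),  # source-specific route
]
```

With these routes, lookup(routes, "10.1.2.3") picks the /16 over the /8, while the two IPv6 routes share a destination prefix and are told apart only by the packet's source address. That second dimension is what a destination-only tree has no place for.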
Modules options substitutions
The next issue that was raised was from Jiří Pírko, who asked how to pass configuration options to a driver before the driver is initialized. Some chips require that some settings be sent before the firmware is loaded, which leads to a weird situation where there is a need to address a device before it's actually recognized by the kernel. The question then can be summarized as to how to pass information to a device that doesn't exist yet.
The answer seems to be that devlink could do this, as it has access to the full device tree and, therefore, to devices that can be addressed by (say) PCI identifiers. Then a possible devlink command could look something like:
devlink dev pci/0000:03:00.0 option set foo bar
This idea raised a bunch of extra questions: some devices don't have a one-to-one mapping with the PCI bridge identifiers, for example, meaning that those identifiers cannot be used to access such devices. Another issue is that you may want to send multiple settings in a single transaction, which doesn't fit well in the devlink model. Miller then proposed to let the driver initialize itself to some state and wait for configuration to be sent when necessary. Another way would be to unregister the driver and re-register with the given configuration. Shrijeet Mukherjee explained that right now, Cumulus is doing this using horrible startup script magic by retrying and re-registering, but it would be nice to have a more standard way to do this.
Control over UAPI patches
Another issue that came up was the problem of changes in the user-space API (UAPI) which break backward compatibility. Pírko said that "we have to be more careful about those changes". The problem is that reviewers are not always available to make detailed reviews of such changes and may not notice API-breaking changes. Pírko proposed creating a bot to check if a given patch introduces UAPI changes, changes in structs, or in netlink enums. Miller said he could block merges until discussions happen and that patchwork, which Miller uses to process patches from the mailing list, does some of this. He also pointed out there aren't enough test cases in the first place.
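A first approximation of such a bot is easy to sketch in Python. The version below is only an illustration of the idea, not anything that was shown at Netconf: the uapi_report() name is invented, and the heuristic (flag any file whose path contains include/uapi/ and count the lines a patch removes from it) is an assumption about what a reviewer would want highlighted.

```python
import re

DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)")


def uapi_report(patch_text):
    """Map each touched UAPI header to the number of lines the patch removes from it."""
    report = {}
    current = None
    for line in patch_text.splitlines():
        header = DIFF_HEADER.match(line)
        if header:
            path = header.group(2)
            # Exported headers live under include/uapi/ (also below arch/).
            current = path if "include/uapi/" in path else None
            if current:
                report.setdefault(current, 0)
        elif current and line.startswith("-") and not line.startswith("---"):
            # A removed line in an exported header deserves a human look.
            report[current] += 1
    return report
```

A non-empty report would not prove that a patch breaks compatibility, and an empty one would not prove that it is safe; at best it tells a reviewer where to look first.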
Starovoitov argued that UAPI isn't special: there are other ways of breaking backward compatibility. He expressed concerns that such a bot could create a false sense that everything is fine while a patch could break compatibility and not be detected. Miller countered that UAPI is special in that "we're stuck with it forever". He then went on to propose that, since there's a maintainer (or more) for each module, he can make sure that each maintainer explicitly approves changes to those modules.
Data-center hardware changes
Starovoitov brought up the issue of a new type of hardware that is currently being deployed in data centers called a "multi-host NIC" (network interface card). It's a single NIC that is connected to multiple servers. Facebook, for example, uses this in its Yosemite platform that shoves twelve servers into a 2U rack mount, in three modules. Each module is made of four servers connected to the traditional switch fabric with a single NIC through PCI-Express. Mellanox and Broadcom also have similar devices.
One question is how to manage those devices. Since they are connected through a PCI-Express bus, Linux will see them as a NIC, yet they are also a little like switches, in that they interconnect multiple servers. Furthermore, the kernel security model assumes that a NIC is trusted, and gladly opens its own memory to NICs through DMA; this can become a huge security issue when the NIC is under the control of another server. This can especially become problematic if we consider that there could be TLS hardware offloading in the future with the introduction of in-kernel TLS stacks.
The other problem is the question of reliability: since those devices are currently "dumb", they need to be managed just like a regular NIC. If the host managing the card crashes, it could disable a whole set of servers that rely on the same NIC. There could be an election process among the servers, but that complicates significantly what used to be a simple PCI connection.
Mukherjee pointed out that the model Cisco uses for this is that the "smart NIC" is a "slave" of the main switch fabric. It's a daughter card, which makes it easier to manage from a network perspective. It is clear that Linux will need a way to represent those devices, probably through the newly introduced switchdev or DSA (distributed switch architecture), but it will be something to keep an eye on as density increases in the data center.
There were many more discussions during Netconf, too many to cover here, but in the end, Miller thanked everyone for all the interesting topics as the participants dispersed for a day off to travel to Montreal to attend the following Netdev conference.
The author would like to thank the Netconf and Netdev organizers for travel to, and hosting assistance in, Toronto. Many thanks to Alexei Starovoitov for his time taken for a technical review of this article.
Note: this article first appeared in the Linux Weekly News.