This article is part of a larger series about NetConf/NetDev 2.1.

A report from Netconf: Day 1

A report from Netconf: Day 2

New approaches to network fast paths

The rise of Linux-based networking hardware

This article covers the second day of the informal Netconf discussions, held on on April 4, 2017. Topics discussed this day included the binding of sockets in VRF, identification of eBPF programs, inconsistencies between IPv4 and IPv6, changes to data-center hardware, and more. (See this article for coverage from the first day of discussions).

How to bind to specific sockets in VRF
XDP and eBPF program identification
IPv4/IPv6 equivalency
Modules options substitutions
Control over UAPI patches
Data-center hardware changes

How to bind to specific sockets in VRF

One of the first presentations was from David Ahern of Cumulus, who presented a few interesting questions for the audience. His first was the problem of binding sockets to a given interface. Right now, there are four different ways this can be done:

the old SO_BINDTODEVICE generic socket option (see socket(7))
the IP_PKTINFO, IP-specific socket option (see ip(7)), introduced in Linux 2.2
the IP_UNICAST_IF flag, introduced in Linux 3.3 for WINE
the IPv6 scope ID suffix, part of the IPv6 addressing standard

So there's a problem of having too many ways of doing the same thing, something that cannot really be fixed without breaking ABI compatibility. But even worse, conflicts between those options are not reported by the kernel so it's possible for a user to set up socket flags in a way that certain flags override others and there are no checks made or errors reported. It was agreed that the user should get some notification of conflicting changes here, at least.

Furthermore, binding sockets to a specific VRF (Virtual Routing and Forwarding) device is not currently possible, so Ahern asked what the best way to do this would be, considering the many options available. A use case example is a UDP multicast socket that could be bound to a specific interface within a VRF.

This is an old problem: Tom Herbert explained that there were previous discussions about making the bind() system call more programmable so that, for example, you could bind() a UDP socket to a discrete list of IP addresses or a subnet. So he identified this issue as a broader problem that should be addressed by making the interfaces more generic.

Ahern explained that it is currently possible to bind sockets to the slave device of a VRF even though that should not be allowed. He also raised the question of how the kernel should tell which socket should be selected for incoming packets. Right now, there is a scoring mechanism for UDP sockets, but that cannot be used directly in this more general case.

David Miller said that there are already different ways of specifying scope: there is the VRF layer and the namespace ("netns") layer. A long time ago, Miller reluctantly accepted the addition of netns keys everywhere, swallowing the performance cost to gain flexibility. He argued that a new key should not be added and instead existing infrastructure should be reused. Herbert argued this was exactly the reason why this should be simplified: "if we don't answer the question, people will keep on trying this". For example, one can use a VRF to limit listening addresses, but it gets complicated if we need a device for every address. It seems the consensus evolved towards using, IP_UNICAST_IF, added back in 2012, which is accessible for non-root users. It is currently limited to UDP and RAW sockets, but it could be extended for TCP.

XDP and eBPF program identification

Ahern then turned to the problem of extracting BPF programs from the kernel. He gave the example of a simple cBPF (classic BPF) filter that checks for ARP packets. If the filter is read back from the kernel, the user gets a blob of binary data, which is hard to interpret. There is an kernel verifier that can show C-like output, but that is also difficult to interpret. Ahern then added annotations to his slide that showed what the original program actually does, which was a good demonstration of why such a feature is needed.

Ahern explained that, at least for cBPF, it should be possible to recover the original plaintext, or at least something close to the original program. A first step would be to replace known constants (like 0x806 for ARP). Even with eBPF, it should be possible to improve the output. Alexei Starovoitov, the BPF maintainer, explained that it might make sense to start by returning information about the maps used by an eBPF program. Then more complex data structures could be inspected once we know their type.

The first priority is to get simple debugging tools working but, in the long term, the goal is a full decompiler that can reconstruct instructions into a human-readable program. The question that remains is how to return this data. Ahern explained that right now the bpf() system call copies the data to a different file descriptor, but it could just fill in a buffer. Starovoitov argued for a file descriptor; that would allow the kernel to stream everything through the same descriptor instead of having many attach points. Netlink cannot be used for this because of its asynchronous nature.

A similar issue regarding the way we identify express data path (XDP) programs (which are also written in BPF) was raised by Daniel Borkmann from Covalent. Miller explained that users will want ways to figure out which XDP program was installed, so XDP needs an introspection mechanism. We currently have SHA-1 identifiers that can be internally used to tell which binary is currently loaded but those are not exposed to user space. Starovoitov mentioned it is now just a boolean that shows if a program is loaded or not.

A use case for this, on top of just trying to figure out which BPF program is loaded, is to actually fetch the source code of a BPF program that was deployed in the field for which the source was lost. It is still uncertain that it will be possible to extract an exact copy that could then be recompiled into the same program. Starovoitov added that he needed this in production to do proper reporting.

IPv4/IPv6 equivalency

The last issue — or set of issues — that Ahern brought up was the question of inconsistencies between IPv4 and IPv6. It turns out that, because both protocols were (naturally) implemented separately, there are inconsistencies in how they are handled in the Linux kernel, which affect, among other things, the VRF framework. The first example he gave was the fact that IPv6 addresses added on the loopback interface generate unreachable routes in the main routing table, yet this doesn't happen with IPv4 addresses. Hannes Frederic Sowa explained this was part of the IPv6 specification: there are stronger restrictions on loopback interfaces in IPv6 than IPv4. Ahern explained that VRF loopback interfaces do not implement these restrictions and wanted to know if this was a problem.

Another issue is that anycast routes are added to the wrong interface. This is apparently not specific to VRF: this was done "just because Java", and has been there from day one. It seems that the Java Virtual Machine builds its own routing table and assumes this behavior, so changing this would break every JVM out there, which is obviously not acceptable.

Finally, Martin Kafai Lau asked if work should be done to merge the IPv4 and IPv6 FIB (forwarding information base) trees. The FIB tree is the data structure that represents routing tables in the Linux kernel. Miller explained that the two trees are not semantically equivalent: while IPv6 does source-address lookup and routing, IPv4 does not. We can't remove the source lookups from IPv6, because "people probably use that". According to Alexander Duyck, adding source tables to IPv4 would degrade performance to the level of IPv6 performance, which was jokingly referred to as an incentive to switch to IPv6.

More seriously, Sowa argued that using the same compressed tree IPv4 uses in IPv6 could make sense. People may want to have source routing in IPv4 as well. Miller argued that the kernel is optimized for 32-bit addresses in IPv4, and conceded that it could be scaled to 64-bit subnets, but 128-bit addresses would be much harder. Sowa suggested that they could be limited to 64 bits, as global routes that are announced over BGP usually have such a limit, and more specific routes are usually at discrete prefixes like /65, /127 (for interconnect links) or /128 for (for point-to-point links). He expressed concerns over the reliability of such an implementation so, at this point, it is unlikely that the data structures could be merged. What is more likely is that the code path could be merged and simplified, while keeping the data structures separate.

Modules options substitutions

The next issue that was raised was from Jiří Pírko, who asked how to pass configuration options to a driver before the driver is initialized. Some chips require that some settings be sent before the firmware is loaded, which leads to a weird situation where there is a need to address a device before it's actually recognized by the kernel. The question then can be summarized as to how to pass information to a device that doesn't exist yet.

The answer seems to be that devlink could do this, as it has access to the full device tree and, therefore, to devices that can be addressed by (say) PCI identifiers. Then a possible devlink command could look something like:

    devlink dev pci/0000:03:00.0 option set foo bar

This idea raised a bunch of extra questions: some devices don't have a one-to-one mapping with the PCI bridge identifiers, for example, meaning that those identifiers cannot be used to access such devices. Another issue is that you may want to send multiple settings in a single transaction, which doesn't fit well in the devlink model. Miller then proposed to let the driver initialize itself to some state and wait for configuration to be sent when necessary. Another way would be to unregister the driver and re-register with the given configuration. Shrijeet Mukherjee explained that right now, Cumulus is doing this using horrible startup script magic by retrying and re-registering, but it would be nice to have a more standard way to do this.

Control over UAPI patches

Another issue that came up was the problem of changes in the user-space API (UAPI) which break backward compatibility. Pírko said that "we have to be more careful about those changes". The problem is that reviewers are not always available to make detailed reviews of such changes and may not notice API-breaking changes. Pírko proposed creating a bot to check if a given patch introduces UAPI changes, changes in structs, or in netlink enums. Miller said he could block merges until discussions happen and that patchwork, which Miller uses to process patches from the mailing list, does some of this. He also pointed out there aren't enough test cases in the first place.

Starovoitov argued UAPI isn't special, there are other ways of breaking backward compatibility. He expressed concerns that such a bot could create a false sense that everything is fine while a patch could break compatibility and not be detected. Miller countered that UAPI is special in that "we're stuck with it forever". He then went on to propose that, since there's a maintainer (or more) for each module, he can make sure that each maintainer explicitly approves changes to those modules.

Data-center hardware changes

Starovoitov brought up the issue of a new type of hardware that is currently being deployed in data centers called a "multi-host NIC" (network interface card). It's a single NIC that is connected to multiple servers. Facebook, for example, uses this in its Yosemite platform that shoves twelve servers into a 2U rack mount, in three modules. Each module is made of four servers connected to the traditional switch fabric with a single NIC through PCI-Express. Mellanox and and Broadcom also have similar devices.

One question is how to manage those devices. Since they are connected through a PCI-Express bus, Linux will see them as a NIC, yet they are also a little like switches, in that they interconnect multiple servers. Furthermore, the kernel security model assumes that a NIC is trusted, and gladly opens its own memory to NICs through DMA; this can become a huge security issue when the NIC is under the control of another server. This can especially become problematic if we consider that there could be TLS hardware offloading in the future with the introduction of in-kernel TLS stacks.

The other problem is the question of reliability: since those devices are currently "dumb", they need to be managed just like a regular NIC. If the host managing the card crashes, it could disable a whole set of servers that rely on the same NIC. There could be an election process among the servers, but that complicates significantly what used to be a simple PCI connection.

Mukherjee pointed out that the model Cisco uses for this is that the "smart NIC" is a "slave" of the main switch fabric. It's a daughter card, which makes it easier to manage from a network perspective. It is clear that Linux will need a way to represent those devices, probably through the newly introduced switchdev or DSA (distributed switch architecture), but it will be something to keep an eye on as density increases in the data center.

There were many more discussions during Netconf, too many to cover here, but in the end, Miller thanked everyone for all the interesting topics as the participants dispersed for a day off to travel to Montreal to attend the following Netdev conference.

The author would like to thank the Netconf and Netdev organizers for travel to, and hosting assistance in, Toronto. Many thanks to Alexei Starovoitov for his time taken for a technical review of this article.

Note: this article first appeared in the Linux Weekly News.

Comments on this page are closed.

Created 2017-04-11 13:00. Edited 2020-06-05 20:31.