…where we explore virtual networking using TUN/TAP for QEMU and its relation with network namespaces, or “HTF does the packet flow works on this??!??/!”
Last edit on 2024-04-29
- PS1: I appreciate contributions from anyone. Feel free to correct any misconception in the comments.
- PS2: this post compiles a lot of in-depth content from various sources, and my objective in doing so is to make sense of all of the lingo related to networking everywhere. Possibly, by understanding how this all works, everything else related should be easier.
- PS3: The scope of this post takes into account networking on Linux. There are also four main points to consider when reading:
- The relation between devices and interfaces under Linux (specifically for networking)
- The different packages to achieve this: the old net-tools and iproute2. This guide tries to use iproute2 tooling.
- Distro-specific scripts that wrap around this tooling for virtual network setups. An example is Debian’s ifup(8) and ifdown(8), which is used by LFS (Linux From Scratch) and referenced by the QEMU networking docs. These are cited by a lot of QEMU bridging guides, and some later versions also ship similar scripts under /var
- Deprecated tools that appear in tutorials from the last 15 years or so, like brctl from the old bridge-utils package.
With that being said…
QEMU can use TAP networking, where the network traffic is forwarded by a bridge (sometimes called a multi-port switch, or just treated as a switch. The naming confusion comes from the virtual device sharing its name with the old hardware bridge, which eventually evolved into the switch.)
The bridge forwards traffic at OSI Layer 2, Data Link, which deals with MAC (Media Access Control) addresses and protocols such as ARP (Address Resolution Protocol). BTW, another way to do networking with QEMU is SLIRP, a user-mode network stack that emulates TCP/IP in userspace; nowadays its descendant slirp4netns powers rootless OCI containers, having fixed most of the original drawbacks.
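As a sketch of such a setup (the names br0 and tap0 are arbitrary, the commands need root, and the QEMU flags follow the standard -netdev tap syntax):

```shell
# as root: create a bridge and a TAP, and enslave the TAP to the bridge
ip link add name br0 type bridge
ip link set br0 up
ip tuntap add dev tap0 mode tap user "$USER"   # TAP owned by your user
ip link set tap0 master br0
ip link set tap0 up

# point QEMU at the TAP; frames from the guest NIC now reach br0
qemu-system-x86_64 \
  -netdev tap,id=net0,ifname=tap0,script=no,downscript=no \
  -device virtio-net-pci,netdev=net0 \
  disk.img
```

With script=no QEMU does not run the ifup helper scripts mentioned above; the bridge wiring is done by hand instead.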
The Linux kernel has had TUN/TAP since around 2000, when the Universal TUN/TAP driver unified earlier implementations from BSD, Solaris and Linux. At Layer 2, the data flowing through the network is the Ethernet frame, so TAP is used for Ethernet tunneling. TUN, on the other hand, operates one layer up and is used for IP tunneling, dealing with IP packets.
The kernel exposes TAP as:
- a character device: /dev/net/tun on modern kernels (older kernels exposed per-interface nodes like /dev/tapX)
- a Virtual Ethernet Interface (VNI): tapX
to talk to the kernel, a userland program writes an Ethernet frame to the character device, and the kernel reads it, injecting the frame into the stack as if it had arrived on the tapX interface. In the other direction, a frame the kernel sends out through the Virtual Ethernet Interface (VNI) tapX can be read back by userland from the character device
There is a distinction between a Virtual Ethernet Interface (VNI) and a Virtual Ethernet Device (VETH). The second is used to create a pair of virtual network interfaces that serve as a connection between network namespaces, so you could say VETH is a specialization of the VNI concept for network namespaces, or netns. They ARE NOT the same thing: when you create a VNI, it is NOT NECESSARILY A PAIR, and you are not specifying which network namespace it is attached to.
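A minimal sketch of the difference, assuming root and hypothetical names (veth-a/veth-b, tap0, a namespace called blue):

```shell
# a veth is always born as a pair of connected interfaces:
ip link add veth-a type veth peer name veth-b

# a plain virtual interface such as a TAP is created alone,
# with no peer and no namespace implied:
ip tuntap add dev tap0 mode tap

# one veth end can then be moved into a network namespace:
ip netns add blue
ip link set veth-b netns blue
```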
An example from this guide on Red Hat is bridging the networking for two VMs and one OCI container. Here we would have two TAPs, one per QEMU VM, and a VETH pair for a network namespace, which could be a container.
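A sketch of that layout with hypothetical names (run as root):

```shell
# one bridge, two TAPs (one per VM), one veth pair for the container
ip link add name br0 type bridge && ip link set br0 up

for t in tap0 tap1; do            # one TAP per QEMU VM
  ip tuntap add dev "$t" mode tap
  ip link set "$t" master br0 up
done

ip netns add ctr                  # the "container" namespace
ip link add veth-br type veth peer name veth-ctr
ip link set veth-br master br0 up
ip link set veth-ctr netns ctr    # container end goes into the namespace
```

All three endpoints hang off the same bridge, so the two VMs and the container see each other at Layer 2.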
But how is this treated in the kernel? After all, even iproute2 just abstracts kernel data structures written in the C programming language. So,
Sockets on Linux are exposed through the VFS (Virtual Filesystem): to userspace they are just file descriptors. A packet enters the NIC and is allocated in memory as an sk_buff (SKB) structure, which is the main networking structure representing a packet (it enters the stack via the method netif_receive_skb()). A locally generated (TX) SKB has a sock object (sk) associated with it. If the packet is a forwarded one, then sk is NULL, because it was not generated on the local host.
It is then processed on the CPU. At this layer the unit of data is the frame, and frames may carry tags such as VLAN (IEEE 802.1Q, aka dot1q).
In the stack, there are network interfaces and network devices (read the Linux Device Drivers book to understand more). There are also network namespaces. When you use any Linux distribution without configuring the network, you are probably in the initial (root) network namespace, but you can create new ones. Every new subset of the network, created as a virtual network, lives inside a network namespace.
So in this concept, both bridge and veth would be a subset of the network namespace. What is the difference?
The true difference lies in how the data structures are handled by the kernel, and all of that API is written in C. So again: they are part of network namespaces, and are constructed using the net_device struct, defined in include/linux/netdevice.h. This struct represents a network device. It can be a physical device, like an Ethernet NIC, or it can be a software device, like a bridge device or a VLAN device.
Below are the Network Namespace Local Devices. They can have the feature flag NETIF_F_NETNS_LOCAL set, meaning that moving them to another namespace fails with the error -EINVAL. In Linux 6.7.1, this flag is defined in include/linux/netdev_features.h (https://elixir.bootlin.com/linux/v6.7.1/source/include/linux/netdev_features.h#L126), which is then included by the netdevice header file.
- loopback device
- bridge device
- VXLAN device
- PPP device
When a network namespace is deleted, every device in it that does not have the NETIF_F_NETNS_LOCAL flag set is moved back to the default initial network namespace (init_net). The devices that do have this flag set are deleted instead.
Below are some tunnel drivers. These use the flag NETIF_F_LLTX (lockless TX).
- VXLAN (Virtual eXtended LAN)
- VETH (Virtual Ethernet)
- IPIP (IP over IP)
Now that you know these items are logically part of the network namespace subsystem, you can inspect and control their device features with the ethtool(8) program.
an example of this, with enp4s0 instead of eth0:
; ethtool -k enp4s0
Features for enp4s0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: off
tx-scatter-gather: off
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
tx-tcp-segmentation: off
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: off
generic-segmentation-offload: off [requested on]
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
tx-gso-list: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
rx-udp-gro-forwarding: off
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]
;
other tools from iproute2 that can be used are:
- bridge: shows/manipulates bridge addresses and devices
- genl: information about registered generic netlink families
- lnstat: linux network statistics
- rtmon: monitors rtnetlink sockets (NETLINK_ROUTE). Rtnetlink allows the kernel’s routing tables to be read and altered. It is used within the kernel to communicate between various subsystems and for communication with user-space programs. Network routes, IP addresses, link parameters, neighbor setups, queueing disciplines, traffic classes and packet classifiers may all be controlled through NETLINK_ROUTE sockets. The rtnetlink socket is aware of network namespaces; per-namespace state lives in the network namespace object (struct net).
coming back to the OSI layers, Data Link (Layer 2) is where the network device drivers are found. The structure related to this is the net_device structure I described above, representing a network device. But what is a DEVICE?
Device drivers are “black boxes” that make a piece of hardware respond to a well-defined internal programming interface. You can think of a driver as a FUNCTION which makes the HARDWARE respond to an API, this API being written in C. Users make calls that are independent of the specific driver, and the device driver maps these calls to device-specific operations that act on real hardware.
a different perspective is to treat the device driver as a software layer between applications and the actual hardware device.
The three main classes of devices are:
- block device
- character device
- network device
in a network device driver, every network transaction exchanges data with other hosts, and the interface can be implemented in hardware or software. If it is software, it is a virtual network interface, as I said at the start of the post.
Again, the interface (virtual or physical) is a type of network device. Examples of network interfaces are:
- the loopback interface (virtual)
- an Ethernet interface (eth0 under the classic kernel naming, enp4s0 or similar under systemd’s predictable naming scheme)
A network interface, virtual or physical, is a network device that sends and receives data packets, driven by the network subsystem of the kernel, without knowing how individual transactions map to the actual packets being transmitted. Even in the case of TCP, which is stream-oriented, the network driver still doesn’t know about individual connections (or protocols); it just receives and transmits packets. The protocol logic happens before the data is handed to the driver.
Unlike character devices such as /dev/tty1, a network interface (a network device) is not stream-oriented, and although UNIX assigns it a unique name (eth0, enp4s0), it has no entry on the filesystem. This again means the communication model differs from character and block devices, where you use the read and write syscalls.
The syscalls used for network interfaces (a network device) are different. Beej’s guide to network programming gives the main concepts:
- Everything under Linux is a ~file~ stream of bytes
- any I/O is done by reading or writing to a file descriptor
- a socket is a way to speak to other programs using standard Unix File Descriptors.
- a file descriptor is an integer associated with an open file
- the open file associated with a file descriptor can be a network connection, a FIFO, a pipe, a terminal, a file on disk, a file on RAM (check out memfd_create()), etc.
So, even with a NETWORK INTERFACE (A NETWORK DEVICE, WHICH CAN BE VIRTUAL OR PHYSICAL), to communicate with other program, first you talk to a file descriptor. And as I’ve said before, this logic happens before it sends the data to the driver.
The way this is done is by calling a syscall, a method of the kernel API that lets userspace request resources from the kernel without elevating its own privileges. Under the hood, a syscall is a controlled transition from user mode into kernel mode.
The syscall path for sockets is separate from plain file I/O, with its own optimized handling of the network data structures (the SKBs) inside the kernel. But it comes down to:
- make a call to the socket(2) system routine (a method/function that happens to be a syscall, manpage section 2)
- a file descriptor is returned
- then use recv() and send() to communicate through the file descriptor
Note that since step 2 returns a file descriptor, it could be used with plain write and read, but recv and send offer better control over the data transmission. Language-specific networking libraries wrap their functions around these syscalls with logic that comes ready to use for the programmer. There are also various kinds of sockets, like:
- DARPA/Internet/BSD sockets
- UNIX domain sockets (path names on the local host)
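Both kinds can be listed with ss, the socket statistics tool from iproute2:

```shell
# ss (iproute2) lists sockets per family:
ss -t -a   # TCP (Internet/BSD) sockets
ss -x -a   # UNIX domain sockets (note the path names in the output)
```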
loopback device, network namespaces again
Going back to network interfaces (and I will repeat again: a network device, which can be physical or virtual), device drivers are modularized in the kernel, much like filesystems.
I’ve talked about the loopback device being among the Network Namespace Local Devices before. It is represented per namespace as loopback_dev. Also, every new network namespace is created with only one network device: the loopback device.
General rules about network namespaces, related to the loopback device
- Each time you create a network namespace, only the loopback device comes ready, without any sockets.
- Also, each network device (bridge, VLAN etc.) created from a process running in this namespace (like a shell) belongs to this namespace, meaning the namespace is shared by these devices.
- Each network device (physical or virtual interface, bridge, etc.) can only be present in a single network namespace at a time; the loopback device is special in that every namespace gets its own instance.
- Physical hardware devices can be moved between namespaces (one at a time); when a namespace is destroyed, its physical devices are moved back to the initial (root) network namespace.
When you create a new network namespace, it has only the network loopback device. Some ways to create a network namespace are:
- with the ip netns command of iproute2 (you will shortly see an example)
- using the unshare(1) utility of util-linux, with the --net flag
- a userspace app making a call to the clone(2) or unshare(2) system calls, with the flag CLONE_NEWNET for both
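A quick demonstration of the unshare(1) route; with -r (map root inside a user namespace) no real root is needed:

```shell
# a brand-new network namespace contains only the loopback device, down:
unshare -rn ip -o link show
# 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN ...
```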
The loopback_dev object of a network namespace is assigned in the loopback_net_init() method, found in drivers/net/loopback.c. As I said before, you cannot move the loopback device from one network namespace to another (it is one of the “network namespace local devices”, with the flag NETIF_F_NETNS_LOCAL set).
btw, in the Linux kernel network stack, an object of the network device structure (hehe, which an interface, physical or virtual, is part of >:) ) usually carries tables of function pointers. Two examples of these objects are:
- an object of network device callbacks (a net_device_ops object): function pointers to open/stop a device, start transmission, change the MTU, etc.
- an object of ethtool callbacks, which gets information about the device in response to the ethtool(8) CLI program
In Landley’s video, he says something like: “ext2 is a filesystem. This filesystem serves as a lens to see a block device. Then the loop device is another one of these lenses, which makes a file look like a block device”.
Careful with the name here: the loop device of that quote (/dev/loopN, managed with losetup(8)) is a block device, and only shares a similar name with the network loopback interface lo; they are unrelated.
why drivers sit behind this interface: the programming interface exists so that drivers can be developed separately from the kernel and plugged in at runtime when needed.
References
- Linux Device Drivers book. Get it on LWN or bootlin
- Linux Kernel Networking: Implementation and Theory Book by Rami Rosen
- rtnetlink manpage
- The net_device struct, line 2093
- Beej’s Guide to Network Programming
- Red Hat’s introduction to Linux interfaces for virtual networking
- biriukov’s GNU/Linux Shell related internals: File descriptor and open file description
- Rob Landley’s presentation at the Linux Foundation YT channel: Tutorial: Building the Simplest Possible Linux System
- davem’s How SKBs work
- Jake Edge’s article on LWN.net: Namespaces in operation, part 7: Network namespaces
- Landley’s How mount actually works