About huge MTUs in Yggdrasil

13 July 2018 by Neil Alexander

That’s a very big MTU!

You might have noticed that Yggdrasil doesn’t conform to the standard MTUs you might have seen elsewhere. Most Ethernet networks sit somewhere around the 1500 byte mark. Less if VPN tunnels are involved due to the overheads of those VPN protocols. So you might have been surprised to see an interface MTU as high as 65535 bytes in your yggdrasil.conf!

In fact, this is not a mistake. It is very deliberate.

The MTU is a configurable option which determines how big a packet should be before you should break out into a new packet. In addition to that, the operating system maintains a link MTU setting for each network adapter - effectively a value that says “packets equal to or smaller than this number of bytes are safe to send on this link in one piece”. Any larger than that and the operating system will have to fragment the packet down into smaller ones before sending out onto the wire.

With a smaller MTU, you will be forced to re-send the IP headers (and possibly other headers) far more often with every single packet. Those are not only wasted bytes on the wire, but every packet likely requires a new set of system calls to handle. In the case of Yggdrasil, we rely on system calls not just for socket operations but also for TUN/TAP. Each system call requires a context switch, which is a slow operation. On embedded platforms, this can be a real killer for performance - in fact, on an EdgeRouter X, context switching for the TUN/TAP adapter is a greater bottleneck than the cryptographic algorithms themselves!

Selecting TCP instead of UDP

Therefore it is reasonable to surmise that less context switches and less packet headers are better. Using an MTU of 65535 over 1500 packs in almost 43 times more data before the next set of headers. That’s almost 1680 bytes saved from IP headers alone over 65535 bytes! But sending a 65535 byte Yggdrasil packet over a link with an MTU of 1500 would require the packet to be fragmented 43 times, right?

Instead, Yggdrasil uses TCP connections for peerings. This not only allows us to take advantage of SOCKS proxies (and Tor) in a way that we cannot with UDP, but it also gives us stream properties on our connections instead of being manually forced to chunk UDP packets. In fact, we did this in the past, and it was ugly, and actually worse than TCP performance-wise in many cases.

TCP will adjust the window size to match the lower link - in this case a probable 1500 MTU - and will stream the large packet in chunks until it arrives at the remote side in a single piece. For this reason, Yggdrasil never has to care about the MTU of the real link between you and your peers. It just receives a packet up to 65535 bytes, which it then processes and hands off to the TUN/TAP adapter.

But… TCP over TCP?

But that means that you are tunnelling TCP over TCP, I hear you cry. That’s crazy talk, surely? The big problem with TCP-over-TCP is that in the event of congestion or packet loss, TCP will attempt to retransmit the failed packets, but if TCP control packets from the inner connection are retransmitted, reordered etc. by the encapsulating TCP connection, this results in a substantial performance drop whilst the operating system throttles down that inner TCP connection to cope with what it believes to be congestion.

However, by using large MTUs, and therefore larger window sizes on the inner TCP connection, we send far less TCP control messages over the wire - possibly up to as many as 43 times less - therefore there are far less opportunities for control messages on the inner connection to be affected or amplified by retransmission. This helps to stabilise performance on the inner TCP connections.

There are also some other clever things taking place at the Yggdrasil TCP layer. In particular, LIFO queues are used for session traffic, which results in reordering at almost the first sign of any congestion on the encapsulating link. TCP very eagerly backs off in this instance, which helps to deal with real congestion on the link in a sane manner.

Agreeing on an MTU

A big problem we came across was that different operating systems have different maximum MTUs. Linux, macOS and Windows all allow maximum MTUs of 65535 bytes, but FreeBSD only allows half that at 32767 bytes, OpenBSD lower again at 16384 bytes and NetBSD only allowing up to a paltry 9000 bytes! Not to mention that there might also be good reasons for adjusting the MTU by hand, hence the IfMTU option in yggdrasil.conf.

We need a way for any two Yggdrasil nodes to agree on a suitable MTU that works for both parties. To get around this, each node sends its own MTU to the remote side when establishing an encrypted traffic session. Knowing both your own MTU and the MTU of the remote side allows you to select the lower of the two MTUs, as the lower of the two MTUs is guaranteed to be safe for both parties. In fact, you can even see what the agreed MTU of a session is by using yggdrasilctl getSessions or by sending a getSessions request to the admin socket.

Using ICMPv6 to signal userspace

So what happens if you establish a session with someone who has a smaller MTU configured than yourself? The session negotiation and MTU exchange will allow Yggdrasil to select a lower MTU for the session, but your TUN/TAP adapter on your own side still has the larger of the two MTUs. How do we tell userspace about the new MTU?

IPv6 implements a Packet Too Big packet type which is designed to facilitate Path MTU Discovery on standard IPv6 networks. If you send a large packet, a node en-route (like a router) can send back an ICMPv6 Packet Too Big message to the sender, signifying that the packet that you sent exceeds the largest supported packet size (effectively an upstream link MTU). The sending node can then fragment the packet and resend it in smaller chunks so that it matches that lower MTU on the path. Yggdrasil takes advantage of this behaviour by generating Packet Too Big control messages and sending them back to the TUN/TAP adapter when it detects that you have tried to send a packet across an Yggdrasil session that is larger than the agreed MTU.

This causes the operating system to fragment the packets down to the MTU size supported by the session and resend them. In addition, the operating system caches this newly discovered MTU for the future so that it fragments correctly for that destination from that point forward. In effect, Yggdrasil “tricks” the operating system into thinking that Path MTU Discovery is really taking place on the link.

Conclusion

You will see that Yggdrasil performs a number of tricks to allow flexible MTUs between nodes, and takes advantage of the stream nature of TCP to provide the greatest level of performance and flexibility. There are a number of moving parts to this - the session negotiation, ICMPv6 message generation and the choice of TCP over UDP for peering connections. However, I hope this demystifies some of our choices and why we have encouraged the use of larger MTUs on our traffic sessions.