On a few occasions I’ve been asked about how Yggdrasil was started, or what motivated certain things about the design. I’ve talked about the motivation and technical details in other blog posts, but I haven’t talked about the history before, so I thought it was about time.
The first time I can recall hearing about mesh networks, as a concept, was some time in late 2010 or early 2011, when B.A.T.M.A.N. reached the mainline Linux kernel.
I liked the idea, but since I obsess over how things scale, I was worried about the network’s ability to cope with an internet-like number of users.
In B.A.T.M.A.N., as in most other protocols, nodes must either rely on some externally configured (and coordinated) subnetting, or else every node in a network must know about every other node in the network.
In particular, each node periodically sends a broadcast packet through the network, which allows the rest of the network to find a path back to the originating node.
Other approaches, such as AODV, only search for routes when they’re needed, but the same ~O(n) cost applies for each node in a network with n nodes.
At a certain point, particularly in a shared medium wireless network, the cost of protocol traffic can exceed the resources available to the network, leaving no room left over to route any actual traffic for the user.
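To put rough numbers on that scaling intuition (this is my own back-of-the-envelope framing, not anything from the B.A.T.M.A.N. spec), with n nodes each flooding a periodic announcement to the rest of the network, the per-interval control traffic grows roughly as follows:

```latex
% Back-of-the-envelope only: n nodes, each announcement flooded to about n-1
% other nodes per interval, so the network carries on the order of
\[
  \text{control transmissions per interval} \;\approx\; n \times (n - 1) \;=\; \Theta(n^2)
\]
% while the capacity of a shared wireless medium stays roughly constant,
% so past some network size the control traffic alone can saturate it.
```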
I came across cjdns in the summer of 2012. The thing about cjdns that caught my attention was how it used a Distributed Hash Table (DHT) to allow each node to look up a path to any other node, instead of relying on broadcast traffic. The idea being, if you can use a DHT instead of broadcast traffic, then you can just throw the whole network into one large subnet, with “flat” identifiers (IP addresses) that have nothing to do with the position of a node in the network. Then, since you still need some way to assign addresses, you can derive them from a hash of a node’s public encryption key. That simultaneously addresses the protocol overhead issue, handles address assignment, and lets you do end-to-end encryption without depending on public key infrastructure.
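To make the “flat identifier” idea a bit more concrete, here’s a minimal Go sketch of deriving an IPv6-sized address from a hash of a node’s public key. The key type, hash function, and byte layout below are illustrative assumptions on my part; the real cjdns and Yggdrasil derivations differ in the details (address prefixes, which hash is used, and so on).

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha512"
	"fmt"
	"net"
)

// addressFromKey is a toy illustration of deriving a "flat" IPv6-style
// identifier from a node's public key: hash the key and use the first
// 16 bytes of the digest as the address. This is NOT the exact scheme
// used by cjdns or Yggdrasil, just the general shape of the idea.
func addressFromKey(pub ed25519.PublicKey) net.IP {
	digest := sha512.Sum512(pub)
	addr := make(net.IP, net.IPv6len)
	copy(addr, digest[:net.IPv6len])
	return addr
}

func main() {
	pub, _, err := ed25519.GenerateKey(nil) // nil uses crypto/rand
	if err != nil {
		panic(err)
	}
	fmt.Println("node address:", addressFromKey(pub))
}
```

The point is just that the address is a pure function of the key, so holding the key is what entitles you to the address, with no registry or coordination needed.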
What could go wrong? Well, the best short example I can give is to imagine that Alice wants to deliver a package to Carol, and they live in a world without maps or addresses, and where you can’t rely on directions like “go North by any route until you reach X”, so everyone needs to memorize any roads or routes that they care about. Alice doesn’t know where Carol lives, but she knows where Bob lives, and she has reason to believe that Bob knows where Carol lives. So, Alice visits Bob and asks for directions to Carol. Bob tells Alice how to get from Bob’s house to Carol’s house, and Alice memorizes this. Now, any time Alice wants to deliver a package to Carol, she travels from her house to Bob’s house, and then from Bob’s house to Carol’s house. If anyone asks Alice for a path to Carol, she will give them the path from herself to Carol, including the unnecessary detour past Bob. If someone knows enough about the layout of the streets to recognize the detours, or otherwise knows that there’s a shorter path between two points somewhere on the route, then they could improve upon this path, but in general this doesn’t happen, because nobody knows enough about the layout of things to see the big picture of where everything is.
That’s basically how cjdns routing worked before supernodes were introduced. Supernodes keep a (centralized) view of the full network, and then other nodes can ask a supernode (instead of doing DHT lookups) for a path. Ignoring any technical complaints I may have about that approach, it sidesteps the problem I’m interested in solving, so I stopped actively contributing to cjdns once the decision was made to go that route, and started looking for other ways to solve the routing problems cjdns had faced.
By around the middle of 2015, I had thrown together a basic skeleton of a network simulator in Python, so I could compare the paths that different routing schemes find to the shortest paths through the same networks. Having studied up on the latest and greatest academic works at the time, I had initially been thinking that something resembling Thorup and Zwick’s universal compact routing scheme made the most sense, but I had issues finding a way to implement that securely as a distributed algorithm running on a dynamic network.
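The core measurement a simulator like that makes is simple: compute the true shortest path length through the graph (e.g. with a breadth-first search) and compare it against the length of whatever path the routing scheme under test actually found, giving a “stretch” factor. The sketch below is a toy Go reconstruction of that idea, not the original Python simulator.

```go
package main

import "fmt"

// bfsDistances returns hop counts from src to every reachable node,
// i.e. the true shortest-path lengths in an unweighted graph.
func bfsDistances(adj map[int][]int, src int) map[int]int {
	dist := map[int]int{src: 0}
	queue := []int{src}
	for len(queue) > 0 {
		u := queue[0]
		queue = queue[1:]
		for _, v := range adj[u] {
			if _, seen := dist[v]; !seen {
				dist[v] = dist[u] + 1
				queue = append(queue, v)
			}
		}
	}
	return dist
}

func main() {
	// Tiny example graph: a triangle 0-1-2.
	adj := map[int][]int{0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
	shortest := bfsDistances(adj, 0)[2] // true shortest 0 -> 2 is 1 hop
	schemePath := []int{0, 1, 2}        // pretend the scheme detoured via 1
	stretch := float64(len(schemePath)-1) / float64(shortest)
	fmt.Printf("shortest=%d scheme=%d stretch=%.2f\n", shortest, len(schemePath)-1, stretch)
}
```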
To make a long story short, I ultimately took the most inspiration from Robert Kleinberg’s approach, which is to use a greedy embedding. Here’s the thing: the Kleinberg approach grows a spanning tree of a (static) network, embeds the tree in the hyperbolic plane, and then proves that this embedding is always greedy (meaning, if you just forward to the point in the metric space closest to the destination, you’ll never hit a dead end). The only real difference is that Yggdrasil doesn’t bother to embed the tree in the hyperbolic plane. Instead, each node remembers the path from the root to itself, and we use these paths to calculate distance apart on the tree. We’d need to know the per-hop tree information anyway to securely build the tree, so skipping the embedding step saves us some complexity.
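As a rough sketch of what that looks like in code (the coordinate format and names below are mine for illustration, not Yggdrasil’s actual wire format): a node’s location is its path of hops down from the root, and the distance between two nodes on the tree is the number of hops up to their deepest common ancestor plus the number of hops back down.

```go
package main

import "fmt"

// Each node's "coordinates" are the sequence of hops from the tree's root down
// to that node (plain numbers here for illustration; the real protocol carries
// signed per-hop information). Distance on the tree is: go up from a to the
// deepest common ancestor, then down to b.
func treeDistance(a, b []uint64) int {
	common := 0
	for common < len(a) && common < len(b) && a[common] == b[common] {
		common++
	}
	return (len(a) - common) + (len(b) - common)
}

func main() {
	// Two nodes sharing the first two hops from the root.
	a := []uint64{1, 4, 2}
	b := []uint64{1, 4, 7, 3}
	fmt.Println("tree distance:", treeDistance(a, b)) // 1 up + 2 down = 3
}
```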
Using a DHT, we can look up whoever we want to talk to (specified by an IPv6 “address”, which is a flat identifier / hash of a key, as in cjdns) and learn where they are on the spanning tree. Then, when a node needs to forward a packet, it checks the tree location of each of its peers and forwards to whichever one is closest to the destination (give or take a few caveats about congestion control). This is explained in more detail in earlier blog posts, if you’re not familiar with how Yggdrasil routes and care to read more.
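Building on the tree-distance idea above, greedy forwarding reduces to handing the packet to whichever peer’s coordinates are closest to the destination, provided that peer is strictly closer than we are. This is a hypothetical sketch with made-up names, not the real forwarding code, which layers congestion control and other details on top.

```go
package main

import "fmt"

// Coords are a node's path of hops from the tree root, as in the previous sketch.
type Coords []uint64

func treeDistance(a, b Coords) int {
	common := 0
	for common < len(a) && common < len(b) && a[common] == b[common] {
		common++
	}
	return (len(a) - common) + (len(b) - common)
}

// nextHop is a toy version of greedy forwarding: pick the peer closest to the
// destination's tree coordinates, but only if it's strictly closer than we are
// (otherwise we're already the closest node we know of).
func nextHop(self Coords, peers map[string]Coords, dest Coords) (string, bool) {
	best, bestDist := "", treeDistance(self, dest)
	for name, coords := range peers {
		if d := treeDistance(coords, dest); d < bestDist {
			best, bestDist = name, d
		}
	}
	return best, best != ""
}

func main() {
	self := Coords{1, 4}
	peers := map[string]Coords{
		"peerA": {1, 4, 2}, // down the branch toward the destination
		"peerB": {1, 5},    // a sibling branch
		"root":  {},        // back up toward the root
	}
	dest := Coords{1, 4, 2, 9}
	if peer, ok := nextHop(self, peers, dest); ok {
		fmt.Println("forward via", peer) // forwards via peerA
	}
}
```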
In our package delivery example, imagine if the streets in Alice’s town were laid out in a grid, and then named and numbered systematically by blocks, with street signs to label where any off-grid bypasses go. Alice and friends still haven’t bought maps, but they know each other’s addresses instead. So, if Alice wants to contact Carol, she first travels to Bob’s house and asks him for Carol’s address. Now, when she wants to deliver a package to Carol, she can simply follow the block structure of the town until she arrives on Carol’s block, and she has the option to take any bypass she happens to come across if it brings her closer to Carol’s place. That’s basically how routing on the tree, or taking an off-tree shortcut, works in Yggdrasil’s greedy routing scheme, except with a tree instead of a grid (which, in addition to working everywhere, seems to work well in the places we care about).
I had most of the important parts of this working, in simulations, by mid-September of 2015. Initially, I also included off-tree distance-vector-like routes to nodes where the on-tree path would be too long, but I abandoned this once I saw that it added relatively little (except protocol overhead) for the kinds of networks that tend to show up in practice, including some internet topology maps from CAIDA and DIMES. In particular, the scheme seems to work well any time the network diameter is small and the number of triangles in the network is large, since the former limits the worst-case paths that the network can use, and the latter adds many opportunities for off-tree shortcuts.
Having (mostly) finished simulation tests by about spring of 2016, I sat on the idea for a while, trying to work up the motivation to do anything with it. I eventually sat down one weekend and worked through gobyexample to learn Go. The language seemed fast enough for a reasonable prototype, easy enough to learn/read that other people could pick it up quickly if they want to contribute, and generally made multithreading/multiprocessing bearable for me. Since I wanted to continue playing with the language, and I’d been meaning to implement my routing scheme for a while, I ultimately resolved to rewrite my sim in Go, refactor the important parts into a library, and then add the missing pieces to make it more-or-less a cjdns clone with different routing. Most of the work happened over a couple of long weekends, and I released the first working prototype on GitHub just before the end of 2017.
Changes since then are mostly documented in the git log, GitHub issues and pull requests, and discussions in our public Matrix channel.
Neil joined and started adding support for other platforms, and we started to roll out public nodes and attract more users.
As of writing, a year or so after the first public release, there are around 130-140 nodes in the network, depending on the time of day, with maybe half of them having joined in the last few months.