Practical peering

25 March 2019 by Arceliar

How many peers do I need, and which ones?

Perhaps the most common questions we receive are about peering. If you’re not familiar with how Yggdrasil works, or even if you are but you haven’t tested things carefully, then it’s sometimes easy to do things which seem like they should work right, but lead to higher latency and lower bandwidth for you or nodes that depend on you in the network. This post is meant to explain what happens when the wrong peers are selected, and what you can do to avoid it.

The problem

When building a physical network, the cost of adding a link between two nodes, as well as the benefits that having that link would give, play a role in deciding which nodes are ultimately linked. That cost often correlates with the cost of using the link – long links are more expensive to create and have higher latency than short links of the same type, for example, and there’s no point in adding a link to another node unless it’s worth the cost. Yggdrasil is designed to work well on the kinds of networks we see in the real world, and makes implicit assumptions which benefit from the relationship between the higher cost to both create and use a long link.

However, when peering Yggdrasil nodes over the internet, the performance difference between two links can be dramatic, but the cost of creating them is always the same: there is no difference between adding a link over the internet to a node if it’s 1 km away or 1000. As a result, it’s easy to add links over the internet which would make no sense if deploying dedicated infrastructure, and can violate some of Yggdrasil’s assumptions as a result. This can lead to worse performance for not only the two linked nodes, but other nodes in their area.

Rules of thumb

In an effort to clarify how nodes should connect to public peers, and how public peers should connect to each other, I think it’s helpful if we establish some rules of thumb:

When deciding if to connect to another node, you should only connect to the ones that are “good enough” to be worth the effort. Here, “good enough” means that they have as much (approximately) at least as much bandwidth as your own. A fast node shouldn’t decide to connect to a slow node, instead the slow node should decide if it wants to connect to the fast one.
When connecting to nodes, start with the “closest” (lowest latency) nodes, subject to the above constraint, and work your way out. Try not to skip over (equal or better) nodes if there’s no reason to.

While this may not be the only way to fix the problem, following these rules of thumb should approximate the kinds of constraints that real networks need to deal with. Nodes tend to connect to whoever is closest, and better nodes tend to skip over worse ones to establish a long range “backbone” connection between remote points.

In addition, the number of peers you want to add depends on what you want to do. If you only want to connect to the network, then 1 (better connected) peer is technically enough, but this acts as a single point of failure. Two to four peers adds some redundancy, but keep in mind that you may end up routing traffic between these peers if that ends up being the best route they can find. If your goal is to set up a public peer that can route traffic for the network, and you have enough bandwidth to spare, then keep adding peers. Generally speaking, an asymmetric home internet connection shouldn’t try to route traffic. And, wherever possible, replace internet links with real connections over directional wifi or similar – to avoid having multiple peers share bandwidth over a shared link.

What happens when things go wrong

Let’s imagine we have some nodes in New York, and initially they follow the peering rules outlined above. Now suppose that two of these nodes decide that they want to add connections to London. In Yggdrasil, nodes tend to select parents that minimize latency to the root, which happens to be a node in Paris at the time I’m writing this. As a result, both of the NY nodes are likely to select their respective London peers as their parents. If the nodes are following the peering rules, then at least one of them has also decided to peer with the other, so they have a shortcut they can use to talk to each-other (or any descendants in the tree).

However, if they ignore the peering rules and don’t peer with each other, then they are likely to route through London instead of communicating over their local mesh network. A shorter path exists, through their local mesh network, but it’s not one that the network must know about for routing to work, so they won’t necessarily know about it. As a result, the latency between these two nodes (or decedents thereof) will likely be an order of magnitude more than it needs to be (and probably lower bandwidth as well).

Conclusion

Yggdrasil was designed with scalability in mind, and to that end, it makes some assumptions about how nodes in the network are connected to avoid communicating unnecessary information. Peering over the internet allows you to violate these assumptions. When this happens, it’s possible for network performance to suffer unintended consequences when adding new links. If you prioritize adding new links the same way as you would when building physical links, you can expect lower latency and, in many cases, higher bandwidth, compared to adding peers at random.