Yggdrasil

End-to-end encrypted IPv6 networking to connect worlds

Home
About
Admin API
Blog
Configuration
FAQ
Installation
Platform Notes
Changelog

GitHub
CircleCI
Latest Builds
Public Peers
Public Services

Acting out

1 September 2019 by Arceliar

Overture

We’ve recently rewritten much of Yggdrasil’s internals to change from Go’s native communicating sequential processes (goroutine+channel) style to using an asynchronous actor model approach to concurrency. While this change should be invisible to the average user, it dramatically changes what we developers need to think about when working on the code. I thought it would be useful to explain a little about the motivation for rewriting things this way, and what the consequences are.

Caution: theatre puns and references throughout, because Actors.

Exposition

Yggdrasil is written in the Go programming language. Go makes it easy to start a function running concurrently, and gives developers the tools they need to make concurrently executing functions communicate, but it’s not always easy to use them correctly. To be clear, the things I’m about to rant about are all fixable. Working around them is a normal thing to do in Go. More importantly, it’s a case where doing things the obvious way (which is sometimes even safe in isolation) leads to wrong behavior in a larger program. I prefer models where the obvious thing is still correct, and non-obvious things are only needed as a performance optimization.

Composition

There’s a common pattern that has emerged many times in the Yggdrasil code base. We’ll have a struct with some mutable fields that need reading or updating, such as information about a particular cryptographic session, or the switch’s table of idle peers and buffered traffic. Since shared mutable state is hard, and Go is all about “Share Memory By Communicating”, we’ll have packets get passed to a dedicated worker goroutine that “owns” that particular struct. The worker uses information from the packet and the owned struct to do whatever it is needs to do, updates these things accordingly, and passes the packet along to the next goroutine in the pipeline.

This often results in a “for select” pattern, where goroutines sit in an infinite for loop and select on several channels, to wait for packets to process or various types of signals from other goroutines. There are a few ways around it (with heavy use of reflect or chan interface{}, for example), but in most cases, every select statement needs to fully enumerate every behavior that the goroutine may need to engage in at that point in the code. If there’s a common set of cases that always need to be handled, and then a few exceptional cases that may or may not matter (possibly when the associated structs the workers are using are similar but not exactly the same types, or as the state of a struct’s fields change), then that typically involves multiple select statements with only the addition or modification of one or two cases.

Go embraces composition in its type system, but select statements (and channel operations in general) make execution resistant to composition.

Deadlocks

The “for select” pattern is safe, as far as I know, if the flow of messages through the program form a directed acyclic graph. However, in our case, cycles emerge if we try to handle things in the obvious way. For example, a cryptographic session needs to somehow get outbound encrypted traffic to the switch, but incoming encrypted traffic also needs to make it from the switch to the sessions for decryption (via the router, which is responsible for, among other things, identify which session is associated with the traffic).

When cycles of goroutines naively pass messages over channels, deadlocks are all but inevitable. There are a few ways to address this, but they’re not always appropriate. Ideally, we would change the design to remove cycles, but this is not always possible, and may require significant changes to the workflow in cases where it is possible. In practice, what we’d actually do is either buffer messages (having some dedicated reader goroutine to take the message, add it to a slice, and then pass it to the real destination ASAP) or drop messages entirely (with a select statement that aborts and does cleanup in a default case, or by having a dedicated reader that drops messages more intelligently, such as from the front of the queue, under the assumption that older messages are less useful).

Leaks

Typically, when a goroutine is started, it continues to run until either the function returns or the program exits. For this reason, if a goroutine executes any statements which can block (such as a channel operation), it’s important to include some case which signals that it’s time to return. Forgetting to do this can result in goroutine leaks. Never start a goroutine without knowing how it will stop, or so the experts say.

This is sometimes harder than it needs to be. To be blunt, the single producer N consumer cases are fine, you just close the channel and have all the consumers take this as a signal to exit. Anything involving multiple producers requires some sort of signaling to indicate that all producers have exited. Since you’re using a channel already, the obvious option is a select statement with another channel that closes to signal shutdown, and then something like e.g. a sync.WaitGroup to wait for all producers to exit before closing the channel. Until your number of producers needs to change at runtime, and you realize that this races if you start to Wait before Adding everything to the group, so you need to implement a custom counter, and be careful that additions and subtractions can also race and cause it to shut down early. And have fun solving it, because with how much select resists composition and code reuse, you’re going to be implementing the same patterns over, and over, and over, and over…

It’s not that this is some impossible problem to solve, it’s just that Go’s take on the CSP, combined with the rest of the tools the language gives you, makes it easy and concise to run thing the wrong way, and leads to comparatively complex and delicate code when trying to run it the right way. At least, that’s my personal view of it based on my experience so far, but it probably varies some based on the problem the code is trying to solve.

Rising action

The actor model is another programming paradigm that embraces concurrency with a “share memory by communicating” philosophy.

For our purposes, an actor is basically a data type with a few special properties:

  1. It has an inbox where messages to the actor are placed.
  2. It has an associated unit of execution, such as a thread, which processes messages from the inbox one at a time.
  3. Rather than exposing ordinary functions for other code to call, the actor exposes behaviors. A behavior is a function which has no return value, and is executed only for its side effects. When an actor A calls a behavior of an actor B, what really happens is that A places a message in B’s inbox, and B processes that message by executing some code.

Different implementations differ on details after that, such as what order messages are processed in, if actors are allowed to wait for a particular type of message before continuing, whether actors run locally or are distributed across a cluster, etc., but they tend to all include some version of the broad strokes above.

Turing point

I’m particularly fond of the pony programming language’s take on the actor model. I really can’t being to say enough nice things about their approach, and fully describing it is beyond the scope of this blog post, but if you come out of here with an interest in the actor model, then I highly recommend checking out that language. Maybe watch a few of the talks from the developers that have been posted to youtube, or read their papers about what is easily the most promising approach to garbage collection I’ve ever come across.

Anyway, I don’t actually work on anything written in pony, but I like their version of the actor model so much that I decided to see if I could trick Go’s runtime into faking it. The result is phony, which manages to do most of what I want in under 70 lines of code. When we write code using this asynchronous message passing style, instead of ordinary goroutines+channels, the implications are pretty significant:

  1. There are no deadlocks. Message sends always succeed, and are quite fast (it doesn’t even require CAS instructions in the normal case).
  2. Inbox sizes stay small due to backpressure: if the sender sees that the receiver’s inbox has too many pending messages, it will schedule itself to stop at some deadlock-free safe point in the future, to wait until the receiver signals that it’s handled the message.
  3. Actors are shockingly lightweight: on a modern 64-bit processor, an idle Actor’s only resources are 24 bytes for an empty Inbox, some of which is padding that may not apply if embedded into a struct. In particular, an idle Actor with an empty Inbox has no associated goroutine, so it requires no stack.
  4. The lack of a goroutine also means that idle Actors, even cycles of Actors, can be garbage collected automatically.
  5. Any struct that embeds an Inbox satisfies the Actor interface. Since Actors encapsulate their own unit of execution, it means the range of behaviors that unit of execution can engage in are encoded into the type system and can even be abstracted through interface types. In my opinion, the resulting code is cleaner, easier to read and understand, and far easier to reuse or extend than the for select pattern from goroutine+channel use.

Falling action

I’m happy enough with the current state of phony that I decided to start migrating the yggdrasil-go code base to use it. This is still work in progress (there are some non-Actor goroutines around the edges of the code, mostly in main Accept loops and that sort of thing), but the hot paths are now Actor based.

Most of this was done in a weekend and came together with surprisingly little pain. I had exactly 2 crashes the entire time (1 accidental nil pointer deference and 1 legitimate bug I needed to fix in phony), and more importantly, 0 deadlocks. Most things just worked as intended the first time they compiled. There were a few bugs to work out when I was rewriting the link code, but nothing compared to the mess I had to deal with when writing the old code (which was a couple of horrifying interdependent for select loops to build a state machine).

So by now you’re probably wondering what any of this looks like in practice. Just to give a generic example, suppose we have some struct with an exported function that needs to run code on a worker goroutine. We could end up with something like the following when writing Go in the CSP style:


// This is the function we want the worker to run.
func (n *NonActorStruct) theFunction(arg1 Type1, arg2 Type2) {
    // this is where the code we actually care about goes, the rest is basically boilerplate
}

// This is the struct that we want the worker to own and manipulate.
type NonActorStruct struct {
    inputForTheFunction chan argsForTheFunction
    // fields we care about, plus maybe more channels for other things
}

// Needed to initialize the channel to a working state
func NewNonActorStruct() *NonActorStruct {
    n := NonActorStruct{
        inputForTheFunction: make(chan argsForTheFunction),
    }
    return &n
}

// This is just a helper struct to carry arguments for the function.
type argsForTheFunction struct {
    Arg1 Type1
    Arg2 Type2
}

// This is the function we export.
func (n *NonActorStruct) RunTheFunction(arg1 Type1, arg2 Type2) {
    n.inputForTheFunction<-argsForTheFunction{arg1, arg2}
}

// This is needed to start the worker, otherwise things block.
func (n *NonActorStruct) Start() {
    go func() {
        for {
            select{
            // cases for other things we may need to do would also be here
            // presumably at least one is involved in safely shutting down
            case args := <-n.inputForTheFunction:
                // We could possibly have a switch statement here
                // Then switch on the arg type to pick which function to run
                n.theFunction(args.Arg1, args.Arg2)
            }
        }
    }()
}

// This is needed to stop the worker when we're done.
func (n *NonActorStruct) Stop() {
    // Actual implemenation depends on what else the worker does in its loop,
    // but it probably just sends a specific message and/or closes some channel.
}

// Then to use the code, we have something like:
myStruct := NewNonActorStruct()
myStruct.Start()
defer myStruct.Stop() // Or arrange this to happen somewhere else
myStruct.RunTheFunction(arg1, arg2)

When migrating to the actor model, the basic pattern that emerged was to embed a phony.Inbox into any struct we wanted to make into a phony.Actor, and then define functions of the struct like so:


// This is the function we want the worker to run.
func (a *ActorStruct) theFunction(arg1 Type1, arg2 Type2) {
    // this is where the code we actually care about goes, the rest is basically boilerplate
}

// This is the struct that we want the worker to own and manipulate.
type ActorStruct struct {
    phony.Inbox // This defines the Act function, satisfying the Actor interface
    // fields we care about
}

// This is the function we export.
func (a *ActorStruct) RunTheFunction(from phony.Actor, arg1 Type1, arg2 Type2) {
    a.Act(from, func() {
        a.theFunction(arg1, arg2)
    })
}

// And then to use it, an Actor x would run something like:
myActor := new(ActorStruct)
myActor.RunTheFunction(x, arg1, arg2)

And that’s about it. The first argument to myActor.RunTheFunction also nilable, if we have non-Actor code that needs to send a message, it just means there’s no backpressure to slow down the non-Actor code if it’s sending messages faster than the Actor can handle them. A phony.Block function exists to help non-Actors wait for an Actor to process a message before continuing, since this seems like a common enough use case (especially when a package wants to export a non-Actor interface that uses Actor code internally).

What’s great is that we don’t need to think about starting or stopping workers, deadlocks and leaks are not possible outside of blocking operations (e.g. I/O), and we can add or reuse behaviors just as easily as any function. I find the code easier to read and reason about too.

I/O is one rough spot, since an Actor can block on a Read or a Write and not process incoming messages as a result. This isn’t really any worse than working with normal Go code, and the pattern we’ve adopted is to have separate Actors for Read and Write, where one mostly just sits in a Read loop and sends the results (and/or error) somewhere whenever a Read finishes. These two workers can be children of some parent Actor, which is the only one the rest of the code needs to know about, and then all we need to remember to do is close the ReadWriteCloser (e.g. socket) at some point when we’re done. This is the sort of thing that we’ll eventually want to write a standard struct for, update our code everywhere to use it, and then never have to think about it again. In the mean time, we have a couple of very similar implementations for working with sockets or the tun/tap device.

Dénouement

The Go language makes concurrency easy, but for some problems it can be difficult to do safely out-of-the-box. However, the language provides the tools needed to implement an actor model approach very easily. While I won’t claim that the actor model is a panacea for all development woes, Yggdrasil by its very nature requires us to think about networks of nodes communicating asynchronously, so it makes sense to use a programming paradigm that lets us model that approach more explicitly in our code base. Outside of a couple of corner cases (namely blocking I/O for the network sockets and the tun/tap device), we expect this to obviate any need to even thing about deadlocks, make development easier moving forward, and generally lead to a better user experience as a result. The code migration is still a work in progress, but Actors have replace for select workers along the hot paths through the code (minus 1 crypto worker pool in the session code) and will slowly replace synchronization primitives in the remaining code base. The current code has been merged into our develop branch, and I’m quite excited to see it land in Yggdrasil v0.3.9, along with the usual bug fixes and incremental improvements, which we plan to release in the near future.