How µWebSockets achieves efficient pub/sub
A technical explanation for those interested in the gory details
Imagine being a paperboy, tasked with delivering letters to a set of houses by car. Would you drive around all houses, delivering one letter at every stop, going round all houses many times until you have no more letters to deliver? Or would you sort the letters by house, then drive one round delivering all letters in one go, one bundle of letters per house?
A shockingly huge majority of web servers will drive that car all day long, stopping for one single letter at a time. Socket.IO, one of the most popular pub/sub implementations in use, is such a case. Other popular implementations performing the same madness ritual include SocketCluster, Primus, Mosca, ClusterWS and pretty much the entire remaining Node.js ecosystem.
So how do the above metaphors map to technical counterparts? Going around all houses would be a for-loop over all sockets in a topic. Stopping to deliver one letter would be calling .send on a socket with the contents of the letter as message. This is the naive, “bubble sort” solution. A simple for-loop with .send calls, called with every publish.
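The naive solution can be sketched in a few lines. Note that `Socket` and `publish` here are illustrative stand-ins, not the actual API of Socket.IO or any of the libraries named above:

```cpp
// A minimal sketch of the naive approach: one send() per socket, per publish.
#include <string>
#include <vector>

struct Socket {
    std::vector<std::string> sent; // stand-in for the real network write
    void send(const std::string &message) { sent.push_back(message); }
};

// Every publish loops over every subscriber. With m publishes in one
// event loop iteration, each socket pays m times for records,
// syscalls and packets.
void publish(std::vector<Socket *> &topic, const std::string &message) {
    for (Socket *s : topic) {
        s->send(message); // one SSL record, one syscall, one packet
    }
}
```

Every call to `publish` is one full car trip around all houses, which is exactly the problem described next.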
If we only have one single letter per house, this naive solution works just fine. In this case we will, for every socket, generate one SSL record, do one syscall and emit one packet. This is about the minimum amount of work necessary.
The problem is, you rarely have one single letter to deliver. Even with a PhD in finger-painting you should realize that this algorithm performs increasingly poorly as the number of messages grows: the cost multiplies with every extra message per socket.
For ten messages per socket, you would loop over all sockets ten times, generating ten SSL records, performing ten syscalls and emitting ten packets for every single socket. This is a major bottleneck. Here we would like to have the same number of SSL records, syscalls and packets as if we were sending one message per socket, yes?
But I hear you say, what about latency? If we want to pack things together then surely latency would suffer! Well, no, it’s the opposite actually. To understand why, we need to understand how the kernel and the web server interact.
- The web server is a user space process. It runs in its own, scheduled, time dimension. This is not necessarily real-time.
- The kernel does not run in the same time dimension; it runs more or less in real-time answering interrupts as they happen.
- Networking is handled by the kernel, based on interrupts from the NIC. Incoming data is buffered until the user space process fetches it.
- The user space web server fetches events and subsequently data in batches; the longer between fetches the bigger the batches.
- The more “laggy” the user space web server becomes, the bigger the following batches of events (real-time doesn’t wait for anyone).
- The user space web server works in event loop iterations; one iteration per batch of events from the kernel.
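The batch-fetching step described in the list above looks roughly like this on Linux. This is a hedged sketch, not µWebSockets' actual event loop (which also abstracts other platforms); error handling is omitted:

```cpp
// Fetching one batch of ready events via epoll (Linux).
// One epoll_wait syscall can return many ready events at once;
// the longer the server was busy, the bigger this batch tends to be.
#include <sys/epoll.h>
#include <unistd.h>

// Returns how many events were ready. A real server loops forever:
// fetch a batch, handle every event in it, flush merged publishes,
// then fetch the next batch.
int fetchBatch(int epollFd, epoll_event *events, int maxEvents) {
    return epoll_wait(epollFd, events, maxEvents, 0);
}
```

The key point is that one syscall returns a whole batch, so one event loop iteration naturally handles many events, and potentially many publishes, before anything needs to be flushed.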
Alright, what does this all mean? It means that one event loop iteration will contain a whole batch of events, not one single event per iteration. So when the user space web server is ready to fetch events from the kernel via the epoll_wait syscall, it will fetch potentially thousands of events. Now, let’s say ten of those events result in the publishing of new messages. So now we have ten letters per house to deliver by car.
Do we start driving around all houses delivering one letter per house, or do we sort them and deliver them as batches? Latency is the same; we just fetched the events. But because we can finish earlier with a more efficient algorithm, latency will be lower for the next event loop iteration since we can fetch those events earlier. So latency is significantly better, not worse.
Okay, sounds sane? So how does this technically work then?
µWebSockets implements pub/sub in about 500 lines of code. This is substantially more than the 10-or-so lines of code your typical naive solution would require, but not astronomically complex.
µWebSockets keeps a tree of topics; each topic holds a red-black tree of sockets sorted by their memory address. This means subscription and unsubscription are O(log n) while sockets stay sorted by their “id” within each topic. This allows us to efficiently calculate intersections between multiple “triggered” topics holding the same socket, so that we can “merge” outgoing messages across topics, per socket.
A “triggered” topic is a topic that has been published to during this event loop iteration but not yet flushed, as it may still receive more publishes. We want to keep topics triggered during the whole event loop iteration so that we can efficiently collect and merge publishes and send them off at the very end of the iteration. Sometimes we cannot do this, such as when unsubscribing from a triggered topic. Here we need to flush early to keep things properly synced, but can immediately resume collecting more publishes.
What all this means in a nutshell is that we use a data structure I call “TopicTree” that allows us to perform many publishes, across many different topics during the event loop iteration, then distill that information down to a single for-loop over all sockets where we know exactly all the messages one socket should send, given to us as a batch.
So it allows us to make one single car trip around all houses, delivering perfectly sorted and packed bundles of letters. “TopicTree” works like a letter sorter and bundler, and we simply loop over it to get each bundle. This is how we generate one single SSL record, do one syscall and emit one single packet per socket, even when publishing across topics.
This is (partly) why we can achieve such a massive lead in performance compared to, say, Socket.IO as demonstrated by benchmarks: https://github.com/uNetworking/pubsub-benchmark
So what about the case where we only have one single letter per house, again? That’s fine, “TopicTree” performs in O(n) in this case — the same as your typical for-loop. We call that a win-win.
If my explanation was too messy, you can always go through the code at https://github.com/uNetworking/uWebSockets