Yes, You Have Been Writing SPSC Queues Wrong Your Entire Life

There was a recent post on Hacker News by Juho Snellman where he claims “I have been writing single producer, single consumer queues wrong my entire life” and presented a solution for it.

The problem seems to be that SPSC queues as usually defined will never be able to fill all N slots because of index wrapping. Imaging a simple queue as defined below.

The write/push function can be defined as

The respective read/pop method is defined as

Notice that next_idx cannot be equal to read_idx, otherwise we will assume the queue is full. So the maximum value for next_idx is read_idx-1. This implies the last slot before the read pointer will always be unused.

Personally, I have never seen this as a problem. Many other implementations of queues waste one divisor node. But let’s assume that this is a big deal in some memory-constrained cases as in embedded.

By the way, I tend not to use modulus in implementations because in most platforms I use a sure integer division is more expensive than the rare branch miss. And PLEASE do not go ape on the const-ness of the proposed C++ code or in the syntax of assignment of the objects. I wrote this in the past half hour just to show how the indices would work, with no regard to usability.

Snelmann’s post offers a solution: instead of wrapping the indices, let them grow unbounded and cap only the array indices. The implementation is the following:

This is very elegant and simple because now you don’t need to be manually cycling the indices. As write_idx is always greater than read_idx, the entire modular arithmetic is thrown away.  The obvious drawback is that both indices can grow unbounded and cause all sort of problems.

One could go crazy and set the indices to uint64_t but that causes two problems: the first is that in most platforms an atomic read or write is guaranteed only for a certain size. On Intel this is 32 bits for the general case of unaligned access. Second, the basic motivation for this article is an embedded case where the memory space of one node is precious so this waste of resources should not be allowed. Snellman’s solution also forces the number of elements to be a power of two.

All these give me the shills just thinking of a production crash. Even if one proves me this is correct in some way, I would still think this is too close to the cliff to be comfortable.


However there is an easy fix for these problems. The basic idea is that we need to reset the two indices at some point in time, but when? There are only two code points we can possibly do this modification: push() or pop(). On a SPSC queue you can only modify write_idx in push() and, respectivelly, can only modify read_idx inside pop(). As you cannot do a simultaneous change on both indices at exactly the same time, we need to cycle one index then the other and in the meantime the difference remains the same.

It is obvious that we need to play some modular arithmetic magic here. First, we need to allow the result of the modular arithmetic to be equal to size. Therefore we cannot mod by size since that would yield zero. The obvious conclusion is to widen the divisor – let’s try with 2*size then. This means we will cycle both indices every 2*size steps.

However, if both indices can roam through the [0,2*size) interval freely, we will have the inconvenient case where write_idx is less than read_idx. That creates a nuisance because if we use unsigned indices such difference will underflow and that will sure give us a segfault for Christmas. So we need somehow make write_idx be always greater than read_idx. How? Just make the two ranges non-overlapping. Would that work? Fortunately, yes!

The simple trick is to employ two strategies:

  1. read_idx is allowed to occupy values in the range [0, 2*size)
  2. write_idx is allowed to occupy values in the range [2*size, 4*size)
  3. mod the resulting difference by 2*size

Using this method, the expression  for the number of elements in the queue

will always be in the range [0, 2*size). This allows us to fill up the queue to the last element because numel can be equal to the queue size.

Also because the ranges do not overlap and the write index range is superior to the read range, we can use unsigned indices: the difference will always be a positive integer or zero.  And because the ranges are contained, none of the two indices can ever overflow.

Let’s run mentally some cases for size=8:

So you can see that the modular difference between write_idx and read_idx with this arrangement will always reflect perfectly the correct number of elements in the queue.

Let’s have a look at the C++ implementation. The constructor for this class has to be slightly different from the previous queues.The only modification from the simple ring is the initialization of write_idx with 2*size:

The push() method for this class is not much different from the previous classes. The only difference is computing how the queue is not full (numel < size) and the wrapping of write_idx.

The read/pop() method is very similar

All these implementations are available on my github help repository.

POST EDIT: we added some contributions users have been doing in the comments. Use of std::atomic<T>, custom allocators and choice of index type. The changes are in the github but were not reflected in this text yet.

Leave a Reply