Hybrid Containers for C++

Scheduling has become a subject of great interest in recent years thanks to the work of Linux developers to make the kernel more responsive on the desktop. CFS was a major breakthrough in 2.6.23, and other patches followed, tuning kernel behavior for other latency-sensitive applications.

In many trading applications, the scheduler plays a central role. In live trading, the scheduler follows wall-clock time and is part of the main event loop, but it is subordinate to the subsystem that actually handles events from every possible input and output of the application, often called the reactor.

In historical backtesting, though, the scheduler is what actually drives the process. It might skip time between known events, typically read from a disk file or database (the usual event-driven backtest), or it might follow a pseudo-time that starts at a given instant in the past and advances at a given speed, or time multiplier, as if the events were occurring again (accelerated replay).

In the first situation, the scheduler has to respond with minimal latency, i.e. it has to handle each event individually as fast as possible. In the second situation, the total running time of the whole application has to be minimized. These are two separate requirements.

The Problem

In a very condensed form, the scheduling problem can be seen as developing a class with the following interface:
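(The sketch below is a reconstruction from the description that follows; all names and signatures, such as Event, Scheduler::schedule() and Scheduler::check(), are illustrative assumptions rather than our production code.)

    #include <cstdint>

    using Time = std::int64_t;  // e.g. nanoseconds since some epoch (assumption)

    struct Event {
        virtual ~Event() = default;
        // Invoked by the scheduler when the event's time arrives; `user` is the
        // opaque, user-owned pointer supplied at schedule() time.
        virtual void callback(Time tm, void* user) = 0;
    };

    class Scheduler {
    public:
        // Store `ev` to fire at time `tm`, together with the opaque pointer `user`.
        // The same Event object may be scheduled multiple times.
        void schedule(Event* ev, Time tm, void* user);

        // Fire, in time order, every event whose scheduled time is <= now.
        void check(Time now);
    };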

The scheduler takes a subclass of Event and stores it in an appropriate data structure along with the respective time tm and some event-specific data, here simplified as an opaque, user-owned pointer user. The same object may be scheduled multiple times in the future.

At specific intervals the scheduler’s check() is called with the current time. Every event scheduled for a time less than or equal to the time passed as argument needs to be executed by calling its virtual callback method.

Trivial Implementations

In a very trivial implementation one could use STL’s std::priority_queue, perhaps std::multimap, or even better boost::container::flat_multimap. All of these suffer from a big problem though – they are all at least O(log N), and that quickly becomes a bottleneck when dealing with millions of events being scheduled per second.
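As an illustration of the trivial approach, here is a sketch built on the interface above with std::priority_queue; the entry layout and names are assumptions for illustration only.

    #include <functional>
    #include <queue>
    #include <vector>

    // Trivial O(log N) scheduler: a min-heap ordered by time.
    class TrivialScheduler {
        struct Entry {
            Time   tm;
            Event* ev;
            void*  user;
            bool operator>(const Entry& o) const { return tm > o.tm; }
        };
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> queue_;

    public:
        void schedule(Event* ev, Time tm, void* user) {   // O(log N) push
            queue_.push(Entry{tm, ev, user});
        }

        void check(Time now) {                            // O(log N) pop per due event
            while (!queue_.empty() && queue_.top().tm <= now) {
                Entry e = queue_.top();
                queue_.pop();                             // pop first so the callback may re-schedule
                e.ev->callback(e.tm, e.user);
            }
        }
    };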

When dealing with financial data in equities, for example, we frequently need to schedule one event every millisecond for each of the roughly 8,000 US tickers for a given combination of, say, strategy and alpha. It is easy to end up with over 1,000 events per symbol, which totals 8,000,000 events being handled by the scheduler at any given time.

Of course this problem can be solved in a different way with multi-layer, hierarchical data structures, where a global scheduler delegates to local schedulers. However, the added indirection creates performance bottlenecks that considerably slow down the combined throughput of the system. It is important to note at this point that the biggest bottlenecks do not occur in the live trading system, because a reasonably written system can handle the entire US equities market on a single core these days.

The biggest problem appears when running optimizations in historical/backtest mode, where multiple simulation runs spanning several days of data are piled on top of each other, each executed with different parameters. Machine learning, which is very compute-intensive, is a perfect example. At our firm we developed an O(1) constant-time scheduler that overcomes the O(log N) bottleneck, but dwelling on the internals of that development is beyond the scope of this post.

(The intent of this post is to throw another friendly incendiary bomb at the C++ committee for not including intrusive data structures in the STL since its inception in the 90s)

Option 1

The first data structure that complies with the STL requirements – the strictest being that an object must be able to migrate freely across different containers – is an external-node priority list, as shown in the picture below.

Of course this is a very simplified implementation, since a simple linked list needs a linear scan to find the right insertion point in the middle. However, this simplified view allows us to make the point, and it extends to most, if not all, types of implementations.

In this case, the scheduler allocates an external node, hopefully from a memory pool, that points to the actual event to be fired. As several nodes can point back to the same object, the implementation is actually complete, allowing events to be scheduled multiple times.
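A rough sketch of what such an external node might look like (the layout is an assumption for illustration, using the types from the interface sketch above):

    // Option 1: external node owned by the scheduler; the Event knows nothing
    // about it. Every schedule() call materializes one of these, ideally from a pool.
    struct ListNode {
        Time      tm;     // when to fire
        Event*    ev;     // points back to the user's event object
        void*     user;   // opaque user pointer handed back on callback
        ListNode* next;   // singly linked, kept sorted by tm in this simplified view
    };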

Assuming the implementation gets better and the actual data structure navigation gets down to single-digit nanoseconds, another problem creeps up: doubling the number of objects in memory conspires to increase the number of cache misses.

Option 1 – Baseline Results

To exemplify, we ran a simulation with one of the very first incarnations of our in-house scheduler, which used external nodes. We keep a given number of objects inside the scheduler at all times: as one event fires, we immediately replace it with another event, keeping the number of events inside the scheduler constant. This is the first column.

The time distribution for each event follows a uniform distribution over 1 second. We then repeat the process for 10 minutes (simulation time), which corresponds to 600 callbacks for every single object in the scheduler.
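A sketch of that benchmark loop, again using the illustrative interface above (the re-scheduling logic follows the description; the names BenchEvent, sched and rng are hypothetical):

    #include <random>

    // Each fired event immediately re-schedules itself at a uniformly random
    // offset within the next second, keeping the population constant.
    struct BenchEvent : Event {
        Scheduler*       sched = nullptr;
        std::mt19937_64* rng   = nullptr;
        void callback(Time now, void* user) override {
            std::uniform_int_distribution<Time> next(1, 1'000'000'000);  // up to 1 s, in ns
            sched->schedule(this, now + next(*rng), user);
        }
    };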

This means that in the case where we keep 1,000,000 objects in the scheduler, the entire run fires 600 x 1,000,000 = 600,000,000 distinct events. As the simulation took 124 s to complete, that is around 4.8 million events per second, which is not bad but far from optimal.

The second column shows the number of cache misses for the entire run. Dividing this number by the total number of events gives a very informative metric: the number of cache misses per event (5th column). It can be seen that as the number of objects in memory increases, the number of cache misses also increases.

It is very interesting to see that once the number of objects reached 100,000, the number of cache misses started to increase almost linearly with the logarithm of the number of objects in memory.

More interesting yet is that the total execution time per event is almost perfectly determined by the number of cache misses. Every event takes 60 nanoseconds plus 54 nanoseconds per cache miss. Taking into account that this machine runs at 4.2 GHz, every cache miss incurs a penalty of roughly 226 cycles.

Notice that one “event” is in fact a pair of schedule + callback. Therefore there are around 113 cycles per actual call, which matches very closely Intel’s optimization manual estimate for the number of cycles to retrieve a 64-bit integer from external memory (~110 cycles). That gives us confidence that we understand the behavior of our algorithm quite accurately.
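The arithmetic behind those figures, written out as a quick sanity check (the 4.2 GHz and 54 ns numbers come from the measurements above):

    #include <cstdio>

    int main() {
        const double ghz             = 4.2;   // machine frequency
        const double ns_per_miss     = 54.0;  // fitted slope, option 1
        const double cycles_per_miss = ns_per_miss * ghz;      // ~226.8 cycles
        const double cycles_per_call = cycles_per_miss / 2.0;  // one "event" = schedule + callback
        std::printf("%.1f cycles/miss, %.1f cycles/call\n", cycles_per_miss, cycles_per_call);
        return 0;
    }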

It can be clearly seen that the algorithm’s bottleneck is the memory cache.

Option 2 – Intrusive List

A better option is to include the node inside the Event object itself, in what is called an intrusive list, as depicted in the figure below.

The obvious problem with this solution is that, as the node is embedded in the Event object, there is only room for a single node per event. Therefore, each Event object can only be scheduled once. This is very efficient but very limiting. When designing such a scheduler interface, you do not want to burden yourself (or your users) with guaranteeing that every object is scheduled at most once. You could return a failure flag when the object has already been scheduled, but what would the application do in that case?
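A sketch of what the intrusive variant might look like (names are illustrative; a production version would more likely use something like Boost.Intrusive than a hand-rolled hook):

    // Option 2: the link lives inside the event itself, so scheduling allocates
    // nothing, but the single embedded hook means the event can sit in the list
    // only once at any given time.
    struct IntrusiveEvent {
        Time            tm   = 0;
        void*           user = nullptr;
        IntrusiveEvent* next = nullptr;   // non-null (or a flag) means "already scheduled"
        virtual ~IntrusiveEvent() = default;
        virtual void callback(Time tm, void* user) = 0;
    };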

Obviously the idea is to find a compromise. Ideally, when an event is scheduled only once, we want the performance gain of option 2 due to its minimized number of cache misses. But we also want the flexibility to schedule objects more than once.

Option 3 – Hybrid

A hybrid approach then comes to mind. A node structure is embedded in the base class of Event, and each node points to the next one. When schedule() detects that the embedded node is already in use, it allocates a new standalone node from a local memory pool.
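A sketch of the node-selection logic in such a hybrid schedule() (illustrative only: the pool, the ordered insertion and the actual container are omitted, and all names are assumptions built on the interface sketch above):

    struct Node {
        Time   tm     = 0;
        Event* ev     = nullptr;
        void*  user   = nullptr;
        Node*  next   = nullptr;
        bool   in_use = false;
    };

    // The base event carries one embedded node: the common "scheduled once" case
    // costs no extra allocation and touches no extra cache line.
    struct HybridEvent : Event {
        Node embedded;
    };

    class HybridScheduler {
    public:
        void schedule(HybridEvent* ev, Time tm, void* user) {
            Node* n = !ev->embedded.in_use
                          ? &ev->embedded        // fast path: reuse the embedded node
                          : pool_allocate();     // already scheduled: fall back to a pooled node
            n->in_use = true;
            n->tm     = tm;
            n->ev     = ev;
            n->user   = user;
            insert_sorted(n);                    // container-specific ordered insertion (not shown)
        }

    private:
        Node* pool_allocate();                   // hypothetical local memory pool
        void  insert_sorted(Node* n);            // hypothetical insertion into the time-ordered structure
    };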

With this approach, we minimize memory fragmentation due to external node creation while keeping the flexibility to schedule the same event multiple times. Our current in-house scheduler uses this approach.

The hybrid approach also has the added benefit that events can be scheduled in two separate schedulers/containers, i.e. the event does not necessarily belong to a single container as in the purely intrusive case.

Option 3 – Results

With this approach, the behavior of the algorithm is much better. For one million objects in memory it produces 2.2 cache misses per event, while for option 1 the number was 2.9 cache misses per event. This might not seem like much, but it makes a huge difference: the total execution time of a full run with 1M objects in memory is 124 seconds for option 1, while option 3 takes a mere 22 seconds.

As for performance as a function of cache misses, our new algorithm takes a fixed 73 nanoseconds (roundtrip schedule + execute) per event plus an additional 40 nanoseconds per cache miss, down from 54 nanoseconds per miss with option 1.

The fixed time per event is higher due to the increased complexity of the node handling, which takes a toll for small simulations. However, this is largely compensated by the lower slope.

For small sets, say with only 100,000 objects, which might even be considered an enormous amount for the most common applications, we can run over 16 million roundtrip events per second. For bigger sizes we start to hit cache misses, but we still run a very cool 6.1 million roundtrip events (schedule+fire) per second, per core, working on a container with never fewer than 1 million objects randomly distributed over time. Combined across 8 CPUs this hopefully scales up to 49 million events per second (but this needs to be tested because of cache interactions). Even with 10 million objects, occupying almost 1 GB of memory, we can process over 3.7 million roundtrip events per second, which is excellent scalability.

As we favor the results for very large simulations, this is the approach we currently use in production.

Conclusion

In this short experiment, intended for students and professionals still gaining experience in the HPC field, we tried to give a small glimpse into the methodology we use at our firm to optimize an algorithm against a concrete application requirement.

The main takeaway is that, for very large applications, it is essential to adopt data structures that maximize memory-cache performance, and it is very important to write and test every plausible alternative to understand this behavior, as the impact of the cache often flies in the face of traditional academic big-O analysis.

Writing high-performance algorithms is nowadays not so much a task of knowing data structures from a textbook. It is instead a task of laboriously generating every possible implementation and testing each and every one of them ad nauseam.

The results very often surprise even the most seasoned technologists.
