When choosing an algorithm for a task, we often try to minimize the computational work needed to complete it. That’s usually a good idea, but there are situations in which modern processor features cause this approach to backfire. To show an example, I benchmarked the performance of several common sorting algorithms on small, randomized arrays.

If you’re familiar with asymptotic complexity, you’ll know that O(n log n) algorithms become much faster than O(n²) algorithms as n increases. Indeed, we see this starting to happen on the right side of the graph. What’s surprising is how long it takes to get there. Insertion sort is faster than quicksort¹ all the way up to n = 128.
It’s tempting to assume that insertion sort just happens to do less work up to this point (on account of the constant factor we’re ignoring with big O notation), but running the benchmarks through perf stat shows this to be false. In fact, insertion sort performs over three times as many instructions as quicksort on arrays of 128 elements.

perf stat output for benchmark of insertion sort on arrays of 128 32-bit integers. All of the perf stat output in this post was produced using 1,000 benchmark iterations, with each benchmark iteration calling the sort function 256 times to dilute the impact of the benchmark loop — i.e., all perf stat screenshots show statistics for 256,000 sort calls.
perf stat output for quicksort benchmark. The number of instructions executed is much lower than for insertion sort, but that comes at the cost of more branch prediction misses and slower execution time.

So what gives? The primary culprit is branch prediction misses — quicksort suffers from nearly three times as many as insertion sort does. Branch prediction misses are expensive events that force the processor to throw away speculative work and resume from the missed branch instruction. Quicksort uses fewer instructions than insertion sort, but it also wastes more clock cycles due to the processor failing to predict what it should do next. This is not unique to quicksort, as we see in the perf stat output for the other two O(n log n) algorithms.

perf stat output for heapsort benchmark. Twice as many instructions and 35% more branch prediction misses compared to quicksort.
perf stat output for merge sort benchmark. Slightly fewer branch prediction misses than quicksort, but nearly four times as many instructions.

The problem is that getting a comparison sort to run in O(n log n) time requires each comparison to yield as much information as possible², and more information means less predictability. Insertion sort performs as well as it does precisely because it does a bad job of gaining information from each comparison. On each iteration, it searches through a sorted portion of the array to find out where to insert the next unsorted element, and repeats this until the entire array is sorted. During each search, every comparison yields the same result, except the final one identifying the insertion point. This is wasteful in terms of the number of instructions, but it means the processor can zip through those instructions with blazing speed.
Bubble sort is the worst of both worlds. It’s similar to insertion sort in that it repeatedly searches part of the array to add elements to a sorted portion, resulting in O(n²) complexity. The difference is that, instead of searching the sorted portion of the array, it searches the unsorted portion, where the outcome of each comparison is unpredictable. The result is just terrible.

perf stat output for bubble sort benchmark. Nearly twice as many instructions and a whopping 12 times as many branch prediction misses as insertion sort.

The best solution is to combine the strengths of insertion sort and quicksort into a hybrid algorithm. The C++ standard library std::sort algorithm uses introsort, which starts with quicksort for large arrays, then switches to insertion sort when the quicksort partitions get small enough³,⁴. The result is near-optimal performance for all but the smallest arrays.

std::sort performed on arrays of 128 32-bit integers. The overhead of using a hybrid algorithm causes degradation of performance for arrays smaller than 8 elements, but then std::sort performs about as well as insertion sort for arrays up to the threshold size of 16 elements. Above the threshold size, it performs slightly better than quicksort due to switching to insertion sort for small partitions.

Sorting arrays of integers is a simple example, but the broader point is this: doing extra work is beneficial if that work can be done in an amount of time that would otherwise have been wasted. This phenomenon is only becoming more common as processors become increasingly reliant on caches and branch prediction to keep their execution units busy. The competing effects at play make it more important than ever to base performance-related decisions on measurements rather than guesswork.
1. Quicksort technically has a worst-case complexity of O(n²), but it performs in O(n log n) time on average. For sorting random data, it can be safely treated as an O(n log n) algorithm. ↩︎
2. Decision tree model – Wikipedia ↩︎
3. Introsort also uses heapsort as a fallback O(n log n) option in case quicksort exhibits its worst-case O(n²) complexity, but that is unlikely to happen for random data. ↩︎
4. The threshold size where introsort switches to insertion sort for gcc is 16 (_S_threshold in stl_algo.h). ↩︎
