One Container to Rule Them All

At least, the two sequential containers I’m going to talk about.

Vector or List? [1]

The classic use case for a linked list is frequently inserting (or removing [2]) elements at arbitrary locations. Inserting into a vector anywhere but the back requires moving existing elements out of the way. But a list only needs to update the links held by the new element’s neighbors. This makes insertion an O(n) operation for a vector and an O(1) operation for a list. The following plot shows how the latency of random insertions scales with size for std::vector and std::list.

A log-log plot of "Time per Insert (s)" versus "Container Size (B)" with curves for "std::vector<4B>" and "std::list<4B>." The vector curve begins with a shallow slope, then grows linearly with a slope of 1 starting around 8192 B. The list curve is above the vector curve, and exhibits bumps in insertion time (in addition to linear growth) as the container size approaches the cache sizes. Vertical dashed lines representing the cache sizes are displayed at 32,768 B, 262,144 B, and 16,777,216 B.
Time to insert an element at a random location as a function of the final size of the data [3]. std::vector is faster than std::list at all sizes and exhibits linear growth [4] starting around 8 kiB. std::list is heavily encumbered by filling cache levels [5], becoming nearly 200 times slower than std::vector at 16 MiB.

“Uh oh. That’s not O(1) performance. And why is std::list slower than std::vector?”

When people say, “insertion into a list is an O(1) operation,” the fine print is that you need to know where you’re inserting first. Here, we can’t know the insertion point in advance because it’s random. Before we insert each element, we need to traverse the list until we arrive at the right place. Traversing a list is an O(n) operation, and an especially slow one because of the unpredictable memory accesses. The point of using the list was to sacrifice organized data for a simpler insertion algorithm, but we got neither.
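The hidden search cost can be made explicit. As a sketch (the helper name `sorted_insert` is mine, not from the benchmark), here is what inserting into sorted order actually has to do in each container:

```cpp
#include <algorithm>
#include <list>
#include <vector>

// Vector: O(log n) binary search, then O(n) element moves.
void sorted_insert(std::vector<int>& v, int x) {
    v.insert(std::lower_bound(v.begin(), v.end(), x), x);
}

// List: O(n) pointer-chasing traversal, then the "O(1)" link update.
void sorted_insert(std::list<int>& l, int x) {
    auto it = l.begin();
    while (it != l.end() && *it < x) ++it;  // every step is a dependent load
    l.insert(it, x);
}
```

Both versions end up O(n) per insert; the difference is that the vector’s element moves are sequential, prefetch-friendly memory traffic, while the list’s traversal is a chain of unpredictable loads.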

Larger Elements

So std::list is terrible for inserting 4-byte integers, but what if we need to insert larger elements? As elements get bigger, so does the cost of moving them. It’s also less helpful to keep elements next to each other if they’re not going to fit in the same cache line anyway. These effects weaken std::vector, but it maintains its lead for a while.

A log-log plot of "Time per Insert (s)" versus "Container Size (B)," similar to the one above, but using 64 B elements instead of 4 B elements. The vector and list curves are closer together with the vector still performing faster at all sizes. The vector is visibly impacted by filling cache levels, similar to the list, but to a lesser extent.
The same benchmark as before, but with elements padded to 64 bytes. std::vector slows down when it fills caches, but not as badly as std::list.
A log-log plot of "Time per Insert (s)" versus "Container Size (B)," with 256B elements. The vector and list curves now intersect in two places, with std::list<256B> performing better than std::vector<256B> between sizes of 8,192 B and 524,288 B, inclusive.
With 256-byte elements, std::vector performs similarly to std::list.
A log-log plot of "Time per Insert (s)" versus "Container Size (B)," with 1,024 B elements. The curve for std::list<1024B> is approximately flat up to the L2 cache size of 262,144 B, then grows in a straight line up to the maximum size of 16,777,216 B (the actual growth of insertion time is more than linear due to the slope on the log-log plot being greater than 1). The curve for std::vector<1024B> starts slightly below std::list<1024B>, but curves upward and passes the list at 16,384 B. The vector remains slower than the list for all subsequent sizes.
With 1024-byte elements, std::list beats std::vector at most sizes.

You might be thinking the takeaway is to use std::list when performing frequent insertions and the elements are large. However, it’s often better to use a std::vector that holds std::unique_ptrs to the data.
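A sketch of that approach, with a hypothetical 1 KiB element type (the names `Big` and `sorted_insert` are illustrative): the vector stores 8-byte pointers contiguously, so an insert only shifts pointers, never the payloads.

```cpp
#include <algorithm>
#include <memory>
#include <vector>

struct Big {             // hypothetical large element
    int key;
    char payload[1020];  // pad to ~1 KiB
};

// Keeps the vector sorted by key; only pointers move on insert.
void sorted_insert(std::vector<std::unique_ptr<Big>>& v, std::unique_ptr<Big> p) {
    auto pos = std::lower_bound(v.begin(), v.end(), p,
        [](const std::unique_ptr<Big>& a, const std::unique_ptr<Big>& b) {
            return a->key < b->key;
        });
    v.insert(pos, std::move(p));
}
```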

Vector of Pointers

“Doesn’t that defeat the purpose of a vector? How is a vector of pointers different from a list?”

The difference is that a vector of pointers allows random access, which means the processor isn’t forced to read elements in order. While traversing the vector, branch prediction lets the processor start reading the next element before it knows to continue past the current one. You still get cache misses, but now they overlap in time.

The same log-log plot of "Time per Insert (s)" versus "Container Size (B)" for 4 B elements as above, but with an additional curve for the vector of pointers. The new curve starts similar to the list curve, but proceeds with linear (slope of 1) growth while the list sees jumps in latency as it fills caches. The vector of pointers has a small but noticeable bump in latency above 2,097,152 B. At 16,777,216 B, the vector of pointers is ~8.5 times faster than the list and ~22.7 times slower than the ordinary vector.
With 4-byte elements, the vector of pointers initially performs similarly to the list, but isn’t impacted nearly as much by filling caches.
The same log-log plot of "Time per Insert (s)" versus "Container Size (B)" for 1024 B elements as above, but with an additional curve for the vector of pointers. The vector of pointers performs similarly to the list up to the L2 cache size, but then only gradually curves upward while the list sharply turns upward. At 16,777,216 B, the vector of pointers is ~4.5 times faster than the list.
With 1024-byte elements, the vector of pointers beats std::list after the L2 cache is filled.

You can also use a hybrid approach, with “hot” data stored directly and “cold” data stored via std::unique_ptr.
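As a sketch of what that hybrid might look like (the field names here are illustrative, not from the benchmark):

```cpp
#include <memory>
#include <string>
#include <vector>

// "Hot" data stays inline so scans and comparisons stay cache-friendly;
// "cold" data lives behind a pointer and is only touched on a match.
struct Record {
    int key;                               // hot: read on every traversal
    std::unique_ptr<std::string> details;  // cold: dereferenced rarely
};

std::vector<Record> records;  // contiguous hot fields, indirect cold fields
```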

Sorting

So far, we’ve been looking at benchmarks of “repeatedly inserting random elements in sorted order,” better known as “insertion sort.” Since running insertion sort on four million elements is an absurd thing to do [6], let’s also look at the standard sorting algorithms.
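For reference, the loop those benchmarks run is insertion sort in miniature. A minimal sketch on a vector:

```cpp
#include <algorithm>
#include <vector>

// Repeatedly inserting each element at its sorted position is
// insertion sort: n inserts of O(n) each, so O(n^2) overall.
std::vector<int> insertion_sort(const std::vector<int>& input) {
    std::vector<int> out;
    out.reserve(input.size());
    for (int x : input)
        out.insert(std::upper_bound(out.begin(), out.end(), x), x);
    return out;
}
```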

A log-log plot of "Time per Item (s)" versus "Container Size (B)" with curves for "std::sort() on std::vector<4B>," "std::sort() on std::vector<4B*>," and "std::list::sort() on std::list<4B>." Vertical dashed lines represent the cache sizes of 32,768 B, 262,144 B, and 16,777,216 B. The ordinary vector curve is on the bottom and appears to exhibit logarithmic growth. The vector of pointers is just above it and exhibits small bumps in latency after filling caches, on top of logarithmic growth. Both vectors have a small bump in latency going from 64 B to 128 B, due to the underlying algorithm changing between these sizes. The list curve is above both vector curves and grows in roughly a straight line (slope ~1/8, so not actually linear growth) until it has a huge spike in latency as it approaches the L3 cache size.
Performance of standard sorting algorithms on std::vector, vector of pointers, and std::list, with 4-byte elements [7]. The vector of pointers and list are both slower than the ordinary vector and visibly impacted by filling caches, with the list sharply worsening as it fills the L3 cache.
A log-log plot of sorting performance similar to the one above, but with 64 B elements. The overall patterns are similar to those in the 4 B plot, but with the vector of pointers roughly matching the performance of the ordinary vector between 2,048 B and 131,072 B (inclusive). The vector of pointers has higher latency than the ordinary vector outside this range.
With 64-byte elements, the two std::vector implementations perform almost identically from 32 elements [8] to where they fill the L2 cache. The list is slower than both vectors and, again, sharply worsens as it fills the L3 cache.
A log-log plot of sorting performance similar to those above, but with 1,024 B elements. The vector curves now grow in roughly parallel lines (slope ~1/8) above 32,768 B, with the list curve midway between them with a slightly shallower slope (the ordinary vector is slowest and the vector of pointers is fastest). Below 32,768 B, the vectors exhibit different behavior due to a different underlying algorithm. The ordinary vector gently curves upward before sharply bending to the aforementioned line. The vector of pointers curves more aggressively, reaching a local maximum of time per item at 16,384 B.
With 1,024-byte elements, std::list::sort() is faster than std::sort() on the ordinary vector, but slower than std::sort() on the vector of pointers.

Sorting a list is never the fastest option, and is slower than both vector implementations until the elements are 1,024 bytes. Part of the reason is that std::sort() and std::list::sort() use different algorithms. The former uses introsort while the latter uses merge sort. Maybe you think it’s cheating to use a slower algorithm for std::list, but the point is that we don’t use the same algorithm because we can’t — std::sort() only works on random-access containers. std::list being a bidirectional container limits what we can do with it.
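The constraint shows up directly in the iterator categories. A minimal illustration (the helper name `sort_both` is mine):

```cpp
#include <algorithm>
#include <list>
#include <vector>

// std::sort() compiles only for random-access iterators; for a list,
// the member sort() (a merge sort) is the only standard option.
void sort_both(std::vector<int>& v, std::list<int>& l) {
    std::sort(v.begin(), v.end());    // introsort over random-access iterators
    // std::sort(l.begin(), l.end()); // error: list iterators are bidirectional
    l.sort();                         // merge sort via relinking nodes
}
```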

Conclusion

Linked lists don’t make much sense on modern processors. Their strength is that they represent abstract sequences, unconcerned with where the elements are. But modern processors rely on data being organized and predictable to avoid twiddling their thumbs waiting for memory to respond. Vectors represent the simple mappings from sequences to locations that processors need to retrieve data quickly. Maintaining a contiguous array often means more work upfront, but less time waiting [9].

You can find a use case for anything if you look hard enough [10], but you’re much more likely to regret using a linked list than a vector. Optimizing for “insertion” without considering other operations is often a mistake. If maintaining a vector is too cumbersome, there are ways of lowering the costs. There’s no way to make a list not cause cache misses; causing them is what lists do. Before you reach for a list, it’s usually a better idea to ask yourself how you could make a vector work, and then do that instead.

  [1] In this post, I use C++ terminology. std::vector is the C++ standard library’s dynamic array and std::list is a doubly-linked list. Other languages might use different names. For example, a Python list is a vector of pointers.
  [2] For simplicity, I only talk about insertion, but removal has similar performance considerations.
  [3] Each container started empty and was filled by inserting elements from a pre-computed random sequence in sorted order. All benchmarks in this post were performed on an Intel i9-9900K CPU.
  [4] On a log-log plot, “linear growth” is indicated by the slope of 1, not merely the appearance of a straight line. Quadratic growth would appear as a straight line with a slope of 2.
  [5] std::list also fills the cache levels early due to the extra memory for the links. With 4-byte elements, the total memory consumption is six times the size of the data (4 B data + 8 B predecessor link + 8 B successor link + 4 B padding).
  [6] The single std::list insertion benchmark with four million 4-byte elements took three days to complete.
  [7] The sorting benchmarks use a larger maximum size than the insertion benchmarks because they don’t take several days to finish.
  [8] std::sort() changes behavior between 16 and 32 elements because this is where introsort switches from insertion sort to quicksort.
  [9] As I pointed out in my previous post, “doing extra work is beneficial if that work can be done in an amount of time that would otherwise have been wasted.”
  [10] A non-performance reason to use std::list might be to prevent iterator invalidation: inserting or erasing elsewhere in a list leaves other iterators valid.