Look, I know it’s the fastest way to copy data from one location to another[1]. But hear me out:
- It breaks your code.
- If it doesn’t, you were probably going to get the speed anyway.
The problem is that memcpy copies raw bytes, regardless of what those bytes represent. This is called a “shallow copy.” Almost always, what you really want is a “deep copy.” When memcpy works, it’s usually because a shallow copy happens to be the same as a deep copy, not because memcpy is an honest representation of what you meant to do.
Take the following struct:
```cpp
struct Person {
    static constexpr int MAX_NAME_LENGTH = 32;
    int age;
    char name[MAX_NAME_LENGTH];
};
```
This defines Person as a trivial data type. If we want to copy an array of Person objects, either memcpy or std::ranges::copy will work.
```cpp
auto people = std::array<Person, NUM_PEOPLE>{};
// ...
auto people_copy = std::array<Person, NUM_PEOPLE>{};

std::memcpy(people_copy.data(), people.data(), sizeof(people));
// OR
std::ranges::copy(people, std::begin(people_copy));
```
But when we eventually decide that storing Person::name in a fixed-size array was a terrible idea, we’ll want to replace it with a string.
```cpp
struct Person {
    int age;
    std::string name;
};
```
std::ranges::copy still does the right thing, but now memcpy produces a “copy” whose name member points to the original’s underlying array. At minimum, this causes a double free when the objects go out of scope[2].
So why not take memcpy’s speed for trivial data, and update the code for deep copies when needed? Firstly, that update is easier said than done in a large codebase: you might have a giant mess to clean up. Secondly, you don’t need to explicitly write memcpy to get a call to memcpy. Optimizing compilers are smart enough to do that for you. The latest GCC and Clang can replace std::ranges::copy with memcpy at -O1[3][4].
Even when pointer aliasing forces the compiler to use memmove instead, you won’t see a difference. The distinction between memcpy and memmove is that calling memcpy on overlapping arrays is undefined behavior. “Undefined behavior” means anything can happen, which includes just calling memmove in the first place. Every platform I checked treats memcpy as an alias for memmove[5].
memcpy falls into the category of “failed micro-optimizations that only make life harder.” It won’t make your program faster, but it will be another thing you need to update while reorganizing your data structures. It’s especially counterproductive when a trivial layout pollutes the cache with seldom-used data. Moving that data out-of-line might speed things up a lot, but you can’t do that if every function requires a trivial type. It’s tempting to try to maximize performance for a known data layout, but that may just lock you into using an inefficient data structure with no tangible benefit. Write what you actually intend to do first, and don’t micro-optimize unless you know you need to. The more you couple your algorithms to your data structures, the harder it is to fix them later.
So please… stop using memcpy.
1. On my machine, memcpy is up to 90 times faster than an unoptimized for-loop.
2. Assuming the Small String Optimization (SSO) doesn’t apply. If it does, it will hide the bug.
3. https://godbolt.org/z/sqhYoP4KG.
4. I’ve been told, “We can’t turn on optimization because of undefined behavior.” If that’s the case, then you don’t have a C++ program. Period.
5. On my Linux (x86_64) machines, memcpy gets linked to __memmove_avx512_unaligned_erms or __memmove_avx_unaligned_erms, depending on the CPU (you might see __memcpy..., but these are aliases of the __memmove... versions, which you can confirm by comparing their addresses). On Windows (x86_64) and macOS (ARM) machines, I got ntdll!memmove and _platform_memmove, respectively. Note that this does not excuse using memcpy in an undefined manner — it just means that forcing a call to memcpy instead of memmove won’t improve performance.