
memcpy vs memmove

A few notes about memcpy vs memmove, plus some related items.

memcpy

The C standard specifies two functions for copying memory regions, memcpy and memmove. The important difference is that it is undefined behavior to call memcpy with overlapping regions. One must use memmove for that. As the names imply, memcpy copies data from one region to another, while memmove moves data within a region. (It’s also perfectly acceptable to memmove between different regions.)

This subtle but important distinction allows memcpy to be optimized more aggressively. In the case of memmove between overlapping regions, care must be taken not to destroy the contents of the source before they have been copied. This is easiest to see with a naive copy loop.

int arr[5] = { 11, 22, 33, 44, 55 };

/* shift array backwards one element */
for (int i = 0; i < 4; i++)
	arr[i] = arr[i + 1];
/*
 * this works fine
 * arr == { 22, 33, 44, 55, 55 }
 */

/* now try shifting array forwards */
for (int i = 0; i < 4; i++)
	arr[i + 1] = arr[i];
/*
 * uh, oh, we destroyed the source along the way
 * arr == { 22, 22, 22, 22, 22 }
 * the desired result was instead
 * arr == { 22, 22, 33, 44, 55 }
 */

In order to operate as desired, the second loop must be changed to for (int i = 3; i >= 0; i--). memmove must implement both loops and decide between them at runtime.
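
A naive memmove can be built from those two loops, choosing between them at runtime. Here is a minimal sketch for illustration (my_memmove is a hypothetical name, not the libc code; real implementations live in asm, not least because comparing pointers into unrelated objects is itself undefined in C):

#include <stddef.h>

void *
my_memmove(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	if (d < s) {
		/* destination below source: the forward loop is safe */
		for (size_t i = 0; i < n; i++)
			d[i] = s[i];
	} else {
		/* destination above source: copy backwards to preserve the source */
		for (size_t i = n; i > 0; i--)
			d[i - 1] = s[i - 1];
	}
	return dst;
}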

bcopy

BSD systems also have another copying function, bcopy. It has the same semantics as memmove, but the source and destination arguments are reversed. It also has one other property: the compiler doesn’t know about it. The compiler can optimize and inline calls to standard C functions, but can’t do so when they are misspelled.
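
For reference, the prototypes; note that the source and destination arguments are swapped between the bcopy family and the mem family:

void bcopy(const void *src, void *dst, size_t len);
void *memcpy(void *dst, const void *src, size_t len);
void *memmove(void *dst, const void *src, size_t len);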

Some time ago, dlg@ noticed that a couple of the bcopy calls in the network stack were impacting packet forwarding performance. An ethernet address is only six bytes. The compiler could easily inline such a copy, but instead the source code required a call to the bcopy function, which would then perform various overlap tests, etc. Replacing the bcopy calls with calls to memcpy sped things up considerably.

And so began a slow conversion of the many bcopy calls in the kernel to memcpy. This requires inspecting the function arguments to make sure they don’t overlap. If they do, then memmove must be used instead. Before this operation started, there was not a single memmove in the kernel, as evidenced by the fact that several architectures lacked implementations.

libkern

Two things to note about the kernel implementation of these functions. First, they are almost always implemented in assembly for performance. Sometimes, though, that results in some superultraoptimized code, such as the miracle of modern engineering that is arm memcpy.S. Wowzers. For large copies, finding an optimized path is worth it. But all that code burns icache and causes branch mispredictions. No wonder then that having the compiler inline calls works better than dropping into this behemoth. (For the bold, you may also gaze into the abyss that is sparc64/locore.s for another take on these functions.)

Second, on many architectures, the functions share code and overlap. The amd64 memmove.S is easier to read. Long ago, all three functions (memcpy, memmove, bcopy) were the same. And frequently implemented in terms of bcopy. This meant that every call to memcpy or memmove had to reverse the arguments. In accordance with our preference for new code to use the standard functions, this was changed to have bcopy reverse the arguments and fall into memmove. More recently, we’ve also changed the memcpy entry point to occur after the overlap check (but not yet everywhere, as the arm example shows).
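
In C terms, that arrangement is roughly the following (a sketch of the control flow only, not the actual asm; the memcpy entry point past the overlap check doesn't translate to C):

#include <string.h>

void
bcopy(const void *src, void *dst, size_t len)
{
	/* reverse the arguments and fall into memmove */
	memmove(dst, src, len);
}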

memmove

As mentioned, we’re changing bcopy to memcpy and the occasional memmove. Sometimes that goes wrong, and a memcpy is introduced where memmove is needed. That happened in the pppoe code. A MAC address of all ‘A’ or ‘F’? Almost as if memcpy copied two overlapping addresses, as in the example above. Sure enough, reverting to bcopy solves the problem. Or switching to memmove. In this case, the bcopy to memcpy conversion seemed safe because the destination was freshly allocated by M_PREPEND, and therefore the pointers would be unique. But it turns out the source was actually part of the mbuf to start with, and had been chopped off with m_adj earlier in the function. Very sneaky.

nop

Not long ago, reyk@ noticed that while we use asm versions of these functions in the kernel, libc is built with the stock C versions. Switching to the asm versions gives a nice performance boost. But then something very strange happened. mlarkin@ had an amd64 ramdisk kernel crashing at boot. In bcopy. A poorly assembled version of bcopy.

Normally, bcopy is supposed to look something like this:

ffffffff811a7060 <bcopy>:
ffffffff811a7060:       48 87 fe                xchg   %rdi,%rsi
ffffffff811a7063:       90                      nop
ffffffff811a7064:       90                      nop
ffffffff811a7065:       90                      nop
ffffffff811a7066:       90                      nop
ffffffff811a7067:       90                      nop
ffffffff811a7068:       90                      nop
ffffffff811a7069:       90                      nop
ffffffff811a706a:       90                      nop
ffffffff811a706b:       90                      nop
ffffffff811a706c:       90                      nop
ffffffff811a706d:       90                      nop
ffffffff811a706e:       90                      nop
ffffffff811a706f:       90                      nop
ffffffff811a7070 <memmove>:
ffffffff811a7070:       49 89 fb                mov    %rdi,%r11

xchg the source and destination registers, and then “nop sled” into memmove. But Mike had a special edition bcopy:

ffffffff811a7060 <bcopy>:
ffffffff811a7060:       48 87 fe                xchg   %rdi,%rsi
ffffffff811a7063:       90                      nop
ffffffff811a7064:       90                      nop
ffffffff811a7065:       ff                      (bad)
ffffffff811a7066:       ff 90 90 ff ff 90       callq  *0xffffffff90ffff90(%rax)
ffffffff811a706c:       90                      nop
ffffffff811a706d:       ff                      (bad)
ffffffff811a706e:       ff 90 49 89 fb 48       callq  *0x48fb8949(%rax)
ffffffff811a7070 <memmove>:
ffffffff811a7070:       49 89 fb                mov    %rdi,%r11

Well, that’s certainly different. Perhaps most awesome is that the last callq in bcopy actually swallows the first instruction of memmove (49 89 fb) into its operand, but that’s just coincidence. But how is this possible? The kernel is built from entirely separate sources, unshared with libc. A switch to asm memcpy/memmove in libc can’t possibly affect the same functions in the kernel. Can it?

gas

Reverting to a previous build of libc and recompiling the kernel corrected the problem, so the bug was actually in userland. millert@ quickly found and backported two memcpy fixes for gas. Surely that would fix it. But no. The problem was in the libc asm commit, but not in memcpy or memmove.

memset

After deraadt@ deduced through some trial and error that memset was the buggy function, naddy@ noticed that the FreeBSD version of this function had one tiny difference. Another diff for the one-liner hall of fame.

diff -u -r1.3 -r1.4
--- src/sys/lib/libkern/arch/amd64/memset.S	2007/11/24 19:28:25	1.3
+++ src/sys/lib/libkern/arch/amd64/memset.S	2014/11/21 21:30:50	1.4
@@ -8,6 +8,7 @@
 
 ENTRY(memset)
 	movq	%rsi,%rax
+	andq	$0xff,%rax
 	movq	%rdx,%rcx
 	movq	%rdi,%r11

The second argument to memset is of type int, but is expected to be interpreted as an unsigned char. By default, however, the register it’s passed in will be sign extended. When gas pads code with nop (0x90) bytes, 0x90 turns into 0xffffff90 and everything goes sideways. How could a bug like this live in the kernel for so long? The kernel only ever calls memset with three fill values: 0, ' ', and 0xff. The first two are positive values, and the last sign extends into itself.
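
To illustrate the sign extension, assuming char is signed as it is on amd64 (fill_nops is a hypothetical stand-in for what an assembler does when padding code):

#include <string.h>

void
fill_nops(char *buf, size_t len)
{
	int v = '\x90';	/* char is signed here, so v == 0xffffff90, not 0x90 */

	/*
	 * a correct memset uses only (unsigned char)v, i.e. 0x90;
	 * the buggy asm used the whole register to build the fill pattern
	 */
	memset(buf, v, len);
}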

cld

Related fun fact: the x86 architectures have a direction flag that can be set to cause the processor to run backwards. This is how the backwards copy used by memmove for overlapping regions is implemented. It’s important for the operating system to keep this flag set to the correct value. If userland sets the flag and then traps into the kernel, it would not be good for the kernel to run in reverse. 900 years ago, bugs of this nature could be exploited for some hilarity. Until very recently, the asm versions of memcpy were somewhat paranoid about always clearing the direction flag in case it had somehow been forgotten. But guenther@ did an audit and fixed up all the remaining entry points where it might not be set, and then the unnecessary instruction could be removed.

libc

After the initial libc asm commit was reverted, deraadt@ spent some time polishing and improving it, mostly untangling the complicated libc build infrastructure. The Makefiles are designed to support asm or C versions of each function, but sometimes they have to be implemented in pairs or with dummy stub files or else too many functions get compiled. And the dependencies were generally wrong. Part of the great restructuring involved temporarily changing all architectures to a special memcpy implementation. It checks for overlap like memmove would, but instead of processing the data, it logs a message and aborts. All within C spec. This will help us verify that programs are using memcpy correctly and memmove where they must. Similar changes should be coming to the kernel. So far a few problems have been identified, including a bug involving overlapping snprintf buffers, some in expect, and another in vi.
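
A sketch of what such a checking memcpy might look like (hypothetical name and simplified logging; the real one lives in libc):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void *
checking_memcpy(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* the same overlap test memmove would perform */
	if (s < d + len && d < s + len) {
		fprintf(stderr, "memcpy with overlapping regions\n");
		abort();	/* the call was undefined anyway, so this is within spec */
	}
	return memcpy(dst, src, len);
}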

Some of the more surprising bugs caught weren’t just overlaps, but overflows. One in tar and one in postgresql that’s even more interesting.

Somewhat relatedly, passing a null pointer to memcpy is also undefined, even when the length is zero.
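
Even this contrived call is undefined, because both pointer arguments must be valid whatever the length:

#include <string.h>

void
contrived(char *buf)
{
	/* undefined: a null pointer is never valid here, even for zero bytes */
	memcpy(buf, NULL, 0);
}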

Posted 01 Dec 2014 17:16 by tedu Updated: 12 May 2016 23:36
Tagged: c openbsd programming