Over-aggressive GCC optimization can cause SIGBUS crash when using memcpy with the Android NDK

At work we’ve been developing new Android hardware, so I’ve been porting a lot of our existing C/C++ code to Android using the NDK, a collection of GNU build tools (gcc, objcopy, etc.) and associated scripts that aid the development of native C/C++ code on Android. One of our projects has nearly 1000 source files, which can make it a bit of a headache to debug. For a while I’d had a problem where this project would crash when built in release mode but work fine when built for debug. Unable to spend the time to figure out what was going on, I’d simply been using the debug build of the code. Recently, however, the performance cost became too great and I needed to get to the bottom of it. Interestingly enough, the issue wasn’t uninitialized memory (as is usually the case with debug/release inconsistencies) but rather a bug in GCC 4.4.3’s optimization of memcpy.

Since copying memory is one of the most basic functions of a CPU, if not the most basic, most compilers have a built-in implementation of memcpy that can harness some of the more esoteric instructions of the target architecture. One approach is to align to the CPU’s basic unit of operation and then use an instruction that operates quickly on large chunks of data. Instead of copying 9 bytes one at a time, the compiler’s memcpy may copy one byte and then two 4-byte values, considerably reducing the number of instructions. Even the libc implementations of memcpy attempt to do this, but of course they cannot do it in as few instructions as a compiler, which knows the exact format and layout of the objects it’s operating on as well as the capabilities of the target architecture (you can view the standard Android memcpy here). Presented for your consideration:

Pretty straightforward code. For those of you unfamiliar with the NDK, it uses JNI to hook into the Java layer of Android. The above function is called explicitly from a simple Java-mode application. We declare a few struct types, one of which contains instances of the other, and some static memory. data simulates some generic memory (in our case, data we received from a hardware peripheral), parent is our target memory, and parentptr points one byte into data. Again, straightforward.
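To make the layout concrete, here is a minimal sketch of declarations of the kind described above. The field names and sizes are assumptions, not the original listing; they are picked only so that each child is 36 bytes and the third child sits 76 bytes into the parent, matching the offsets discussed further down. uint32_t is used for the 4-byte member so the sizes hold regardless of platform.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: names and sizes are illustrative assumptions. */
typedef struct {
    char     name[23];    /* 23 bytes, then 1 byte of compiler padding */
    uint32_t value;       /* 4-byte member: makes the struct 4-byte aligned */
    uint8_t  payload[8];
} child_t;                /* 23 + 1 + 4 + 8 = 36 bytes */

typedef struct {
    uint32_t header;      /* 4 bytes */
    child_t  child1;      /* offset 4 */
    child_t  child2;      /* offset 40 */
    child_t  child3;      /* offset 4 + 36 + 36 = 76 */
} parent_t;

_Static_assert(sizeof(child_t) == 36, "child_t is expected to be 36 bytes");
_Static_assert(offsetof(parent_t, child3) == 76, "child3 is expected at offset 76");

/* Simulated peripheral data (4-byte aligned) and the destination object. */
_Alignas(4) unsigned char data[256];
parent_t parent;

/* Byte address one byte into data, i.e. where parentptr would point.
 * It is kept as a byte pointer here so that no misaligned struct
 * pointer is ever formed. */
const unsigned char *parent_bytes(void)
{
    return data + 1;
}
```

With this layout, the third child starts at parent_bytes() + 76, which is data + 77.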

This code will cause a SIGBUS when built with -O2 (or even -O1) on GCC 4.4.x and 4.5.x. SIGBUS is the quieter, less frequently seen cousin of SIGSEGV. Among other causes, it is raised when the CPU is instructed to perform an operation on memory that’s not properly aligned. This is hardly ever seen on mainstream x86 architectures, since x86 has instructions for both aligned and unaligned access for all types. Embedded architectures (such as ARM, used by most Android hardware) often don’t, however, which can lead to nasty, non-obvious failures at run-time.
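As a small illustration of what “properly aligned” means here, the following sketch tests whether an address is a multiple of a given alignment. The helper name is_aligned and the buffer aligned_buf are made up for this example, and the check assumes a flat address space where converting a pointer to uintptr_t yields the numeric address, which is the case on the usual ARM and x86 targets.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if ptr is a multiple of alignment, else 0.
 * alignment must be a power of two (1, 2, 4, 8, ...). */
int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* A buffer whose first byte is guaranteed to sit on an 8-byte boundary. */
_Alignas(8) unsigned char aligned_buf[16];
```

Given that buffer, aligned_buf and aligned_buf + 4 are both valid addresses for a 4-byte load, while aligned_buf + 1 is not; on hardware without unaligned access support, a word load from the latter is the kind of operation that ends in SIGBUS.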

Consider again our discussion of memcpy optimizations. A simple optimization would be to replace the function call to memcpy with an inlined version. However, suppose we knew the value of the size parameter at compile-time. We could output code that never branches or loops, because we could simply emit as many copy instructions as size requires. This speeds up execution because the pipeline never stalls on a branch (there is no comparison against unknown values, so we can say deterministically how the code will execute). But suppose we ALSO know the layout of the source and destination types. We would then not have to do a byte-by-byte (or word-by-word) copy, but could instead use the specific copy instructions that exist for the types of the members of source and dest. If source and dest are of the same type, we are essentially describing operator=. Let’s take a look at the disassembly of the code a bit back, first in debug (-O0) mode:

Lines 8-22 simply load the child3 pointer. Lines 24-38 prepare the stack with the various arguments to memcpy, and line 3a actually jumps to memcpy. Now, let’s see what happens when we build with NDK_DEBUG=0 (-O2):

Hmm, no jump. Lines 12-18 seem especially interesting. ldmia stands for “Load Multiple, Increment After”, and it is an ARM instruction saying: “Take the address in r2, and load the value at that address into r0. Increment the address in r2, then load the value at that address into r4. Increment the address in r2…, etc. etc.”. Line 14 then takes these values and does the opposite, storing them to the address in r3 (and incrementing that address as well). The actual SIGBUS occurs on line 12. r2 contains the address of source, the second parameter to memcpy. The problem is that ldmia is only valid when the address in its argument is aligned to a 4-byte boundary. But this is not the case: child3 is at ((char*)parentptr) + 4 + size + size, or 76 bytes into parentptr. But parentptr is 1 byte into data, so child3 is at &data[77], which is not 4-byte aligned!
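The behaviour described above can be mimicked with a toy model in C. This is only a sketch of the description in the previous paragraph, not of the real instruction’s full semantics; the function name model_ldmia and the buffer sample are invented for the example. It refuses to load from an address that is not a multiple of 4, which is the point where hardware without unaligned access support would fault.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model of a load-multiple: read count consecutive 32-bit words
 * starting at *addr into regs, advancing *addr after each word.
 * Returns 0 on success, or -1 (leaving *addr untouched) if the start
 * address is not 4-byte aligned. */
int model_ldmia(const unsigned char **addr, uint32_t *regs, size_t count)
{
    if ((uintptr_t)*addr % 4 != 0)
        return -1;                                /* would be SIGBUS */
    for (size_t i = 0; i < count; i++) {
        memcpy(&regs[i], *addr, sizeof regs[i]);  /* read one word */
        *addr += sizeof regs[i];                  /* "increment after" */
    }
    return 0;
}

/* Stand-in for data: a buffer starting on a 4-byte boundary. */
_Alignas(4) unsigned char sample[128];
```

Starting the model at sample + 76 succeeds, because 76 is a multiple of 4. Starting it at sample + 77, the offset worked out above, fails, because 77 leaves a remainder of 1 when divided by 4.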

GCC has felt at liberty to treat this memcpy as more or less equivalent to an operator= since it a) knows the types of source and dest and b) knows the value of size at compile-time. The compiler has used static analysis to determine that while the types appear to be different, they “really” aren’t: dest is a child_t* cast to void*, and source is a child_t* cast to char* cast to void*. It has decided that, because of this, it is safe to perform operator= logic. But this is a false positive. source, while derived from a child_t*, is never actually dereferenced as such. Dereferencing it would indeed cause a SIGBUS even in debug mode, but the code never does such a thing. GCC, however, has felt free to impose this further restriction on its built-in memcpy: it will automatically dereference pointers to aligned types even if the programmer explicitly never did so. In my opinion GCC is over-zealous in this case, since memcpy is used exactly when a byte-by-byte copy is preferred over a member-by-member copy. The optimization is certainly clever, but also too clever; it silently imposes further restrictions on memcpy exactly when a programmer may be using it to avoid such restrictions in other use cases!
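One way to sidestep the issue, sketched here under assumptions, is to never let the source pass through a child_t* at all: keep the raw memory as unsigned char* and compute the member’s position with offsetof. The struct fields below are hypothetical stand-ins (sized so that a child is 36 bytes and the third child is at offset 76), and copy_child_from_bytes is a made-up helper name. Because the source expression only ever has byte-pointer type, the compiler has no type-derived alignment to rely on.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout, chosen only to reproduce the 36 and 76 byte offsets. */
typedef struct {
    char     name[23];
    uint32_t value;
    uint8_t  payload[8];
} child_t;

typedef struct {
    uint32_t header;
    child_t  child1;
    child_t  child2;
    child_t  child3;
} parent_t;

/* Copy the child that starts offset bytes into raw.
 * raw + offset stays a byte pointer, so it may have any alignment. */
void copy_child_from_bytes(child_t *dst, const unsigned char *raw, size_t offset)
{
    memcpy(dst, raw + offset, sizeof *dst);
}
```

For the scenario in this post, the call would look like copy_child_from_bytes(&parent.child3, data, 1 + offsetof(parent_t, child3)), where the 1 accounts for the structure starting one byte into data.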

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

3 thoughts on “Over-aggressive GCC optimization can cause SIGBUS crash when using memcpy with the Android NDK”

  1. JFYI, the compiler is in fact behaving correctly here (where by “correctly” I mean “according to the C language specification”). By using a value of the parent_t* type, you’re declaring to the compiler that that value is correctly aligned for that type — in this case, 4-byte alignment due to the “unsigned long” field in child_t — so the compiler is allowed to make the optimization that it’s making. From the C99 specification, 6.3.2.3.7: “A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. If the resulting pointer is not correctly aligned for the pointed-to type, the behavior is undefined.”

    If you meant for the structure to have a 23-byte field immediately followed (without padding) by an aligned 4-byte integer, you need to add a 1-byte padding field at the beginning of the structure, which will allow you to use a properly aligned pointer for the structure itself in this example. With GCC, you could also put __attribute__((packed)) on the structure, which would presumably prevent GCC from doing this optimization.
