At work we’ve been developing new Android hardware, and as such I’ve been porting a lot of our existing C/C++ code to Android using the NDK, a collection of GNU build tools (gcc, objcopy, etc.) and associated scripts that aid the development of native C/C++ code on Android. One of our projects spans nearly 1000 source files and can therefore be a bit of a headache to debug. For a while I’d had a problem where this project would crash when built in release mode but work fine when built for debug. Unable to spend the time to figure out what was going on, I’d simply been using the debug build. Recently, however, the performance cost became too great and I needed to get to the bottom of it. Interestingly enough, the issue wasn’t uninitialized memory (as is usually the case with debug/release inconsistencies) but rather a bug in GCC 4.4.3’s optimization of memcpy.
Since copying memory is one of, if not the, most basic functions of a CPU, most compilers have a built-in implementation of memcpy that can harness some of the more esoteric instructions of the target architecture. One approach is to align to the basic unit of operation of the CPU and then use instructions that operate quickly on large chunks of data. Instead of copying 9 bytes one at a time, the compiler’s memcpy may copy one byte and then two 4-byte values, resulting in a considerable reduction in instructions. Even the libc implementations of memcpy attempt to do this, but of course they cannot do it in as few instructions as the compiler, which knows exactly the format and layout of the objects it’s operating on as well as the capabilities of the target architecture (you can view the standard Android memcpy here). A hand-rolled sketch of that chunking idea follows the example below. Presented for your consideration:
#include <string.h>

typedef struct {
    unsigned char byte[23];
    unsigned long number;
} child_t;

typedef struct {
    child_t children[3];
} parent_t;

static unsigned char data[500];
static volatile parent_t parent;
static const parent_t* parentptr = (const parent_t*)(data + 1);
static const unsigned long size = sizeof(child_t);

void Java_com_opendoorstudios_memcpycrash_jni_main(void* env, void* clazz)
{
    const child_t * child3 = (const child_t*)( (const char*)(parentptr) + 4 + size + size);
    memcpy((void*)&parent.children[2], (const void*)child3, size);
}
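First, here is that hand-rolled sketch of the chunking idea: copying 9 bytes as one single-byte move plus two 4-byte moves instead of nine single-byte moves. This is only an illustration of mine (the copy9 name and the uint32_t casts are not from the project, and a real compiler does this at the instruction level, not in C), and it is only safe when the word-sized accesses land on 4-byte boundaries:

#include <stdint.h>

/* Illustration only: copy 9 bytes as 1 byte + two 4-byte words. A compiler's
   built-in memcpy performs this kind of chunking at the instruction level;
   the word-sized accesses are only safe if dst + 1 and src + 1 are
   4-byte aligned. */
static void copy9(unsigned char *dst, const unsigned char *src)
{
    dst[0] = src[0];                                        /* 1 byte, reaches alignment */
    *(uint32_t *)(dst + 1) = *(const uint32_t *)(src + 1);  /* 4 bytes */
    *(uint32_t *)(dst + 5) = *(const uint32_t *)(src + 5);  /* 4 bytes */
}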
The example itself is pretty straightforward code. For those of you unfamiliar with the NDK, it uses JNI to hook into the Java layer of Android; the function above is called explicitly from a simple Java application. We declare a few struct types, one of which contains instances of the other, and some static memory. data is used to simulate some generic memory (in our case, data we received from a hardware peripheral), parent is our target memory, and parentptr points one byte into data. Again, straightforward.
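As a quick sanity check of the layout, here is a small test program of my own (not part of the project); the numbers assume a 32-bit ARM EABI build where unsigned long is 4 bytes:

#include <stdio.h>

/* child_t as defined in the example above. */
typedef struct {
    unsigned char byte[23];
    unsigned long number;
} child_t;

int main(void)
{
    /* On 32-bit ARM EABI, byte[] is padded out to offset 24 and number
       occupies bytes 24-27, so sizeof(child_t) is 28. */
    printf("sizeof(child_t) = %u\n", (unsigned)sizeof(child_t));

    /* parentptr is data + 1 and child3 is parentptr + 4 + size + size,
       i.e. data + 1 + 4 + 28 + 28 = data + 61 (not a multiple of 4). */
    printf("offset of child3 into data = %u\n",
           (unsigned)(1 + 4 + 2 * sizeof(child_t)));
    return 0;
}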
This code will cause a SIGBUS when built with -O2 (or even -O1) on GCC 4.4.x and 4.5.x. SIGBUS is sort of the quiet, less-seen cousin of SIGSEGV. It occurs when the CPU is instructed to perform an operation on memory that is not properly aligned. This is hardly ever seen on mainstream x86 architectures, since x86 has instructions for both aligned and unaligned access for all data types. Embedded architectures (such as ARM, used by most Android hardware) often don’t, however, which can lead to nasty, non-obvious failures at run-time.
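To see the failure mode in isolation, here is a minimal sketch of my own of an unaligned access. On an ARM core or kernel configuration that traps unaligned loads, dereferencing p can raise SIGBUS; on x86 the same code simply works, which is part of why these bugs hide so well:

#include <stdio.h>

static unsigned char buf[8];

int main(void)
{
    /* buf + 1 is not 4-byte aligned. This dereference is undefined
       behaviour; on ARM cores that trap unaligned loads it can raise
       SIGBUS, while x86 quietly performs the unaligned access. */
    unsigned long *p = (unsigned long *)(buf + 1);
    printf("%lu\n", *p);
    return 0;
}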
Consider again our discussion of memcpy optimizations. A simple optimization would be to replace the call to memcpy with an inlined version. However, suppose we knew the value of the size parameter at compile-time. We could then emit code that never branches or loops, because we could simply emit size copy instructions instead. This greatly speeds up execution because the pipeline never stalls (there is no comparison against unknown values, so we can say deterministically how the code will execute). But suppose we ALSO know the layout of the types of the source and destination. We would then not have to do a byte-by-byte (or word-by-word) copy, but could instead use the specific copy instructions that exist for the types of the members of source and dest. If source and dest are of the same type, we are essentially describing operator=. Let’s take a look at the disassembly of the code above, first in debug (-O0) mode:
0: b500 push {lr}
2: b085 sub sp, #20
4: 9001 str r0, [sp, #4]
6: 9100 str r1, [sp, #0]
char * child3 = ((char*)parentptr) + 4 + size + size;
8: 4b0e ldr r3, [pc, #56]
a: 447b add r3, pc
c: 681b ldr r3, [r3, #0]
e: 1c1a adds r2, r3, #0
10: 4b0d ldr r3, [pc, #52]
12: 447b add r3, pc
14: 6819 ldr r1, [r3, #0]
16: 4b0d ldr r3, [pc, #52]
18: 447b add r3, pc
1a: 681b ldr r3, [r3, #0]
1c: 18cb adds r3, r1, r3
1e: 3304 adds r3, #4
20: 18d3 adds r3, r2, r3
22: 9303 str r3, [sp, #12]
memcpy((void*)&parent.children[2],(void*)child3,size);
24: 4b0a ldr r3, [pc, #40]
26: 447b add r3, pc
28: 1c19 adds r1, r3, #0
2a: 3138 adds r1, #56
2c: 4b09 ldr r3, [pc, #36]
2e: 447b add r3, pc
30: 681b ldr r3, [r3, #0]
32: 9a03 ldr r2, [sp, #12]
34: 1c08 adds r0, r1, #0
36: 1c11 adds r1, r2, #0
38: 1c1a adds r2, r3, #0
3a: f7ff fffe bl 0
Lines 8-22 simply load the child3 pointer. Lines 24-38 set up the arguments to memcpy in r0-r2, and line 3a actually branches to memcpy. Now, let’s see what happens when we build with NDK_DEBUG=0 (-O2):
0: b530 push {r4, r5, lr}
char * child3 = ((char*)parentptr) + 4 + size + size;
memcpy((void*)&parent.children[2],(void*)child3,size);
2: 4b07 ldr r3, [pc, #28]
4: 4907 ldr r1, [pc, #28]
6: 447b add r3, pc
8: 681a ldr r2, [r3, #0]
a: 4479 add r1, pc
c: 3138 adds r1, #56
e: 1c0b adds r3, r1, #0
10: 323c adds r2, #60
12: ca31 ldmia r2!, {r0, r4, r5}
14: c331 stmia r3!, {r0, r4, r5}
16: ca13 ldmia r2!, {r0, r1, r4}
18: c313 stmia r3!, {r0, r1, r4}
1a: 6812 ldr r2, [r2, #0]
1c: 601a str r2, [r3, #0]
Hmm, no call to memcpy at all. Lines 12-18 are especially interesting. ldmia stands for “Load multiple, increment after”, and it is an ARM instruction saying: “Take the address in r2 and load the value there into r0. Increment the address in r2, then load the value at that address into r4. Increment the address in r2…, etc. etc.”. Line 14 then does the opposite, storing those values to the address in r3 (incrementing it as it goes). The actual SIGBUS occurs on line 12. r2 contains the address of source, the second parameter to memcpy. The problem is that ldmia is only valid when the address in its operand is aligned to a 4-byte boundary, and that is not the case here: child3 is ((char*)parentptr) + 4 + size + size, or 60 bytes into parentptr. But parentptr is 1 byte into data, so child3 = &data[61], which is not 4-byte aligned!
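Put back into C, the optimized sequence amounts to something like the sketch below. This is an approximation of mine rather than GCC’s literal output (the builtin_like_copy28 name is invented, and the real code is fully unrolled into the ldmia/stmia pairs): with the size and the types known at compile time, the built-in memcpy degenerates into word-sized copies that assume both pointers are 4-byte aligned.

/* Approximation of the -O2 output: seven 4-byte copies, the C-level
   equivalent of the ldmia/stmia pairs above. Each copy assumes its operands
   are 4-byte aligned, which child3 (data + 61) is not. */
static void builtin_like_copy28(void *dstp, const void *srcp)
{
    unsigned long *dst = (unsigned long *)dstp;
    const unsigned long *src = (const unsigned long *)srcp;  /* assumed aligned! */
    int i;
    for (i = 0; i < 7; ++i)   /* 7 * 4 bytes = 28 = sizeof(child_t) */
        dst[i] = src[i];      /* the first such load is the ldmia at line 12 */
}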
GCC has felt at liberty to treat this memcpy as more or less equivalent to operator= since it a) knows the types of source and dest and b) knows the value of size at compile-time. The compiler has used static analysis to determine that while the types appear to be different, they “really” aren’t: dest is a child_t* cast to void*, and source is a child_t* cast to char* cast to void*. It has decided that, because of this, it is safe to perform operator= logic. But this is a false positive. source, while derived from a child_t*, is never actually dereferenced as one; dereferencing it that way would indeed cause a SIGBUS even in debug mode, but the code never does such a thing. GCC has nevertheless felt free to impose this further restriction on its built-in memcpy: it will dereference pointers as if they were aligned to their nominal types even though the programmer explicitly never did so. In my opinion GCC is over-zealous here, since memcpy is used precisely when a byte-by-byte copy is preferred over a member-by-member copy. The optimization is certainly clever, but it is too clever: it silently imposes extra restrictions on memcpy in exactly the situations where a programmer may be reaching for memcpy to avoid them!
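For what it’s worth, here is a sketch of two ways one might sidestep the builtin under the same assumptions (neither is guaranteed for every GCC version, and the memcpy_unaligned name is mine): build with -fno-builtin-memcpy so the call always goes out to libc, or launder the source pointer through a volatile variable so the compiler can no longer prove anything about its origin or alignment.

#include <string.h>

/* Sketch of a workaround: reading the pointer back from a volatile variable
   stops the compiler from propagating what it knows about the pointer's
   origin (and therefore its alignment), so the copy should stay byte-safe.
   Building with -fno-builtin-memcpy is the blunter alternative. */
static void memcpy_unaligned(void *dst, const void *src, size_t n)
{
    const void * volatile laundered = src;
    memcpy(dst, laundered, n);
}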
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.