A word on multithreading

A lot of people have asked why nds4droid only uses 50% of their CPU on quad-core devices. The reason is fairly straightforward: nds4droid has two main threads, one that does the actual emulation and one that does all the drawing/compositing/input processing. The “main” emulation thread is the bottleneck and the reason the emulator can be slow; we’re only doing the “heavy lifting” on one core. When emulating the DS we have to emulate the CPU instruction by instruction, and spreading this across several threads is an extremely complicated (if not impossible) proposition.

However, there is some low-hanging fruit for parallelization: the DS is actually made up of two CPUs, the main ARM9 and an auxiliary ARM7 (the same CPU as in the GBA; this is how backwards compatibility was achieved). The ARM7 does sound processing and some 2D graphics work. Currently, on the main thread, we emulate a few ARM9 instructions, then a few ARM7 instructions, and go back and forth. In theory we could emulate each of these CPUs on its own thread, leaving more time on the main thread for the ARM9 emulation, which is our current bottleneck. The problem is that these threads still require lots of synchronization; we often have to “halt” the instruction emulation to process the rest of the DS machinery (graphics, interrupts, etc.). I have code that does this, but my synchronization mechanism is too slow: we lose more than we gain.
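To make the current single-threaded scheme concrete, here is a minimal C++ sketch of the back-and-forth between the two cores. It is only an illustration, not nds4droid's actual code: `Cpu`, `Hardware`, `runInterleaved`, and the fixed `batch` size are all hypothetical names and simplifications I made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for one emulated core (not nds4droid's real type).
struct Cpu {
    std::uint64_t executed = 0;  // instructions emulated so far

    // Pretend to emulate `budget` instructions and return how many were run.
    // A real core would fetch, decode and execute each instruction here.
    int run(int budget) {
        assert(budget >= 0);
        executed += static_cast<std::uint64_t>(budget);
        return budget;
    }
};

// Hypothetical stand-in for the rest of the DS machinery
// (graphics, interrupts, etc.).
struct Hardware {
    std::uint64_t updates = 0;
    void update() { ++updates; }
};

// Single-threaded interleaving: a few ARM9 instructions, then a few ARM7
// instructions, then the shared hardware, repeated `slices` times.
void runInterleaved(Cpu& arm9, Cpu& arm7, Hardware& hw, int slices, int batch) {
    for (int i = 0; i < slices; ++i) {
        arm9.run(batch);
        arm7.run(batch);
        hw.update();
    }
}
```

Everything happens in one loop on one thread, so no locking is needed, but the ARM7 slice and the hardware update both eat into the time available for the ARM9.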

In my experiments, I spawn a new thread for each CPU. These threads sit in a “wait state”, waiting to be told to emulate instructions. They’re then signaled to execute, and halt again when they detect that some other work needs to be done (usually about 6-12 instructions are emulated before we need to halt again). The problem is that my signaling mechanism is too slow. I’ve tried the standard model of pthread mutexes, but these are far too slow here. From my understanding, pthread mutexes involve kernel syscalls when a thread has to block, and yield the thread’s timeslice (which is far too long; we need to execute again quickly). Spinlocks are also very slow (although I’m not 100% sure why). If anyone knows of or can suggest better thread synchronization models, or libraries that do intelligent user-mode synchronization, it’d be greatly appreciated!
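For reference, the wait/signal handoff described above looks roughly like the following when written with a standard mutex and condition variable. This is a sketch under assumptions, not the experimental code itself: the `CpuWorker` class, its `runBatch` method, and the counter that stands in for real instruction emulation are all hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical worker that "emulates" one CPU on its own thread.
class CpuWorker {
public:
    CpuWorker() : thread_([this] { loop(); }) {}

    ~CpuWorker() {
        {
            std::lock_guard<std::mutex> lock(m_);
            state_ = State::Quit;
        }
        cv_.notify_all();
        thread_.join();
    }

    // Signal the worker to emulate `batch` instructions, then block
    // until it has halted again.
    void runBatch(int batch) {
        assert(batch >= 0);
        std::unique_lock<std::mutex> lock(m_);
        batch_ = batch;
        state_ = State::Run;
        cv_.notify_all();
        cv_.wait(lock, [this] { return state_ == State::Idle; });
    }

    std::uint64_t executed() {
        std::lock_guard<std::mutex> lock(m_);
        return executed_;
    }

private:
    enum class State { Idle, Run, Quit };

    void loop() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            // The "wait state": sleep until told to run or to quit.
            cv_.wait(lock, [this] { return state_ != State::Idle; });
            if (state_ == State::Quit) return;
            // Stand-in for emulating `batch_` instructions.
            executed_ += static_cast<std::uint64_t>(batch_);
            state_ = State::Idle;  // halt again and hand control back
            cv_.notify_all();
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    State state_ = State::Idle;
    int batch_ = 0;
    std::uint64_t executed_ = 0;
    std::thread thread_;  // declared last so the members above exist first
};
```

Each call to `runBatch` costs two thread wake-ups (one to start the worker, one to resume the caller). With only a handful of instructions per batch, that handoff can easily cost more than the emulation work it wraps, which matches the "we lose more than we gain" result.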

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
