A word on multithreading

A lot of people have asked why nds4droid only takes 50% of their CPU on quad core devices. The reason is generally straightforward: nds4droid has two main threads, one that does the actual emulation and one that does all drawing/compositing/input processing. The actual “main” emulation thread is bottleneck and why the emulator can be slow; we’re only doing the “heavy lifting” using one core. When emulating the DS we have to emulate the CPU instruction per instruction, and doing this on several threads is an extremely complicated (if not an impossible proposition). However, there is some low hanging fruit for parallelization: the DS actually is made up of 2 CPUS, the main ARM9 as well as an auxiliary ARM7 (the same CPU as in the GBA, this is how backwards compatibility was achieved). The ARM7 does sound processing and some 2D graphics work. Currently on the main thread we emulate a few ARM9 instructions, then a few ARM7, and back and forth. In theory we could emulate each of these CPUs on their own thread, giving more time on the main thread for the ARM9 emulation which is our current bottleneck. The problem is that these threads still require lots of synchronization; we often have to “halt” the instruction emulation to process the rest of the DS machinery (graphics, interrupts, etc). I have code that does this, the problem is that my synchronization mechanism is too slow… we lose more than we gain.

In my experiments, I spawn a new thread for each CPU. These then sit in a “wait state”, waiting to be told to emulate instructions. They’re then signaled to execute, and then halt again when they detect some other work needs to be done (usually about 6-12 instructions are emulated before we need to halt again). The problem is that my signaling mechanism is too slow: I’ve tried the standard model of pthread mutexes, but these are way too slow. From my understanding pthread mutexes involve kernel syscalls, and yield the thread timeslice (which is way too long, we need to execute again quickly). Spinlocks are also very slow (although I suppose I’m not 100% sure why). If anyone knows of/can suggest some better thread synchronization models, or maybe some libraries that do intelligent user mode synchronization it’d be greatly appreciated!

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

10 thoughts on “A word on multithreading”

  1. As i am in no way a developer i don’t have any solutions, but i do have a couple of ideas 😀 hope they will be of any use.

    As it has been suggested in the release post, the author of the Dsdroid emulator was thinking about teaming up with you or the Dsoid developer. He also claims that has an ARMjit almost working. I suppose you are aware, but just in case, here it is his twitter, github, XDA profile … http://www.seandev.ca/contact.php

    Just one week ago, Exophase released the DraStic emulator for the Open Pandora handheld. He has developed this Nintendo DS emulator on his own and it is not based in Desmume or in other previous software. It is closed source but he has stated that maybe he could share the code with a trusty person. Being you so committed with open source could make you that person. Here he talks about the emulator and shows a couple of (impressive) videos: http://boards.openpandora.org/index.php/topic/11333-drastic-nintendo-ds-emulator-teaser-video/
    And here is the release thread:

    Finally, talking about multithreading, the developer of PPSSPP (an open source psp emulator) has stated that their next milestone is to implement this. I suppose you know about this project too and how in a few months they have given shape to a neat piece of software that is evolving quite fast. Henrik Rydgard, the main developer, seems to be such a great guy and is always happy to lend a hand to everybody who ask in the forums. Not to mention that a great community of developers has aroused around the emulator. So there you could find plenty of help. Here it is their page: http://ppsspp.org/

    I have been following you since the very beginning of this emulator and i have to say that i love the way you think about open source and how this emulator is evolving. Thanks for all 😀

  2. Use a scheduler thread to feed each of the cpu cores a timeslice containing the number of cycles to execute. The scheduler then waits until all of the CPU cores have exhausted their allocated cycles before adding the next timeslice. Once the CPU core has exhausted the cycles, it waits (using a busy wait or event) until additional cycles are added by the scheduler. This way, the size of the timeslice can be controlled by the emulator. A larger timeslice will offset the overhead caused by the signaling. A timeslice of six to twelve instructions is a very short timeslice and would be the reason why the sync mechanism is not performing.

    1. Listen to skid, he’s a major contributor to the Dolphin Gamecube emulator. If anyone knows this shit, it’s him.

  3. I’m not a develloper, but I have an Idea:

    Well, we have 4 Cores ….
    Core 1 makes drawing/compositing/input processing
    Core 2 is for the ARM 9
    Core 3 is for the ARM 7 (and 9, if needed)

    Core 4 makes the Sync for the other 3 Cores
    No. 4 sends 6-12 Arm9-Instuctions to Core 2 and 6-12 Arm7-Instructions to Core 3. If one Core is finished, he sends the results back to Core 4. These results sends Core 4 to Core 1 (for drawing). Core 4 sents new Instructions to The finished Core. If the Arm9 Core is not ready yet, the Core 4 sends some Arm 9 to Core 3.

    So there aren’t any “wait-state”, because if one Core isn’t fast engough, the other core supports him by recieving the next instructions of the slow Core. The results must be saved into a cache untill the results of both cores are ready to sent them to core 1 for drawing etc.

    I hope, that you understand, what I wanted to say….

    There are two workers:
    The first one gets “math”-Instructions from the boss and
    the other gets “translation”-Instructions from the boss.

    The second one (Translation) finished his work before the first one and gives the Translation the boss. The boss is happy and put it on his desk.
    Math and Translatiom results of the first page must be printed togehter (if not, there are two pages), so the boss must wait until the first one ist finished.
    The second one thinks: “Haha, now I musn’t work, the first one is not ready yet. (wait-state).”
    But that was wrong. The boss gives (few) math-instructions to the translation one. He said: “Here, that is for the second page. And here are your normal Translation-Instructions”

    Now I hope, you understand me haha, sray for my bad english 🙂

    1. Only problem with that is that not everyone uses a quad-core phone.

      And before you say to just go out and buy one, there are people who aren’t willing to pay hundreds of dollars just to get a few differences out of it.

      1. He’s saying cores but essentially in programming you don’t hardcore cores, you simply spin up threads. Mutlicore processors will allocate one thread per core if they have sufficient cores, otherwise the workload of multiple threads will be distributed among the available cores during runtime

    2. Well this is a nice idea, but as jeff said the threads have to be synchronized really often, and this requires too many instructions atm.

      1. The method described here is interesting theoretically, basically having an executing thread and then threads to process the remaining instructions in parallel, then queuing them up to the execution thread unfortunately even though it sounds good in theory, computers don’t work like that, and in all probability even if this could be coded the execution thread would most likely become a bottleneck in it of itself.

Leave a Reply

Your email address will not be published.