You are correct, but there's an even bigger problem with the program.
y-cruncher allocates all the memory it needs in a single gigantic malloc(). The OS will put that entire memory block on the node that called the malloc(). So the result is that most (if not all) of the memory will be concentrated in a single node.
The 4 nodes will all try to access that same memory - all of which is in one node. So not only do you have contention on the interconnect, but the aggregate memory bandwidth is only equal to that of one socket, since all the memory is concentrated there.
If the memory is interleaved, you still don't get rid of the interconnect traffic, but at least now you have 4 sockets of memory bandwidth.
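To put rough numbers on that, here is a toy model (not real OS calls, and not anything from y-cruncher). It assumes a hypothetical 4-node machine, one thread per node, and every thread reading every page once. The names `Placement`, `node_of_page`, `pages_per_node`, `nodes_serving` and `remote_accesses` are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kNodes = 4;  // assumption: a 4-socket machine

enum class Placement { SingleNode, Interleaved };

// Toy rule for which node owns a page: everything on node 0,
// or pages dealt out round-robin across the nodes.
inline std::size_t node_of_page(std::size_t page, Placement p) {
    return p == Placement::SingleNode ? 0 : page % kNodes;
}

// Count how many pages end up on each node.
inline std::array<std::size_t, kNodes> pages_per_node(std::size_t pages, Placement p) {
    std::array<std::size_t, kNodes> counts{};
    for (std::size_t i = 0; i < pages; ++i) ++counts[node_of_page(i, p)];
    std::size_t total = 0;
    for (std::size_t c : counts) total += c;
    assert(total == pages);  // every page lives on exactly one node
    return counts;
}

// Number of nodes that hold at least one page, i.e. how many
// memory controllers can serve traffic.
inline std::size_t nodes_serving(const std::array<std::size_t, kNodes>& counts) {
    std::size_t n = 0;
    for (std::size_t c : counts) {
        if (c != 0) ++n;
    }
    return n;
}

// Accesses that cross the interconnect when one thread per node
// reads every page once.
inline std::size_t remote_accesses(std::size_t pages, Placement p) {
    std::size_t remote = 0;
    for (std::size_t thread_node = 0; thread_node < kNodes; ++thread_node) {
        for (std::size_t i = 0; i < pages; ++i) {
            if (node_of_page(i, p) != thread_node) ++remote;
        }
    }
    return remote;
}
```

With 1024 pages, both placements give the same count from `remote_accesses` (3 out of every 4 accesses are remote), but `nodes_serving` goes from 1 to 4. So in this model interleaving doesn't touch the interconnect traffic at all. It only spreads the load over all four memory controllers.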
This problem does not occur in "normal" applications that do lots of small allocations. For these normal apps, the OS will bias each allocation toward the node that requested it - so memory accesses are mostly local.
But since y-cruncher does one massive malloc() at the beginning, the OS is not able to do this.
There were several major reasons why I made the program do a single malloc() and its own memory management as opposed to letting the OS handle it.
- Memory allocation is expensive (hundreds/thousands of cycles?)
- malloc() is not thread-safe. Placing locks around it is not scalable. Even if it is thread-safe (with the right compiler options), it's still implemented with locks.
- Memory allocation is not deterministic. The address that malloc() returns differs between program instances even with the exact same settings.
The memory manager I put into the program is deterministic - even with multi-threading.
It's a lot easier to hold together 100,000 lines of multi-threaded code if it's deterministic. :yepp:
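For anyone wondering what "deterministic" can look like in practice, here is a minimal sketch. It is NOT y-cruncher's actual memory manager, just one common way to get the same properties: a single upfront malloc(), bump allocation by offset, and a fixed split of the space so each thread gets its own slice and needs no locks. `Block`, `Region`, `allocate`, `split` and `at` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// A range of offsets [begin, end) inside the big block.
// Hypothetical type, for illustration only.
struct Region {
    std::size_t begin;  // first free offset
    std::size_t end;    // one past the last usable offset

    // Bump-allocate `bytes` and return the block's offset.
    // The result depends only on the sequence of requests,
    // never on the address malloc() happened to return.
    // Note: alignment is applied to the offset, so it only holds for the
    // final address if the base pointer is at least as aligned.
    std::size_t allocate(std::size_t bytes,
                         std::size_t align = alignof(std::max_align_t)) {
        assert(align != 0 && (align & (align - 1)) == 0);  // power of two
        std::size_t start = (begin + align - 1) & ~(align - 1);
        if (start > end || bytes > end - start) throw std::bad_alloc();
        begin = start + bytes;
        return start;
    }

    // Hand the remaining space out as `parts` equal slices, e.g. one per
    // thread. Each thread then allocates in its own slice without locks.
    std::vector<Region> split(std::size_t parts) {
        assert(parts != 0);
        std::size_t slice = (end - begin) / parts;
        std::vector<Region> out;
        for (std::size_t i = 0; i < parts; ++i) {
            out.push_back(Region{begin + i * slice, begin + (i + 1) * slice});
        }
        begin = end;  // the parent gives up everything it had left
        return out;
    }
};

// Owns the single upfront malloc() and turns offsets into pointers.
class Block {
public:
    explicit Block(std::size_t bytes)
        : base_(static_cast<unsigned char*>(std::malloc(bytes))), size_(bytes) {
        if (base_ == nullptr) throw std::bad_alloc();
    }
    ~Block() { std::free(base_); }
    Block(const Block&) = delete;
    Block& operator=(const Block&) = delete;

    Region root() const { return Region{0, size_}; }
    void* at(std::size_t offset) const { return base_ + offset; }

private:
    unsigned char* base_;
    std::size_t size_;
};
```

Two runs that make the same requests get the same offsets, even though the base address from malloc() differs. The slices returned by `split` depend only on the slice count, so thread scheduling can't change the layout either.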
I made this decision around October 2008 or so (quite a while ago) - but still before the first version of y-cruncher. It's worked fine since then.
For NUMA, this approach obviously backfires like crazy. It's still possible to make it work without giving up determinism or scalability - but it's messy. It's something I'm probably gonna put off for a few major releases while I redesign/rewrite the program.
There are about 80,000 lines of (nearly) completed code that haven't been enabled yet because the new internal interface is incompatible with what y-cruncher currently uses.
It's gonna take a while to migrate the program to that new interface.
The NUMA layer will eventually be built on top of all the existing "normal" threading code - though I have yet to work out the details.