I've been meaning to post in here for a while, as I have access to an 8-socket Barcelona machine (with ~80GB RAM) in our research group at school that I've tried running things on. The NUMA nature of the machine definitely causes sub-optimal scaling for the smaller workloads (those below 1B).
I tried numactl for 1B and the run time went from 648 sec (without) to 466 sec (interleave all), so it does make a noticeable difference. That's almost a 1.4x speedup in this case.
For comparison (without numactl), 10B takes 5,379 sec and 100M takes 110 sec.
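For anyone who wants to check the ratios, here is a quick sketch using only the timings quoted above. The variable names are just mine for illustration, and the size labels are the same ones used in this post.

```python
# Timings quoted in this post, in seconds, keyed by workload size label.
timings_default = {"100M": 110, "1B": 648, "10B": 5379}  # without numactl
timing_1b_interleaved = 466  # 1B with numactl interleaving across all nodes

# Speedup from interleaving at the 1B size.
speedup = timings_default["1B"] / timing_1b_interleaved
print(f"1B speedup from interleaving: {speedup:.2f}x")

# How much longer each 10x larger workload takes, without numactl.
ratio_100m_to_1b = timings_default["1B"] / timings_default["100M"]
ratio_1b_to_10b = timings_default["10B"] / timings_default["1B"]
print(f"100M -> 1B: {ratio_100m_to_1b:.2f}x the time")
print(f"1B -> 10B: {ratio_1b_to_10b:.2f}x the time")
```

This only restates the numbers I measured; it says nothing about where the time actually goes.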
Yeah, tying all of the memory allocation to one thread/node doesn't do so well with NUMA, I imagine. Beyond just the interconnect bandwidth impact, coherency traffic must become a significant overhead as well.
Determinism is definitely desirable with parallel programs, but in what capacity are you using the deterministic malloc()? Is it mainly for debugging, or for structured memory/pointer accesses? As for thread safety, are you saying that just using barriers to synchronize properly isn't good enough?
At the expense of some memory capacity, could you have each thread malloc() a (relatively) small buffer to work out of, in the interest of creating some locality and doing burst transfers?
I'm curious about the current implementation as well.
Is it all pthread-based? MPI would definitely be the way to go for ensuring proper locality on a NUMA system, at the expense of all the icky manual data sharing. I've only done some basic work with OpenMP, so I don't know much about the advanced details like optimizations for NUMA systems, but is it really applicable to your current algorithm?
I remember reading some time ago about your implementation spawning a large number of threads. Do you limit the number of worker threads to the available hardware threads, or go beyond? Having excessive threads would likely thrash around too much on larger NUMA systems and be counterproductive.
How independent are the threads? Is there much synchronization/sharing of data between them, or does each basically work on its own independent chunk of the series and combine in a tree-reduce fashion?
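To make the question concrete, here is roughly the pattern I have in mind, as a minimal Python sketch. It is not a claim about how your code works: the series (sum of k*k), the function names, and the chunking scheme are all made up for illustration, and Python threads won't give real CPU parallelism here, so it only shows the structure. Each worker sums its own independent chunk, the worker count is capped at the hardware thread count, and the partial results are combined pairwise.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def partial_sum(bounds):
    """Sum the terms k*k for k in [start, stop): one independent chunk."""
    start, stop = bounds
    return sum(k * k for k in range(start, stop))


def tree_reduce(values, combine):
    """Combine values pairwise, level by level, until one remains."""
    values = list(values)
    while len(values) > 1:
        paired = [combine(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:  # odd one out is carried up to the next level
            paired.append(values[-1])
        values = paired
    return values[0]


def chunked_series_sum(n_terms, n_workers=None):
    """Split the series into per-worker chunks, then tree-reduce the partials."""
    if n_terms <= 0:
        return 0
    # Default to the hardware thread count (os.cpu_count() may return None).
    n_workers = n_workers or os.cpu_count() or 1
    step = -(-n_terms // n_workers)  # ceiling division
    chunks = [(lo, min(lo + step, n_terms)) for lo in range(0, n_terms, step)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return tree_reduce(partials, lambda a, b: a + b)


if __name__ == "__main__":
    print(chunked_series_sum(1000, n_workers=4))
```

If the real threads look like this, with no sharing until the combine step, then keeping each chunk's memory local to the node that works on it seems like it should be fairly straightforward.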