Thanks
Well... it IS my research area right now. So I "better" learn it at some point for me to be of any use.
lol?
I haven't, but others have. The scalability sucks on NUMA - because it's all shared-memory programming.
On Windows, I use threads directly to bypass OpenMP overhead. But that doesn't solve the scalability problems of OpenMP. For that, as mentioned in earlier posts, I need MPI.
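For anyone wondering what "threads directly" looks like, here's a rough sketch. It uses portable `std::thread` as a stand-in, not my actual Windows code, and `parallel_sum` is just a made-up name for illustration. Each thread gets its own slice of the work and its own output slot, and the results are combined after joining.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper (illustration only): sums `data` with explicitly
// spawned threads instead of an OpenMP parallel-for.
long long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;

    // One result slot per thread, so no locking is needed.
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;

    // Split the input into roughly equal contiguous chunks.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
            partial[t] = std::accumulate(first, last, 0LL);
        });
    }

    // Wait for every worker, then combine the per-thread results.
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Note that this is still shared-memory programming: every thread reads the same `data` array, so on a NUMA box you run into the same remote-memory problem as with OpenMP. It only gets rid of the OpenMP runtime overhead.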
Look at some of the quad-socket results on my thread. (I moved them off my thread to my site, but I link to the full list from my thread.)
You can see that the quad-socket Barcelonas don't do too well... (they get beaten by single-socket i7s)
There's also an 8-socket Barcelona in there - less than 10% faster than the quad-sockets at the same clock.
The 4-socket Beckton machine gets like 40% less "throughput/cycle" compared to the Gainestowns and Westmeres...
But... If you compare single-socket to dual-socket, they scale almost perfectly. (1.8x - 1.9x speedup from 1 -> 2 sockets @ same clock)
Core 2 -> Harpertown: This is all uniform memory. Even the dual-socket Harpertown is uniform memory - both sockets go through the same external memory controller.
Core i7 -> Gainestown: Gainestown/Westmere is NUMA, but barely so. The latency penalty for accessing the other socket's memory is only 30% - most of which gets hidden behind HyperThreading... lol
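To put the scaling numbers above in perspective, here's a small sketch of the usual "parallel efficiency" arithmetic: measured speedup divided by the ideal (linear) speedup. The function name `scaling_efficiency` is made up for illustration. The inputs are the rough figures quoted in this post, not new measurements.

```cpp
// Parallel efficiency = measured speedup / ideal speedup.
// `resource_ratio` is how many times more sockets you threw at the problem.
// (Hypothetical helper, for illustration only.)
double scaling_efficiency(double speedup, double resource_ratio) {
    return speedup / resource_ratio;
}

// Figures taken from the post:
//   1 -> 2 sockets, 1.8x to 1.9x speedup:
//     scaling_efficiency(1.8, 2.0) is about 0.90
//     scaling_efficiency(1.9, 2.0) is about 0.95
//   4 -> 8 sockets (Barcelona), under 1.1x speedup:
//     scaling_efficiency(1.1, 2.0) is about 0.55 at best
```

So doubling the sockets on the dual-socket boxes keeps roughly 90-95% of the ideal gain, while going from 4 to 8 sockets on Barcelona keeps 55% at most.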