Quote Originally Posted by alpha754293
Very interesting. I gotta say, I'm learning a LOT from this thread, although honestly I still don't understand a lot of it, only because I'm not a programmer. BUT...I THINK that I sorta get the gist of it (and it's a pity that you're so far away, otherwise I think I could really learn programming from you), since you're one of the few people that I've EVER met who can explain it and explain it well!

Here's something interesting that I've noticed -- on the 48-core system that I've got at work - it's four sockets. I THINK that I have it set up so that there are four NUMA nodes.

And when I run y-cruncher on it, it doesn't really seem to quite take full advantage of the hardware and that's evident in the CPU utilization reported by Windows Task Manager and also by the program itself.

Is that (^ the above ^) why it's like that?

I'm only used to commercially available software that's MPI and/or OpenMP, and maybe it's cuz it's commercial that it has very high CPU utilization?

lol... I thought I was always bad at explaining stuff.
Also, I don't always know what I'm talking about. So take my posts with a grain of salt.

What you're seeing is load imbalance due to NUMA. Some nodes run faster than others depending on where the data is, since memory attached to a thread's own node is quicker to reach than memory on another node. So you'll find one or two nodes lagging far behind the others - and when the threads on the faster nodes finish first, their cores sit idle while the slower threads keep running for a lot longer. So you see low CPU usage.
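To put rough numbers on that, here's a toy model in Python. It is only a sketch: the function name `cpu_utilization` and the timings are made up for illustration, and it assumes each thread gets a fixed chunk of work and simply idles once it's done. It is not how y-cruncher actually schedules anything.

```python
def cpu_utilization(finish_times):
    """Average fraction of cores kept busy until the slowest thread finishes.

    finish_times: seconds each thread spends working (hypothetical values).
    """
    if not finish_times:
        raise ValueError("need at least one thread")
    wall_time = max(finish_times)   # the run lasts until the slowest thread is done
    busy_time = sum(finish_times)   # total core-seconds of real work
    return busy_time / (len(finish_times) * wall_time)


# Assumed setup: 48 threads, 12 per NUMA node.
balanced = [10.0] * 48                    # every node finishes in 10 s
imbalanced = [10.0] * 36 + [20.0] * 12    # one node takes twice as long

print(f"balanced:   {cpu_utilization(balanced):.1%}")    # 100.0%
print(f"imbalanced: {cpu_utilization(imbalanced):.1%}")  # 62.5%
```

So in this made-up case, a single node running at half speed is enough to drag the average CPU usage down to 62.5%, even though every thread did the same amount of work.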

I was confused as well when I first noticed this on the 4 x 4 that Skycrane sent.
I had suspected this was the cause from the beginning, but I couldn't confirm it until I wrote some mini-benchmarks to specifically test for it.

There may be more reasons for it, but so far that's the only explanation I have.