Quote Originally Posted by poke349
lol... I thought I was always bad at explaining stuff.
Also, I don't always know what I'm talking about. So take my posts with a grain of salt.

What you're seeing is load imbalance due to NUMA. Some nodes run faster than others depending on which node's memory the data lives in. So you'll find one or two nodes lagging far behind the others. When the threads on the faster nodes finish first, the slower threads keep running long afterwards while the finished cores sit idle. That's why you see low CPU usage.

I was confused as well when I first noticed this on the 4 x 4 that Skycrane sent.
From the beginning, I had already suspected this was the cause. But I couldn't confirm it until I wrote some mini-benchmarks to specifically test for this.

There may be other causes, but so far that's the only explanation I have.
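To make the load-imbalance explanation above concrete, here is a toy calculation (not poke349's benchmark). It assumes every thread gets the same amount of work, that all threads start together, that the job ends when the slowest thread finishes, and that threads on a "slow" node take a hypothetical fixed multiple of the normal time because their data sits in another node's memory. The node counts, work units and penalty are made-up illustration values, not measurements.

```python
def utilization(busy_times):
    """Fraction of total core-time spent busy, assuming all threads start
    together and the job ends when the slowest thread finishes."""
    wall = max(busy_times)
    return sum(busy_times) / (len(busy_times) * wall)


def numa_busy_times(nodes, cores_per_node, work, slow_nodes, penalty):
    """Hypothetical model: each core does `work` units, but cores on nodes
    in `slow_nodes` take `penalty` times longer (e.g. remote memory access)."""
    times = []
    for node in range(nodes):
        factor = penalty if node in slow_nodes else 1.0
        times.extend([work * factor] * cores_per_node)
    return times


# A 4-socket x 4-core box with one node running at half speed.
balanced = numa_busy_times(4, 4, 10.0, slow_nodes=set(), penalty=1.0)
imbalanced = numa_busy_times(4, 4, 10.0, slow_nodes={3}, penalty=2.0)

print(f"{utilization(balanced):.3f}")    # 1.000
print(f"{utilization(imbalanced):.3f}")  # 0.625
```

With one slow node out of four, the other twelve cores sit idle for the second half of the run, so average CPU usage drops to 62.5% even though no thread is blocked on anything but memory.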
Well, if you need another quad-socket system to test stuff with, my quad dual-core (8 cores total) at home is available. It's not the fastest system by any stretch of the imagination now. I'll also still have access to my 48-core system at work for about another week (I'm switching jobs).

And you're actually pretty good at explaining stuff. Better than most profs and even some programmers I've met. I'm pretty sure you have a good idea of what you're talking about, because otherwise you probably wouldn't be writing these programs and doing what you're doing. (Last I recall, you're in grad school...so I'd presume you gotta know SOMETHING.)