Update: Your 25m re-run is probably slower because of normal run-to-run variation. +/- 0.1 seconds is very typical with so many threads. (I see it myself on my Harpertowns...)

You should try some of the larger sizes... The program doesn't scale well with multi-threading for small computations.

You'll need to super-size it to at least 250 million before it can bring out the power of those 16 virtual cores.
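
To get a rough feel for why small runs don't scale, here is a toy model in Python. It is not how the program actually works: the function name `modeled_speedup`, the assumption that each thread adds a fixed startup/synchronization cost, and the 0.01 second overhead figure are all made up for illustration. The only point is that a fixed per-thread cost eats most of the gain when the total work is small and becomes negligible as the work grows.

```python
def modeled_speedup(work_seconds, threads, overhead_per_thread=0.01):
    """Toy model (hypothetical numbers, not measurements).

    Assumes the work divides evenly across threads and that every
    thread adds a fixed cost for startup and synchronization.
    """
    parallel_time = work_seconds / threads + overhead_per_thread * threads
    return work_seconds / parallel_time


if __name__ == "__main__":
    # Single-threaded run times standing in for small, medium, and large sizes.
    for work in (0.5, 5.0, 50.0):
        speedup = modeled_speedup(work, threads=16)
        print(f"{work:5.1f}s of work -> {speedup:.2f}x on 16 threads")
```

In this model the speedup creeps toward 16x only as the single-threaded run time gets large compared to the fixed overhead, which is the same pattern you should see when you move up to the bigger sizes.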