Originally posted by terrace215:
The comparison we want is 256 bit sum[A1 + A2 + ... + A_n] (to use your example) on 256 bit hardware vs 128 bit hardware.

Say n = 16, just for grins. What's the # clock cycles needed in each case?
This particular example is of course an ideal case for 256 bit hardware,
so L1 cache bandwidth will determine the throughput: Sandy Bridge's
48 bytes/cycle versus Llano's 32 bytes/cycle.
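
For the n = 16 case, a rough back-of-the-envelope sketch may help. It assumes all operands already sit in L1, that the bytes/cycle figures quoted above can be sustained for this access pattern, and that add latency and the final reduction are ignored. The function name min_cycles and its parameters are made up for illustration only.

```python
import math

def min_cycles(n_operands, operand_bytes, l1_bytes_per_cycle):
    """Lower bound on cycles if streaming operands from L1 is the only limit.

    Assumption: the bandwidth figure is fully sustained; real code will
    also pay for add latency and loop overhead.
    """
    total_bytes = n_operands * operand_bytes
    return math.ceil(total_bytes / l1_bytes_per_cycle)

N = 16              # number of 256 bit operands, from the quoted question
OPERAND_BYTES = 32  # 256 bits = 32 bytes

# Bandwidth figures are the ones quoted in this post, taken at face value.
for name, bandwidth in (("Sandy Bridge", 48), ("Llano", 32)):
    print(name, min_cycles(N, OPERAND_BYTES, bandwidth))
# prints:
# Sandy Bridge 11
# Llano 16
```

So under these assumptions the 16 operands (512 bytes) need at least 11 cycles at 48 bytes/cycle and 16 cycles at 32 bytes/cycle. These are only bandwidth lower bounds, not measured numbers.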

(Note that in this kind of case there is no advantage from HT for Sandy
Bridge, since a single thread already utilizes 100% of the resources.)


Regards, Hans