Quote: Originally posted by Hans de Vries
It's not that "crippled", not by a factor of 2 (= 256/128). For example:
If an SIMD FP add takes 4 clock cycles then:

128 bit: A+B+C takes 8 clock cycles.
256 bit: A+B+C takes 9 clock cycles. (using pipelined 128 bit hardware)

128 bit: A+B+C+D takes 9 clock cycles.
256 bit: A+B+C+D takes 11 clock cycles. (using pipelined 128 bit hardware)
The comparison we want is the same 256 bit sum A_1 + A_2 + ... + A_n (to use your example) run on 256 bit hardware vs. on 128 bit hardware, not a 128 bit sum vs. a 256 bit sum.

Say n = 16, just for grins. What's the number of clock cycles needed in each case?
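
For anyone who wants to play with the numbers, here is a rough Python sketch of one way to count cycles. It is a toy model and rests on assumptions that are mine, not from the quoted post: a single pipelined adder that accepts one micro-op per cycle, a 4-cycle add latency, a 256 bit add split into two independent 128 bit halves on the narrow hardware, and a simple greedy scheduler that issues an add whenever two operands of the same half are ready. The function name `reduction_cycles` and its parameters are made up for illustration. It does reproduce the 8/9 and 9/11 figures quoted above, but a greedy schedule is not necessarily the best one for larger n.

```python
def reduction_cycles(n, latency=4, vector_bits=256, hardware_bits=256):
    """Cycles to sum n vectors on one pipelined adder (toy model).

    Assumptions: one micro-op issues per cycle, each add takes `latency`
    cycles, and a vector wider than the hardware is split into
    independent halves that compete for the same issue slot.
    """
    lanes = max(1, vector_bits // hardware_bits)  # micro-ops per vector add
    # ready[l] lists the cycles at which lane l's pending values become available.
    ready = [[0] * n for _ in range(lanes)]
    cycle = 0
    finish = 0
    while any(len(vals) > 1 for vals in ready):
        for vals in ready:
            vals.sort()
            if len(vals) > 1 and vals[1] <= cycle:
                # Two operands of this lane are available: issue one add.
                del vals[:2]
                vals.append(cycle + latency)
                finish = max(finish, cycle + latency)
                break  # only one micro-op may issue per cycle
        cycle += 1
    return finish


if __name__ == "__main__":
    for n in (3, 4, 16):
        wide = reduction_cycles(n, hardware_bits=256)
        narrow = reduction_cycles(n, hardware_bits=128)
        print(f"n={n}: 256 bit hardware {wide} cycles, 128 bit hardware {narrow} cycles")
```

Running it prints the model's answer for n = 16 next to the two small cases, so the ratio between the two hardware widths can be read off directly. Changing `latency` or the scheduling rule will change the result, which is rather the point of the question.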