Quote Originally Posted by agenda2005

The 128-bit SSE on Conroe will obviously benefit more than AMD64 because two 64-bit SSE(x) instructions will now require one cycle, unlike two in Prescott and Athlon 64. This is nobody's fault, since Intel has concentrated more effort on making a better CPU than on crippling competitors' performance in their compiler.
Just for clarification: SSE3 has always had 128-bit instructions. On earlier platforms they might need more cycles to execute, but the 128-bit instructions as such are present in any processor supporting SSE3.

I don't think it is quite correct to say that two 64-bit SSE operations will now take half the time. That is only the case when ...
  • ... there really are independent operations to compute. Executing two operations at the same time requires that neither depends on the outcome of the other, and that is frequently not the case.
  • ... and, unless it is hand-coded assembly, the compiler can prove the previous point. It can be nontrivial for the compiler to figure this out in a bulletproof way, and if it is not entirely sure it will fall back to conservative code.
  • ... and, to be most effective, the compiler is able to reorder instructions so that it can merge two 64-bit operations into one 128-bit one even if they are at different places in the source code. Proving that this is safe is nontrivial, too, in particular in languages like C/C++ where there is a lot of aliasing going on.
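
To make the aliasing point concrete, here is a minimal C sketch. The function names are made up for illustration, and what a given compiler actually emits depends on its version and flags, so take this as an assumption-laden example rather than a statement about any particular compiler. In the first function the compiler has to allow for dst overlapping a or b, so it cannot simply pair two neighbouring 64-bit additions into one 128-bit operation without extra checks. In the second, the C99 restrict qualifier is the programmer's promise that the arrays do not overlap, which removes that obstacle.

```c
#include <assert.h>
#include <stddef.h>

/* dst may overlap a or b: a store to dst[i] could change the value
   later read as a[i + 1] or b[i + 1], so the iterations are not
   provably independent. */
void add_may_alias(double *dst, const double *a, const double *b, size_t n)
{
    assert(n == 0 || (dst && a && b));
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}

/* restrict promises that dst, a and b do not overlap. Passing
   overlapping arrays here would be undefined behaviour. */
void add_no_alias(double *restrict dst, const double *restrict a,
                  const double *restrict b, size_t n)
{
    assert(n == 0 || (dst && a && b));
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

For example, with v = {1, 1, 1, 1}, the call add_may_alias(v + 1, v, v, 3) has to produce {1, 2, 4, 8}, because each iteration reads the value the previous one just wrote. Blindly pairing the first two iterations would compute v[2] from the old v[1] and give 2 instead of 4, which is exactly what the compiler has to rule out before it can merge the operations.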