Originally posted by informal:
The way I see it, that Intel fellow indirectly confirmed what Hans already found out from the SB die photo.
Or not

Great questions … some more details to add to the response Max gave.

1) The chart is wrong, we will fix it. Sandy Bridge has true 256-bit FP execution units (mul, add, shuffle). They are on exactly the same execution ports as the 128-bit versions. You can get a 256-bit multiply (on port 0), a 256-bit add (on port 1), and a 256-bit shuffle (on port 5) every cycle. 256-bit FP add and multiply throughput is therefore 2X the flops of 128-bit. See IACA for the ports on an instruction-by-instruction basis.
2) The chart doesn’t mention 16-byte paths. We have true 32-byte loads (i.e. each load uses only one AGU resource, and we have 2 AGUs), but only 48 bytes per cycle in total are supported to the L1. You can’t get 48 bytes per cycle to the DCU using 128-bit operations (only 2 AGUs…). This is why a simple memory-limited kernel like matrix add (load, load, add, store) measures a 1.42X speedup (we would have predicted 1.5X in the limit with the current architecture, vs. 1.0X if we had double pumped).
http://software.intel.com/en-us/foru...st.php?p=97176
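
To sanity-check the 2X, 1.5X and 1.0X figures in that reply, here is a rough back-of-the-envelope model in Python. It is only a sketch and not anything Intel published. It takes the two numbers given in the quote (2 AGUs, 48 bytes/cycle to the L1) and assumes that every load or store costs one AGU slot, that a hypothetical double-pumped design would cost two AGU slots per 256-bit access, and that nothing else limits the loop. The function and constant names are made up for illustration.

```python
from fractions import Fraction

AGUS = 2                 # address-generation units (figure from the quoted reply)
L1_BYTES_PER_CYCLE = 48  # total L1 traffic per cycle (figure from the quoted reply)
ACCESSES = 3             # matrix add: load, load, store per iteration


def cycles_per_iteration(width_bytes, agu_slots_per_access=1):
    """Cycles for one load/load/add/store iteration under this toy model.

    The iteration is limited by whichever is slower: issuing the
    addresses through the AGUs, or moving the bytes to/from the L1.
    """
    agu_cycles = Fraction(ACCESSES * agu_slots_per_access, AGUS)
    bandwidth_cycles = Fraction(ACCESSES * width_bytes, L1_BYTES_PER_CYCLE)
    return max(agu_cycles, bandwidth_cycles)


def bytes_per_cycle(width_bytes, agu_slots_per_access=1):
    """Sustained L1 traffic in bytes per cycle for the matrix-add loop."""
    cycles = cycles_per_iteration(width_bytes, agu_slots_per_access)
    return Fraction(ACCESSES * width_bytes) / cycles


def peak_sp_flops_per_cycle(width_bits):
    """One multiply (port 0) plus one add (port 1) per cycle, 32-bit lanes."""
    lanes = width_bits // 32
    return 2 * lanes


sse = bytes_per_cycle(16)               # 128-bit ops, AGU-limited
avx_true = bytes_per_cycle(32)          # true 256-bit accesses, L1-limited
avx_pumped = bytes_per_cycle(32, 2)     # ASSUMED double-pumped: 2 AGU slots each

print(float(avx_true / sse))            # predicted speedup with true 32-byte accesses
print(float(avx_pumped / sse))          # predicted speedup if double pumped
print(peak_sp_flops_per_cycle(256) / peak_sp_flops_per_cycle(128))
```

Under these assumptions the 128-bit loop is stuck at 32 bytes/cycle because three accesses have to share two AGUs, while the 256-bit loop runs into the 48 bytes/cycle L1 limit instead, which gives the 1.5X prediction. The assumed double-pumped case falls back to 32 bytes/cycle, i.e. 1.0X.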

What can we understand from "true 256-bit EU" and the denial of double pumping?