Great questions. Here are some more details to add to the response Max gave.
1) The chart is wrong; we will fix it. Sandy Bridge has true 256-bit FP execution units (mul, add, shuffle). They are on exactly the same execution ports as the 128-bit versions. You can issue a 256-bit multiply (on port 0), a 256-bit add (on port 1), and a 256-bit shuffle (on port 5) every cycle. Peak 256-bit FP add and multiply throughput is therefore 2X the flops of 128-bit. See IACA for the ports on an instruction-by-instruction basis.
2) The chart doesn’t mention 16-byte paths. We have true 32-byte loads (i.e. each load uses only one AGU resource, and we have 2 AGUs), but only 48 bytes per cycle in total is supported to the L1. You can’t get 48 bytes per cycle to the DCU using 128-bit operations (only 2 AGUs…). This is why a simple memory-limited kernel like matrix add (load, load, add, store) measures a 1.42X speedup (we would have predicted 1.5X with the current architecture in the limit, vs. 1.0X if we had double pumped).
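
To make the 1.5X and 1.0X predictions concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch under simplifying assumptions that are not from the hardware documentation: the kernel is limited purely by the 2 AGUs and the 48 bytes/cycle L1 total, the add itself is free, and "double pumped" is modeled as each 256-bit memory operation taking two AGU slots. The function and constant names are made up for illustration.

```python
from fractions import Fraction

AGU_OPS_PER_CYCLE = 2     # two AGUs (from the post)
L1_BYTES_PER_CYCLE = 48   # total bytes/cycle to the L1 (from the post)
MEM_OPS_PER_VECTOR = 3    # matrix add per result vector: load, load, store


def bytes_stored_per_cycle(vector_bytes, agu_slots_per_op=1):
    """Result bytes produced per cycle under the simplified two-limit model.

    agu_slots_per_op=2 is our stand-in for a double-pumped design, where a
    256-bit access would occupy an AGU twice (an assumption of this sketch).
    """
    # Cycles needed if only address generation were the limit.
    agu_cycles = Fraction(MEM_OPS_PER_VECTOR * agu_slots_per_op, AGU_OPS_PER_CYCLE)
    # Cycles needed if only L1 bandwidth were the limit.
    bw_cycles = Fraction(MEM_OPS_PER_VECTOR * vector_bytes, L1_BYTES_PER_CYCLE)
    # The slower of the two limits sets the pace.
    cycles = max(agu_cycles, bw_cycles)
    return Fraction(vector_bytes) / cycles


sse = bytes_stored_per_cycle(16)                          # 128-bit ops: AGU-bound
avx = bytes_stored_per_cycle(32)                          # 256-bit ops: bandwidth-bound
avx_double_pumped = bytes_stored_per_cycle(32, agu_slots_per_op=2)

print("256-bit vs 128-bit speedup:", float(avx / sse))
print("double-pumped vs 128-bit speedup:", float(avx_double_pumped / sse))
```

In this model the 128-bit version is limited by the AGUs (3 memory operations take 1.5 cycles but move only 48 bytes), while the 256-bit version is limited by the 48 bytes/cycle total (96 bytes take 2 cycles), which gives the 1.5X ratio. The measured 1.42X sits a little below that limit.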