AMD cuts to the core with 'Bulldozer' Opterons

**Dresdenboy** · 08-11-2010, 12:08 AM

Originally Posted by savantu

Or not

Great questions … some more details to the response Max gave.

1) The chart is wrong, we will fix it. Sandy Bridge has true 256-bit FP execution units (mul, add, shuffle). They are on exactly the same execution ports as the 128-bit versions. You can get a 256-bit multiply (on port 0) and a 256-bit add (on port 1) and a 256-bit shuffle (port 5) every cycle. 256-bit FP add and multiply bandwidth is therefore 2X higher flops than 128. See IACA for the ports on an instruction-by-instruction basis.
2) The chart doesn’t mention 16-byte paths. We have true 32-byte loads (i.e. each load only uses one AGU resource and we have 2 AGUs) but only a 48-byte/cycle total is supported to the L1 each cycle. You can’t get 48 bytes per cycle to the DCU using 128-bit operations (only 2 AGUs…). This is why a simple memory-limited kernel like matrix add (load, load, add, store) measures 1.42X speedup (would have predicted 1.5X with the current architecture in the limit; vs. 1.0X if we had double pumped).

http://software.intel.com/en-us/foru...st.php?p=97176

What can we understand from "true 256bit EU" and the denial of double pumping?

Did you check the context of the double pumping statement? To me it looks related to loads and the cache bandwidth/AGU resources, and it is contained in point 2), not point 1). I highlighted some different parts. Cache accesses etc. can be double pumped too.
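As a rough sanity check of the 1.5X vs. 1.0X figures in point 2), here is a back-of-the-envelope throughput model. It is my own simplification, not Intel's: the function names are made up, and I assume that every load and store takes one AGU slot, that all three memory operations count against the 48-byte/cycle L1 limit, and that a "double pumped" 256-bit access would cost two AGU slots.

```python
from fractions import Fraction

AGUS = 2                 # address-generation units, per the quoted post
L1_BYTES_PER_CYCLE = 48  # total L1 bandwidth cap, per the quoted post
MEM_OPS = 3              # matrix add per result vector: load, load, store

def cycles_per_vector(width_bytes, agu_slots_per_op=1):
    """Cycles per result vector under the AGU limit and the L1 byte limit."""
    agu_cycles = Fraction(MEM_OPS * agu_slots_per_op, AGUS)
    bw_cycles = Fraction(MEM_OPS * width_bytes, L1_BYTES_PER_CYCLE)
    return max(agu_cycles, bw_cycles)  # the slower limit wins

def result_bytes_per_cycle(width_bytes, agu_slots_per_op=1):
    """Bytes of result produced per cycle in this simplified model."""
    return Fraction(width_bytes) / cycles_per_vector(width_bytes, agu_slots_per_op)

sse = result_bytes_per_cycle(16)                             # 128-bit ops
avx_true = result_bytes_per_cycle(32)                        # true 32-byte accesses
avx_double_pumped = result_bytes_per_cycle(32, agu_slots_per_op=2)  # hypothetical

speedup_true = avx_true / sse
speedup_double_pumped = avx_double_pumped / sse

print(float(speedup_true), float(speedup_double_pumped))  # prints: 1.5 1.0
```

In this model the 128-bit case is AGU-bound (3 memory ops on 2 AGUs, so only 32 bytes/cycle of traffic), the true 256-bit case is bound by the 48 bytes/cycle cap, and the double-pumped case falls back to the AGU limit. That reproduces the 1.5X and 1.0X from the quote, which is why I read the double pumping remark as being about the load/store side.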

The first version of the chart (said to be wrong in 1)) contained "AVX LO" and "AVX HI" units, also drawn at the same width as the 128-bit units. Maybe they aren't even using double pumping but some other technique like wave pipelining (less likely).

How would you explain the nearly unchanged area of the FPU on die? Surely not by chip stacking.
