Originally Posted by
drfedja
I mean 2x 64-bit stores or 2x 128-bit loads for 10h. Because 10h can't execute 256-bit AVX instructions, it loads data in 128-bit chunks.
That is 128 bits/cycle for stores, or 256 bits/cycle for loads.
In the Bulldozer core (not module), there is a 256-bit load plus a 128-bit store at the same time. A Bulldozer module can do double that.
A Bulldozer core can calculate 2 addresses at the same time because it has 2 AGUs (address generation units).
A Sandy Bridge core can also do 2 address operations at once, because it has 2 L/S AGUs. It takes a slightly different approach for stores: the SB store unit is attached to the scheduler.
Yes, per core it has 2 ALUs and 2 AGUs. I've made detailed diagrams of the Bulldozer module, K10, Nehalem and, of course, the Sandy Bridge HT core architecture.
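To make the per-cycle numbers from the quote easier to compare, here is a rough tally in Python. It only restates the figures given in the quoted post (they are not measured or checked against vendor documentation), and the names `K10_CORE`, `BULLDOZER_CORE` and `per_module` are made up for this sketch. It also assumes a Bulldozer module is simply two cores' worth of load/store width, as the quote says.

```python
# Peak load/store widths in bits per cycle, as stated in the quoted post.
# These are the poster's figures, not verified measurements.

# K10 (family 10h): 2x 128-bit loads OR 2x 64-bit stores per cycle.
K10_CORE = {"load": 2 * 128, "store": 2 * 64, "simultaneous": False}

# Bulldozer core: one 256-bit load AND one 128-bit store in the same cycle.
BULLDOZER_CORE = {"load": 256, "store": 128, "simultaneous": True}


def per_module(core, cores_per_module=2):
    """Scale a core's per-cycle widths to a module (assumed: 2 cores)."""
    return {
        "load": core["load"] * cores_per_module,
        "store": core["store"] * cores_per_module,
        "simultaneous": core["simultaneous"],
    }


def peak_bits_per_cycle(unit):
    """Best-case bits moved in one cycle.

    If loads and stores can issue together, add them;
    otherwise take whichever of the two is wider.
    """
    if unit["simultaneous"]:
        return unit["load"] + unit["store"]
    return max(unit["load"], unit["store"])


BULLDOZER_MODULE = per_module(BULLDOZER_CORE)

for name, unit in [
    ("K10 core", K10_CORE),
    ("Bulldozer core", BULLDOZER_CORE),
    ("Bulldozer module", BULLDOZER_MODULE),
]:
    print(name, unit["load"], unit["store"], peak_bits_per_cycle(unit))
```

Treating the K10 case as load-or-store follows the "or" in the quote; if 10h can in fact overlap them, set `simultaneous` to `True` for that entry.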