Quote: Originally Posted by informal
You have a total of 4 instructions executed by each integer core. In 10h you had a total of 3 (be it mem or math ops). That's a 33% difference. Now count in the massively improved prefetch and other stuff in the front end that are supposed to keep the core(s) busy all the time, and you have a potentially pretty nice boost in IPC. Remember that with 10h, the 3 ALUs were paired with AGUs and sat around just waiting for data, doing nothing. The new 2+2 scheme is built to address that under-utilization.
That's not totally true. K8-K10 cores can do 3 register additions, subtractions, shifts, or moves per cycle, while Bulldozer can do only two. In K8-K10 each AGU was fused with an ALU (this is the reason why out-of-order loads/stores were impossible on those architectures), but K8-K10 were still able to execute more than 2 (2.7) arithmetic instructions with a memory operand per cycle.
Here is the throughput table:
http://gmplib.org/~tege/x86-timing.pdf
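To put the two claims side by side, here is a rough back-of-the-envelope sketch in Python. It only uses the peak per-cycle figures quoted in the posts above (nothing is measured). The `PEAK` table and the `relative_change` helper are made up for illustration.

```python
# Peak per-cycle figures as quoted in the posts above (assumed, not measured here).
PEAK = {
    "K10":       {"total_ops": 3, "alu_ops": 3},  # 3 ALU/AGU pairs
    "Bulldozer": {"total_ops": 4, "alu_ops": 2},  # 2 ALUs + 2 AGUs per integer core
}

def relative_change(new, old):
    """Fractional change going from `old` to `new`."""
    return (new - old) / old

total_change = relative_change(PEAK["Bulldozer"]["total_ops"], PEAK["K10"]["total_ops"])
alu_change = relative_change(PEAK["Bulldozer"]["alu_ops"], PEAK["K10"]["alu_ops"])

print(f"total ops per cycle:        {total_change:+.0%}")
print(f"register ALU ops per cycle: {alu_change:+.0%}")
```

So the same 2+2 layout that gives 33% more total issue slots also gives 33% fewer pure register ALU slots per cycle. Which one matters depends on the instruction mix.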