Quote: Originally Posted by kl0012
That's not totally true. K8-K10 cores can do 3 register additions, subtractions, shifts, or moves per cycle, while Bulldozer can do only two. In K8-K10 the AGU was fused with the ALU (this is the reason why out-of-order loads/stores were impossible on those architectures), but K8-K10 were still able to execute more than 2 (2.7) arithmetic instructions with a memory operand per cycle.
Here is the throughput table:
http://gmplib.org/~tege/x86-timing.pdf
You also had separate schedulers for math and address ops in 10h; now you have a unified scheduler for both. The problem with 10h is that the ALUs were underutilized (although in theory, as you show, it was quite capable), and BD is supposed to fix that bottleneck.
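
To put the quoted peak numbers side by side, here is a rough back-of-the-envelope sketch in Python. It assumes fully independent instructions and no other bottleneck (decode, retire, cache), which is the ideal case, not what 10h typically achieved. The names `PEAK_IPC` and `min_cycles` are made up for illustration, and the 1200-instruction count is arbitrary.

```python
import math

# Peak per-core rates as quoted in this thread (instructions per cycle).
# These are theoretical peaks, not measured sustained rates.
PEAK_IPC = {
    "K10 reg-reg ALU": 3.0,
    "K10 ALU with memory operand": 2.7,
    "Bulldozer reg-reg ALU": 2.0,
}

def min_cycles(n_ops, ipc):
    """Lower bound on cycles to issue n_ops independent instructions at a peak rate."""
    return math.ceil(n_ops / ipc)

for name, ipc in PEAK_IPC.items():
    print(f"{name}: at least {min_cycles(1200, ipc)} cycles for 1200 independent ops")
```

This only compares the theoretical ceilings; how close real code gets to them depends on how well the schedulers keep the ALUs fed, which is the underutilization point above.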