Quote: Originally Posted by gOJDO
The L3 size improves the performance in Super Pi for sure, but I doubt the whole 12% comes from the L3 size alone. Since I can't open the page, I wonder what the NB and L3 frequencies are. I hope for more benches soon.
Maybe the AMD engineers tweaked the K10 core a little so that loads can go ahead of stores, and they finally got OoO load execution right.

It has 3 pipelines, just like Agena. As for the pipeline length, I'm pretty sure it is the same as on Agena, 12/18 ALU/FPU stages.
Actually it has 2 pipelines: one integer and one floating point, with separate schedulers for the integer/memory and floating-point pipelines. These pipelines have 12 stages for integer and 17 stages for floating point.
K10 has 3 integer units, 3 AGU units and 3 FP units.
Core 2 also has 3 128-bit wide FPUs, 3 ALUs and 2 AGUs, but it can decode up to 4 instructions per clock, usually 3 arithmetic + 1 memory, because the Intel PR machinery says that they have a 4-way CPU and the other guys don't. :d
I can't tell how much the 4-way decoder affects performance on Core 2, but I am sure the main reason for its good performance is good prefetch in combination with a big L2 cache. Also, the low-latency L2, inclusive architecture, good branch predictor and high-associativity L1 may bring a few percent of performance.
What K10 needs is higher core, L3 and NB clocks, lower cache latencies, wider cache and some architectural improvements.
I agree with that, but it also needs better prefetch. When the CPU has to fetch data from memory, IPC drops significantly. With better prefetch algorithms the CPU can get that data from L1, L2 or L3 without such a big performance drop. If the CPU doesn't have prefetch algorithms that smart, an IMC, or even a better IMC, can't help there.
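
To put rough numbers on that IPC drop, here is a back-of-the-envelope sketch using the textbook stall model (effective CPI = base CPI + memory references per instruction x miss rate x miss penalty). The function name and every input value are made-up illustration figures, not measurements of K10 or Core 2; the better prefetcher is modeled simply as a lower miss rate.

```python
def effective_ipc(base_ipc, mem_refs_per_instr, miss_rate, miss_penalty_cycles):
    """Return IPC once average memory stall cycles per instruction are added.

    Simple textbook model: it ignores overlap between misses and
    out-of-order execution hiding part of the latency.
    """
    base_cpi = 1.0 / base_ipc
    stall_cpi = mem_refs_per_instr * miss_rate * miss_penalty_cycles
    return 1.0 / (base_cpi + stall_cpi)


# Hypothetical inputs (assumptions, not benchmark data):
# base IPC of 2.0, 0.3 memory references per instruction,
# 200 cycles to reach DRAM on a miss in every cache level.
no_prefetch = effective_ipc(2.0, 0.3, 0.02, 200)     # 2% of accesses go to DRAM
with_prefetch = effective_ipc(2.0, 0.3, 0.005, 200)  # prefetch cuts that to 0.5%

print(f"IPC with weak prefetch:   {no_prefetch:.2f}")
print(f"IPC with better prefetch: {with_prefetch:.2f}")
```

Even in this crude model the miss penalty is paid in full on every miss, so a faster memory controller only shrinks the penalty a bit, while a prefetcher that turns misses into cache hits removes the stall term almost entirely.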