I think the main reason is thread synchronization - the shared L3 cache, the IMC, and all cores being on the same chip.
My background - I am a programmer (server side).
When a complex program (a good game, a server, ...) is multithreaded, it needs to synchronize access to some resources which are exclusive (only a single thread can use them at a time). Synchronization is usually implemented with spin locks or other techniques that use shared memory.
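For a rough idea of what that looks like in practice, here is a minimal sketch in C of two threads sharing one exclusive resource (a counter), serialized with a POSIX spin lock. The names and iteration count are just for illustration:

/* Minimal sketch (illustrative names only): two threads incrementing one
   shared counter, serialized with a POSIX spin lock. */
#include <pthread.h>
#include <stdio.h>

static pthread_spinlock_t lock;
static long shared_counter = 0;        /* the exclusive resource */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_spin_lock(&lock);      /* only one thread at a time gets past this */
        shared_counter++;              /* critical section */
        pthread_spin_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld\n", shared_counter);
    pthread_spin_destroy(&lock);
    return 0;
}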
On a Phenom, the spin lock gets cached in the shared L3 cache and stays there - it is accessed very often by many threads and likely by all cores.
Occasionally it moves from L3 to L2 to L1 and back down through L2 to L3 as a thread/core spins on it.
On a C2D/C2Q the spin lock mostly lives in memory and moves around between the caches of the individual cores. When 2 cores try to spin on it, it keeps going back out to memory, so it moves Memory - L2 - L1 - L2 - Memory.
So when 2 cores spin on the same lock, it bounces back and forth between them: on a Phenom it only has to go up to the L3, on a C2D/C2Q it has to go all the way out to memory.
The L3 is much faster than memory - much lower latency - so contention is resolved a lot faster.
In benchmarks, all threads are just simple copies of each other with no synchronization between them, so this is not a factor. But a real application will require synchronization and will pay the penalty described above.
Add to this the lack of an integrated memory controller (IMC) on the C2D/C2Q and you get an even bigger extra penalty.
Extra explanation:
A spin lock is when a thread keeps trying to acquire the lock until it succeeds - it continuously spins on the lock until it manages to acquire it, using atomic instructions like TestAndSet.
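For illustration, here is a rough sketch of such a lock in C using C11 atomics - atomic_flag_test_and_set is essentially the TestAndSet primitive mentioned above; spin_lock/spin_unlock are names made up for this example:

/* Rough sketch of a test-and-set spin lock using C11 atomics. */
#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;   /* clear = free, set = held */

static void spin_lock(void)
{
    /* Atomically set the flag and get its previous value; if it was already
       set, another thread holds the lock, so keep spinning. Every retry is a
       read-modify-write on the lock's cache line. */
    while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire))
        ;
}

static void spin_unlock(void)
{
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}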
Below is an example of 2 threads spinning on a lock and how the lock (an int - 32/64 bit) moves around.
(L1A is L1 cache of core A, L1B for core B, ...).
Assuming the same latencies for Phenom & C2Q: L1 - 3 cycles, L2 - 15, L3 - 48, Memory - 150
Phenom:
Core A spins: Mem -> L3 -> L2A -> L1A -> Core A -> L1A (total 150 cycles)
Core A spins: L1A -> Core A -> L1A (3 cycles)
Core B spins: L1A -> L2A -> L3 -> L2B -> L1B -> Core B -> L1B (48 cycles)
Core B spins: L1B -> Core B -> L1B (3 cycles)
Core A spins: L1B -> L2B -> L3 -> L2A -> L1A -> Core A -> L1A (48 cycles)
C2Q:
Core A spins: Mem -> L2A -> L1A -> Core A -> L1A (total 150 cycles)
Core A spins: L1A -> Core A -> L1A (3 cycles)
Core B spins: L1A -> L2A -> 2xFSB/Memory -> L2B -> L1B -> Core B -> L1B (150 cycles)
Core B spins: L1B -> Core B -> L1B (3 cycles)
Core A spins: L1B -> L2B -> 2xFSB/Memory -> L2A -> L1A -> Core A -> L1A (150 cycles)
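Adding up the five steps in each trace (using only the illustrative latencies assumed above):
Phenom: 150 + 3 + 48 + 3 + 48 = 252 cycles
C2Q: 150 + 3 + 150 + 3 + 150 = 456 cycles
And every further hand-off between cores costs about 48 cycles on the Phenom versus about 150 on the C2Q.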
Imagine how much waiting happens when 2 cores spin on the same lock while it is held by a 3rd thread. The lock cannot stay cached in one core for more than 1-2 instructions, because both cores want exclusive access to its cache line...
Just count how many clocks are required every time the lock moves from one core to another... The lack of a shared cache becomes a very big penalty, and the FSB adds its own penalty on top of that...
Of course, a good program will use as few locks as possible, but they are still required...
A single-threaded program doesn't have this problem - most of its frequently used data will move into the L1 or L2 cache and stay there, giving the C2D higher performance (because of its higher clock frequency).
PS: Hope it makes sense... sorry for being so long