Thread: Clarify this for me, please

  1. #11
    Registered User
    Join Date: Apr 2008
    Posts: 12

    Shared Cache

    I think the main reason is thread synchronization - the shared L3 cache, the IMC, and all cores being on the same die.

    My background - I am a programmer (server side).

    When a complex program (a good game, a server, ...) is multithreaded, it needs to synchronize access to resources that are exclusive (only a single thread can use them at a time). Synchronization is usually implemented with spin locks or other techniques built on shared memory.
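
    As a rough sketch of what I mean (the function and variable names are just made up for the example, assuming a pthreads program in C), every thread has to take the lock before it touches the shared resource:

    #include <pthread.h>

    /* one exclusive resource, one lock protecting it */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long shared_counter = 0;

    void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);    /* wait here if another thread holds the lock */
            shared_counter++;             /* exclusive access to the shared resource    */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    Every thread running worker() keeps hitting that one lock, and that lock lives at one shared memory location - which is exactly the cache line that ping-pongs between cores below.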

    On a Phenom, the spin lock gets cached in the shared L3 and stays there - it is accessed very often by many threads and likely by all cores.
    Occasionally it moves from L3 up through L2 into L1 and back down through L2 to L3 when a thread/core spins on it.

    On a C2D/C2Q the spin lock moves around between the caches of the individual cores, but when 2 cores try to spin on it, it effectively lives in memory. So it moves Memory - L2 - L1 - L2 - Memory.

    So, when 2 cores spin on the same lock, it bounces between them. On the Phenom it only has to go up to the L3; on the C2D/C2Q it has to go all the way out to memory.
    The L3 has much lower latency than memory, so contention is resolved a lot faster.

    In benchmarks, all threads are just simple copies of each other with no synchronization between them, so this is not a factor. But a real application will require synchronization and will pay the penalty described above.

    Add to this the lack of an IMC on the C2D/C2Q and the extra penalty gets even bigger.

    Extra explanation:

    A spin lock is when a thread keeps trying to acquire the lock until it succeeds - it continuously spins on the lock, using atomic instructions like TestAndSet, until it acquires it.
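
    A minimal sketch of such a spin lock in C, using the C11 atomics (spin_lock/spin_unlock are illustrative names, not from any particular library - real code would normally use an existing lock implementation):

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;   /* the shared lock word */

    void spin_lock(void)
    {
        /* atomic test-and-set: returns the old value and sets the flag in one step */
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            ;   /* keep spinning - every retry re-reads the lock's cache line */
    }

    void spin_unlock(void)
    {
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }

    Each test-and-set needs exclusive ownership of the lock's cache line, which is why the line keeps moving between the cores in the traces below.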

    Below is an example of 2 threads spinning on a lock and how the lock (an int - 32/64 bit) moves around.
    (L1A is the L1 cache of core A, L1B that of core B, ...).
    Assuming the same latencies for the Phenom & C2Q: L1 - 3 cycles, L2 - 15, L3 - 48, memory - 150

    Phenom:
    Core A spins: Mem -> L3 -> L2A -> L1A -> Core A -> L1A (total 150 cycles)
    Core A spins: L1A -> Core A -> L1A (3 cycles)
    Core B spins: L1A -> L2A -> L3 -> L2B -> L1B -> Core B -> L1B (48 cycles)
    Core B spins: L1B -> Core B -> L1B (3 cycles)
    Core A spins: L1B -> L2B -> L3 -> L2A -> L1A -> Core A -> L1A (48 cycles)

    C2Q:
    Core A spins: Mem -> L2A -> L1A -> Core A -> L1A (total 150 cycles)
    Core A spins: L1A -> Core A -> L1A (3 cycles)
    Core B spins: L1A -> L2A -> 2xFSB/Memory -> L2B -> L1B -> Core B -> L1B (150 cycles)
    Core B spins: L1B -> Core B -> L1B (3 cycles)
    Core A spins: L1B -> L2B -> 2xFSB/Memory -> L2A -> L1A -> Core A -> L1A (150 cycles)

    Imagine how much waiting happens when 2 cores spin on the same lock while it is held by a 3rd thread. The lock cannot stay cached in one core for more than 1-2 instructions because both cores want exclusive access to it...
    Just count how many clocks it takes every time the lock moves from one core to another... Having no shared cache becomes a very big penalty. Add the FSB penalty on top of that...
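
    For example, adding up the five transfers in the traces above (using the assumed latencies, so only a rough estimate):

    Phenom: 150 + 3 + 48  + 3 + 48  = 252 cycles
    C2Q:    150 + 3 + 150 + 3 + 150 = 456 cycles

    And every further handoff between cores costs about 48 cycles on the Phenom versus about 150 (plus FSB traffic) on the C2Q.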

    Of course, a good program will use as few locks as possible, but they are still required...
    A single-threaded program doesn't have this problem - most of the frequently used data moves into the L1 or L2 cache and stays there, giving the C2D higher performance (because of its clock frequency).

    PS: Hope it makes sense... sorry for being so long
    Last edited by Pla123; 05-13-2008 at 07:06 PM.
    AMD Phenom 9850 BE
    DFI LP UT 790FX-M2R
    Sapphire Toxic 3870
    2x2gb OCZ Reaper HPC DDR2-1066
    2xSamsung 320GB T-321 16MB cache (80% Raid 0/20% Raid 1)
    Xigmatec MC751 750W 80 Plus Modular
    Antec P182
