Thread: Explanation of Lower Gflops with HT Enabled

  1. #1
    Xtreme Enthusiast
    Join Date
    Nov 2009
    Location
    Bloomfield Evergreen
    Posts
    607

    Explanation of Lower Gflops with HT Enabled

    Many of you must have noticed that disabling HT can yield more Gflops in LinX. Some of you already know the reason. I have done a little research into this myself.

    I wrote some simple MATLAB code to measure the Gflops of my 980X by timing large matrix multiplications. I noticed that MATLAB uses only 6 of the 12 logical cores, delivering 40 Gflops with my code. The CPU load shown in Windows Task Manager is capped at 50%.
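
A rough sketch of this kind of measurement, written in Python rather than MATLAB. It assumes the usual 2n^3 operation count for an n x n matrix product. The names `matmul`, `gflops` and `measure` are made up for illustration and are not from the original code. The pure-Python multiply is far slower than an MKL-backed one, so only the method carries over, not the numbers.

```python
import time

def matmul(a, b):
    """Naive n x n matrix product on lists of lists (illustration only)."""
    bt = list(zip(*b))  # transpose b so each column is a single tuple
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def gflops(n, seconds):
    """Convert a timing into Gflops, assuming 2*n**3 floating-point operations."""
    return 2.0 * n**3 / seconds / 1e9

def measure(n=64, repeats=3):
    """Time an n x n multiply a few times and report the best rate seen."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        matmul(a, b)
        best = min(best, time.perf_counter() - start)
    return gflops(n, best)

if __name__ == "__main__":
    print(f"{measure():.4f} Gflops")
```

Taking the best of several runs reduces noise from whatever else the machine is doing during the timing.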

    Then I ran two instances of my MATLAB code concurrently. With two instances, all 12 logical cores are used and Windows Task Manager shows 100% CPU load. However, each instance delivers only 15 Gflops, so all 12 logical cores together manage only 30 Gflops, which is slower than the 40 Gflops from 6 logical cores.
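
The comparison can be spelled out with the figures quoted in this post. This is only the arithmetic, with a hypothetical helper name (`aggregate_gflops`), not a new measurement.

```python
def aggregate_gflops(instances, per_instance):
    """Total throughput when several identical instances run at once."""
    return instances * per_instance

one_instance = aggregate_gflops(1, 40)   # 6 logical cores busy
two_instances = aggregate_gflops(2, 15)  # all 12 logical cores busy

# Fraction of throughput lost by going from one instance to two.
slowdown = 1 - two_instances / one_instance
print(two_instances, f"{slowdown:.0%}")  # 30 25%
```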

    Note that MATLAB is built on the Intel Math Kernel Library (MKL). Intel states:

    Simultaneous MultiThreading (SMT) or Hyper-Threading Technology (HT Technology) is especially effective when each thread is performing different types of operations and when there are under-utilized resources on the processor. However, Intel MKL fits neither of these criteria because the threaded portions of the library execute at high efficiencies using most of the available resources and perform identical operations on each thread.
    Source: http://software.intel.com/en-us/arti...abled-systems/

    In other words, to get more Gflops from LinX, it is better to disable HT.

  2. #2
    Xtreme Member
    Join Date
    Mar 2008
    Location
    Sweden
    Posts
    365
    I have also wondered why HT on gives lower Gflops compared to HT off. I was too lazy to research it myself, but thanks to you I have the answer. Thanks, much appreciated.

  3. #3
    Registered User
    Join Date
    May 2005
    Location
    shanghai, pr china
    Posts
    61
    IMHO cache thrashing is the major cause of your performance drop, because you are running two instances. Each instance is probably unaware of the other and tries to use the full cache.
    (Correct me if that is not the case, because I have never used MATLAB.)

    It would be a different story for a single instance of Linpack.

  4. #4
    Xtreme Member
    Join Date
    Sep 2009
    Posts
    190
    This was discussed on TPU. http://forums.techpowerup.com/showth...=94721&page=12

    Your quote from Intel sums it up very well. Disabling HT is probably easier, but IMO Linpack should run fine with HT on if you limit the number of software threads to the number of physical cores and make sure each Linpack thread runs on a different physical core.
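
A minimal sketch of the "one thread per physical core" idea. It assumes that logical processors 2k and 2k+1 are the two HT siblings of physical core k, which is a common enumeration but not guaranteed, so check it on your own system. The helper names `one_per_core` and `affinity_mask` are hypothetical. The code only computes which logical processors to use and the matching bitmask; it does not set any affinity itself.

```python
def one_per_core(n_logical, threads_per_core=2):
    """Pick the first logical CPU of each physical core.

    Assumes HT siblings are numbered adjacently (0/1, 2/3, ...).
    """
    return list(range(0, n_logical, threads_per_core))

def affinity_mask(cpus):
    """Build a bitmask with one bit set per chosen logical CPU."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask

cpus = one_per_core(12)                   # a 6-core, 12-thread CPU such as the 980X
print(cpus, hex(affinity_mask(cpus)))     # [0, 2, 4, 6, 8, 10] 0x555
```

The number of chosen processors (6 here) is also the thread count to give Linpack, so each software thread can have a physical core to itself.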

    Also IMO Windows Task Manager shows the scheduling load of each logical processor rather than the real CPU load. In your example, 6 logical processors are busy ~100% of the time and the other 6 are idle at ~0%, so even though the real CPU load is probably ~100%, Task Manager shows the average across all 12, which is 50%.
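
The 50% figure is just an average over logical processors. A small illustration, assuming idealized loads of exactly 100% and 0% and a made-up helper name (`average_load`):

```python
def average_load(per_logical_cpu):
    """Mean scheduling load across all logical processors, in percent."""
    return sum(per_logical_cpu) / len(per_logical_cpu)

# One busy and one idle logical processor on each of the 6 physical cores.
loads = [100, 0] * 6
print(average_load(loads))  # 50.0
```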

    HTT is a way of using execution units that would otherwise sit idle on each core. When one of the two threads on a core already uses ~100% of them, there is little left for the second thread on that core to gain.

    I don't think you will find cache thrashing to be the major cause, although it can have an effect. Simply run 2 Linpack threads on the same core with HT, then run 2 Linpack threads on separate cores, and you should see a ~100% improvement in Gflops. Be careful with how Windows maps thread affinity to physical cores and logical threads.

  5. #5
    Xtreme Cruncher
    Join Date
    Oct 2007
    Posts
    1,638
    The strangest LinX Gflops thing I've seen is changing voltages affecting Gflops. I also found that setting voltages on my Classified with E-LEET gave different Gflops than setting the same voltages in the BIOS.

  6. #6
    Xtreme Enthusiast
    Join Date
    Nov 2009
    Location
    Bloomfield Evergreen
    Posts
    607
    It's the same for a single instance of Linpack. Each thread still needs to use the cache, and the threads compete for it. The problem is that HT does not help SIMD-heavy code, so one thread gets blocked while the other does the computing.
