
Thread: New Multi-Threaded Pi Program - Faster than SuperPi and PiFast

  1. #651
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Ontario
    Posts
    349
    HTT is an interesting oddity. Sometimes it works really well. Otherwise, not so much. And there doesn't really seem to be any general trend one way or another.

    Nice system. I'm still hoping to go up to 32 cores and 128 GB of RAM by June, but I don't know about that as I'm also planning on finishing off my master's. (I don't really NEED 32 cores right now, but it would be a fun (and expensive) toy to play with.)
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  3. #653
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    OMG... I just found a HUGE bug in the program... I feel like Intel now. lol
    It affects Log(2), Log(10), and the Euler-Mascheroni Constant for very large computations...

    It's probably been there since v0.4.1... One of the optimizations I did botched the Arc Hyperbolic Cotangent code and affects everything that depends on it.

    Hardly anyone (including me) runs these 3 constants, hence why the bug has been allowed to linger for sooooooo long.
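    (For anyone curious what arccoth has to do with the log constants: they're computed from Machin-like combinations of the inverse hyperbolic cotangent, whose series converges quickly for large arguments. Here's a toy double-precision sketch of the idea - standard identities only, NOT y-cruncher's actual code:)
    Code:
    #include <cstdio>
    #include <cmath>
    
    // Taylor series: arccoth(x) = sum_{k>=0} 1 / ((2k+1) * x^(2k+1)), valid for |x| > 1.
    double arccoth(double x) {
        double power = 1.0 / x;  // holds x^-(2k+1)
        double sum = 0.0;
        for (int k = 0; ; ++k) {
            double term = power / (2 * k + 1);
            if (term < 1e-18) break;  // double-precision cutoff
            sum += term;
            power /= x * x;
        }
        return sum;
    }
    
    int main() {
        // Identity: arccoth(x) = 0.5 * ln((x+1)/(x-1)), so ln(2) = 2 * arccoth(3).
        printf("ln(2) ~= %.15f\n", 2.0 * arccoth(3.0));
        printf("ref     %.15f\n", std::log(2.0));
        return 0;
    }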


    Damn... I'm gonna need to take my SB rig off WCG for a few days to fix it...



    EDIT:
    My Harpertown rig should finish verifying the fix in a couple of days... I'll probably release the patched version if it passes.
    The more comprehensive tests will be done simultaneously on my Harpertown rig and my SB rig. But since my SB rig doesn't have as much ram and disk bandwidth, I'm starting it now. My Harpertown rig will probably catch up in no time - hopefully they'll finish at the same time 4 - 5 days from now.
    Last edited by poke349; 02-18-2011 at 01:22 PM.
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  4. #654
    Xtreme Enthusiast
    Join Date
    Sep 2007
    Location
    Coimbra - Portugal
    Posts
    699
    Kicked an i980x's ass

    St0ned - i2600k @ 5.1 GHz, 8 GB RAM @ 2133 MHz 10-12-10-28 1T

    Code:
    Validation Version:    1.1
    
    Program:               y-cruncher - Gamma to the eXtReMe!!!     ( www.numberworld.org )
                           Copyright 2008-2011 Alexander J. Yee    ( a-yee@northwestern.edu )
    
    
    User:                  None Specified - You can edit this in "Username.txt".
    
    
    Processor(s):          Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
    Logical Cores:         8
    Physical Memory:       8,569,065,472 bytes  ( 8.00 GB )
    CPU Frequency:         3,400,077,024 Hz
    
    Program Version:       0.5.5 Build 9178 Alpha (x64 AVX - Windows ~ Hina)
    Constant:              Pi
    Algorithm:             Chudnovsky Formula
    Decimal Digits:        1,000,000,000
    Hexadecimal Digits:    Disabled
    Threading Mode:        8 threads
    Computation Mode:      Ram Only
    Swap Disks:            0
    Working Memory:        4.75 GB
    
    Start Date:            Fri Feb 18 22:06:04 2011
    End Date:              Fri Feb 18 22:11:04 2011
    
    Computation Time:      284.704 seconds
    Total Time:            300.343 seconds
    
    CPU Utilization:           742.08 %
    Multi-core Efficiency:     92.76 %
    
    Last Digits:
    6434543524 2766553567 4357021939 6394581990 5483278746  :  999,999,950
    7139868209 3196353628 2046127557 1517139511 5275045519  :  1,000,000,000
    
    Timer Sanity Check:        Passed
    Frequency Sanity Check:    Passed
    ECC Recovered Errors:      0
    Checkpoint From:           None
    
    ----
    
    Checksum:   890fdff76350813617e46f49a997a7ee37d4c700fd4c643b19a26237424fa2e7
    Pic:



    Ok, after the 1st nice result, I went in for the kill. Yes, I was a bit greedy, but hey, can you blame me? Such a nice piece of silicon - I'm in love. Volts for both runs were the same.
    Code:
    Validation Version:    1.1
    
    Program:               y-cruncher - Gamma to the eXtReMe!!!     ( www.numberworld.org )
                           Copyright 2008-2011 Alexander J. Yee    ( a-yee@northwestern.edu )
    
    
    User:                  None Specified - You can edit this in "Username.txt".
    
    
    Processor(s):          Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
    Logical Cores:         8
    Physical Memory:       8,569,065,472 bytes  ( 8.00 GB )
    CPU Frequency:         3,400,035,360 Hz
    
    Program Version:       0.5.5 Build 9178 Alpha (x64 AVX - Windows ~ Hina)
    Constant:              Pi
    Algorithm:             Chudnovsky Formula
    Decimal Digits:        1,000,000,000
    Hexadecimal Digits:    Disabled
    Threading Mode:        8 threads
    Computation Mode:      Ram Only
    Swap Disks:            0
    Working Memory:        4.75 GB
    
    Start Date:            Fri Feb 18 22:20:59 2011
    End Date:              Fri Feb 18 22:25:53 2011
    
    Computation Time:      279.048 seconds
    Total Time:            294.340 seconds
    
    CPU Utilization:           740.41 %
    Multi-core Efficiency:     92.55 %
    
    Last Digits:
    6434543524 2766553567 4357021939 6394581990 5483278746  :  999,999,950
    7139868209 3196353628 2046127557 1517139511 5275045519  :  1,000,000,000
    
    Timer Sanity Check:        Passed
    Frequency Sanity Check:    Passed
    ECC Recovered Errors:      0
    Checkpoint From:           None
    
    ----
    
    Checksum:   c496b0ffc62ffcdf077409de852166e56c13134b07ed842710e229eb0c670dfc
    Pics!!:



    2% improvement - oh well, could have been worse :P
    Last edited by st0ned; 02-18-2011 at 02:41 PM.

  5. #655
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by st0ned View Post
    Kicked an i980x's ass

    St0ned - i2600k @ 5.1 GHz, 8 GB RAM @ 2133 MHz 10-12-10-28 1T


    Ok, after the 1st nice result, I went in for the kill. Yes, I was a bit greedy, but hey, can you blame me? Such a nice piece of silicon - I'm in love. Volts for both runs were the same.


    2% improvement - oh well, could have been worse :P

    Nice!!! 5+ GHz

    Well... what did you expect? 5.1 GHz -> 5.2 GHz is less than 2%


    I'll update the charts later.
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  6. #656
    Xtreme Cruncher
    Join Date
    Jun 2005
    Location
    Northern VA
    Posts
    1,285
    Hey poke, it's been a few days - any updates?
    It's not overkill if it works.


  7. #657
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by skycrane View Post
    Hey poke, it's been a few days - any updates?
    Nothing big so far. Been using it to run little code snippets to see why certain things run so well (or poorly) on AMDs.

    Currently, there's one critical subroutine in y-cruncher (and two more in development) that runs horrifically slow clock-for-clock on K10 compared to Core 2/ix. (This is all in-cache, so NUMA isn't a factor.)
    Been trying to figure out the cause of this and improve it if possible.

    For example, I'm finding out that it's nearly impossible to sustain the theoretical 1 SSE add + 1 SSE multiply per cycle even when I write meaningless code designed specifically to sustain that throughput.
    This is something I can easily achieve on all my post-Core 2 machines, but here I'm having a hard time even reaching 70% of the theoretical throughput - perhaps I misread the AMD K10 docs or something.
    (And I'm getting consistent results across 3 different compilers - so it's not ICC being "unfair". lol)
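    (To give a flavor of the kind of meaningless-but-revealing kernel I mean - independent add and multiply chains that should, in theory, dual-issue every cycle. This is a minimal sketch with made-up constants and iteration counts, not the actual test code; a real test needs enough independent chains to cover instruction latency:)
    Code:
    #include <emmintrin.h>  // SSE2 intrinsics
    #include <cstdio>
    
    int main() {
        // Two independent add chains and two independent multiply chains,
        // so the SSE adder and multiplier pipes can both stay busy.
        __m128d a0 = _mm_set1_pd(1.0), a1 = _mm_set1_pd(1.5);
        __m128d m0 = _mm_set1_pd(1.0), m1 = _mm_set1_pd(1.0);
        const __m128d addk = _mm_set1_pd(1e-9);
        const __m128d mulk = _mm_set1_pd(1.0 + 1e-10);
    
        for (long i = 0; i < 100000000; ++i) {
            a0 = _mm_add_pd(a0, addk);   // add pipe
            m0 = _mm_mul_pd(m0, mulk);   // mul pipe - ideally the same cycle
            a1 = _mm_add_pd(a1, addk);
            m1 = _mm_mul_pd(m1, mulk);
        }
    
        // Consume the results so the compiler can't delete the loops.
        double out[2];
        _mm_storeu_pd(out, _mm_add_pd(_mm_add_pd(a0, a1), _mm_mul_pd(m0, m1)));
        printf("%.6f\n", out[0]);
        return 0;
    }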

    In the meantime, I've been doing some NUMA ray-tracing on it with 2 other people. So it's my "learning process" for NUMA.
    Ray-tracing is obviously easier to parallelize and less memory intensive than something like computing Pi...

    One thing is: NUMA programming is HARD - a lot harder and a lot trickier than shared-memory or MPI...
    At first I thought it was as simple as duplicating resources into all the nodes, but that doesn't really work when it won't fit on a node.
    Last edited by poke349; 03-05-2011 at 04:31 PM. Reason: typo
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  8. #658
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Looks like I finally broke the SATA on my SB rig...
    That was fast - 2 months only.

    One of the 4 HDs dropped out on me. Then it refuses to get past the splash screen on boot.
    None of the HDs show up in the BIOS. Once I unplug all of them, everything is fine...


    Still waiting for that newegg RMA email...
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  9. #659
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Ontario
    Posts
    349
    Pity that the i2600K is only 4 cores and that most of the motherboards don't support (I think) > 32 GB.
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  10. #660
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by alpha754293 View Post
    Pity that the i2600K is only 4 cores and that most of the motherboards don't support (I think) > 32 GB.
    I'd be all over a Sandy Bridge machine with more than 32GB of ram...

    if it was affordable that is...


    The dual-socket Sandies don't come out till later this year - but I don't plan on building another rig for a long time.

    A 2 x 8-core Sandy with 128GB or 256GB of ram would be a pretty amazing code-testing machine... (more so than my Harpertown rig back in 2009)
    But that would probably cost more than my car...
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  11. #661
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Been playing with Linux and NUMA for some time now on the rig Skycrane sent me.

    If anyone (with a NUMA machine) is interested, you might be able to get a speedup if you install the NUMA control package (numactl) and play with the settings.

    Try running y-cruncher like this:

    numactl --interleave=all "./x64 SSE3.out"

    or whatever version is compatible.
    I haven't checked if there is a Windows equivalent.

    It doesn't "make" the program NUMA-aware, but it does seem to distribute the memory traffic more evenly across the cores.
    y-cruncher has a problem of having a huge amount of memory/interconnect traffic concentrated on the "primary" NUMA node. Interleaving the memory seems to spread it out a bit.
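    (If you'd rather set the policy from inside a program than wrap the whole process, libnuma exposes the same interleaving programmatically. A minimal sketch - assumes the libnuma headers are installed and you link with -lnuma:)
    Code:
    #include <numa.h>   // libnuma; link with -lnuma
    #include <cstdio>
    
    int main() {
        if (numa_available() < 0) {
            puts("No NUMA support on this system.");
            return 1;
        }
        // Programmatic equivalent of `numactl --interleave=all` for one buffer:
        // pages are striped round-robin across all nodes.
        size_t bytes = 1ull << 30;  // 1 GB, illustrative
        void* buf = numa_alloc_interleaved(bytes);
        if (!buf) return 1;
        // ... hand `buf` to the computation ...
        numa_free(buf, bytes);
        return 0;
    }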
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  12. #662
    Xtremely Retired OC'er
    Join Date
    Dec 2006
    Posts
    1,084
    That's my upload: Validation - Pi - 500,000,000.
    Main computer, not OC'ed.

  13. #663
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Ontario
    Posts
    349
    Quote Originally Posted by poke349 View Post
    Been playing with Linux and NUMA for some time now on the rig Skycrane sent me.

    If anyone (with a NUMA machine) is interested, you might be able to get a speedup if you install the NUMA control package (numactl) and play with the settings.

    Try running y-cruncher like this:

    numactl --interleave=all "./x64 SSE3.out"

    or whatever version is compatible.
    I haven't checked if there is a Windows equivalent.

    It doesn't "make" the program NUMA-aware, but it does seem to distribute the memory traffic more evenly across the cores.
    y-cruncher has a problem of having a huge amount of memory/interconnect traffic concentrated on the "primary" NUMA node. Interleaving the memory seems to spread it out a bit.
    I don't think that there's numactl for Cygwin. I tried looking for it, and also tried compiling it from source, but I get the following error:

    Code:
    $ make
    Makefile:179: .depend: No such file or directory
    cc -MM -DDEPS_RUN -I. bitops.c libnuma.c distance.c memhog.c numactl.c numademo.
    c numamon.c shm.c stream_lib.c stream_main.c syscall.c util.c mt.c clearcache.c
    test/*.c > .depend.X && mv .depend.X .depend
    cc -g -Wall -O2  -I.   -c -o numactl.o numactl.c
    cc -g -Wall -O2  -I.   -c -o util.o util.c
    util.c: In function `memsize':
    util.c:80: warning: array subscript has type `char'
    cc -g -Wall -O2  -I.   -c -o shm.o shm.c
    shm.c: In function `attach_shared':
    shm.c:129: error: storage size of `st' isn't known
    shm.c:140: warning: implicit declaration of function `fstat64'
    shm.c:143: warning: implicit declaration of function `ftruncate64'
    shm.c:157: warning: implicit declaration of function `mmap64'
    shm.c:157: warning: assignment makes pointer from integer without a cast
    shm.c:129: warning: unused variable `st'
    make: *** [shm.o] Error 1
    Source from http://oss.sgi.com/projects/libnuma/
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  14. #664
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by alpha754293 View Post
    I don't think that there's numactl for Cygwin. I tried looking for it, and also tried compiling it from source, but I get the following error:

    Code:
    $ make
    Makefile:179: .depend: No such file or directory
    cc -MM -DDEPS_RUN -I. bitops.c libnuma.c distance.c memhog.c numactl.c numademo.
    c numamon.c shm.c stream_lib.c stream_main.c syscall.c util.c mt.c clearcache.c
    test/*.c > .depend.X && mv .depend.X .depend
    cc -g -Wall -O2  -I.   -c -o numactl.o numactl.c
    cc -g -Wall -O2  -I.   -c -o util.o util.c
    util.c: In function `memsize':
    util.c:80: warning: array subscript has type `char'
    cc -g -Wall -O2  -I.   -c -o shm.o shm.c
    shm.c: In function `attach_shared':
    shm.c:129: error: storage size of `st' isn't known
    shm.c:140: warning: implicit declaration of function `fstat64'
    shm.c:143: warning: implicit declaration of function `ftruncate64'
    shm.c:157: warning: implicit declaration of function `mmap64'
    shm.c:157: warning: assignment makes pointer from integer without a cast
    shm.c:129: warning: unused variable `st'
    make: *** [shm.o] Error 1
    Source from http://oss.sgi.com/projects/libnuma/
    I'm getting speedups of up to 30% using numactl interleave on Linux. The speedup depends on the size - for smaller sizes, it sometimes backfires.
    I'll accept any settings as valid benchmarks. Tuning the OS like this is part of the game, lol.

    As far as I can tell, only Windows Server has any sort of NUMA-awareness support. But you need Win Server anyway to get more than 2 sockets in the first place.
    There might be a setting somewhere in the OS that can be set to force interleaved memory allocation. If there isn't, maybe there's a special WinAPI malloc() function that does interleaved allocation. If I find it, I'll try it out.

    But in any case, interleaved memory allocation isn't a "solution" to NUMA.
    It lets the program get the full memory bandwidth, but it doesn't get rid of the interconnect contention and latency.

    The program will still need to be redesigned to run well on NUMA.
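    (For the record, the closest WinAPI call I'm aware of is VirtualAllocExNuma, which pins an allocation to a *preferred* node rather than interleaving it - so it's not a drop-in numactl replacement, but it's a starting point. A hypothetical sketch, with an arbitrary block size:)
    Code:
    #include <windows.h>
    #include <cstdio>
    
    int main() {
        ULONG highest = 0;
        GetNumaHighestNodeNumber(&highest);
    
        // Place one 64 MB block on each node explicitly (node-pinned,
        // not interleaved; the block size here is illustrative).
        const SIZE_T block = 64ull << 20;
        for (ULONG node = 0; node <= highest; ++node) {
            void* p = VirtualAllocExNuma(GetCurrentProcess(), NULL, block,
                                         MEM_RESERVE | MEM_COMMIT,
                                         PAGE_READWRITE, node);
            printf("node %lu -> %p\n", node, p);
        }
        return 0;
    }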


    EDIT:
    Got my B3 stepping today. I finally get my SATA ports back. (I still don't get how I managed to kill them in only 2 months...)
    Gonna have to re-test my OC tomorrow after the TIM sets. That thermal pad on the H50 was great, but it's only good for the first use; afterwards it's too uneven, so I had to scrape it all off and use some of my leftover Arctic Silver.
    Last edited by poke349; 03-29-2011 at 02:51 AM.
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  15. #665
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Ontario
    Posts
    349
    Quote Originally Posted by poke349 View Post
    I'm getting speedups of up to 30% using numactl interleave on Linux. The speedup depends on the size - for smaller sizes, it sometimes backfires.
    I'll accept any settings as valid benchmarks. Tuning the OS like this is part of the game, lol.

    As far as I can tell, only Windows Server has any sort of NUMA-awareness support. But you need Win Server anyway to get more than 2 sockets in the first place.
    There might be a setting somewhere in the OS that can be set to force interleaved memory allocation. If there isn't, maybe there's a special WinAPI malloc() function that does interleaved allocation. If I find it, I'll try it out.

    But in any case, interleaved memory allocation isn't a "solution" to NUMA.
    It lets the program get the full memory bandwidth, but it doesn't get rid of the interconnect contention and latency.

    The program will still need to be redesigned to run well on NUMA.


    EDIT:
    Got my B3 stepping today. I finally get my SATA ports back. (I still don't get how I managed to kill them in only 2 months...)
    Gonna have to re-test my OC tomorrow after the TIM sets. That thermal pad on the H50 was great, but it's only good for the first use; afterwards it's too uneven, so I had to scrape it all off and use some of my leftover Arctic Silver.
    Okay - correct me if I'm wrong, but I thought that NUMA meant that a CPU is fastest when it works on data that's closest to it?

    I know that I had to play around with NUMA settings for Folding@Home (at the behest of the folks at the forum over there), so that when I ran "coreinfo", it showed that each socket (on a quad-socket system) was on its own NUMA node.

    And I suppose you're right too - with fewer than quad sockets it probably doesn't matter as much (unless you're going in between nodes).

    Hmmm....*confused*
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  16. #666
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by alpha754293 View Post
    Okay - correct me if I'm wrong, but I thought that NUMA meant that a CPU is fastest when it works on data that's closest to it?

    I know that I had to play around with NUMA settings for Folding@Home (at the behest of the folks at the forum over there), so that when I ran "coreinfo", it showed that each socket (on a quad-socket system) was on its own NUMA node.

    And I suppose you're right too - with fewer than quad sockets it probably doesn't matter as much (unless you're going in between nodes).

    Hmmm....*confused*
    You are correct, but there's an even bigger problem with the program.
    y-cruncher allocates all the memory it needs in a single gigantic malloc(). The OS will put that entire memory block on the node that called the malloc(). So the result is that most (if not all) of the memory will be concentrated in a single node.

    The 4 nodes will all try to access that same memory - all of which is in one node. So not only do you have contention on the interconnect, but the aggregate memory bandwidth is only equal to that of one socket. (since all the memory is concentrated there)
    If the memory is interleaved, you still don't get rid of the interconnect traffic, but at least now you have 4 sockets of memory bandwidth.

    This problem does not occur in "normal" applications that do lots of small allocations. For these normal apps, the OS will bias all the allocations on the nodes that called them - so memory accesses are mostly local.
    But since y-cruncher does one massive malloc() at the beginning, the OS is not able to do this.
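    (To make the single-allocation pattern concrete, here's a toy version - one up-front malloc() that gets carved up by offset arithmetic, so every sub-"allocation" is just pointer math. The sizes and buffer names are invented for illustration:)
    Code:
    #include <cstdlib>
    #include <cstdio>
    
    int main() {
        // One gigantic allocation at startup. Under first-touch placement,
        // all of it ends up on the NUMA node of the thread that touches it.
        const size_t total = 1ull << 30;  // 1 GB, illustrative
        char* heap = static_cast<char*>(std::malloc(total));
        if (!heap) return 1;
    
        // "Allocations" are deterministic offsets into the block - the same
        // relative addresses on every run, with no locks involved.
        size_t offset = 0;
        auto sub_alloc = [&](size_t bytes) {
            char* p = heap + offset;
            offset += bytes;
            return p;
        };
    
        char* work  = sub_alloc(400ull << 20);  // e.g. multiply workspace
        char* carry = sub_alloc(200ull << 20);  // e.g. carry/scratch buffers
        printf("work at +%zu, carry at +%zu\n",
               size_t(work - heap), size_t(carry - heap));
    
        std::free(heap);
        return 0;
    }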

    There were several major reasons why I made the program do a single malloc() and its own memory management as opposed to letting the OS handle it.
    1. Memory allocation is expensive - (hundreds/thousands of cycles?)
    2. malloc() is not thread-safe. Placing locks around it is not scalable. Even if it is thread-safe (with the right compiler options), it's still implemented with locks.
    3. Memory allocation is not deterministic. The address that malloc() returns differs between program instances even with the exact same settings.
      The memory manager I put into the program is deterministic - even with multi-threading.
      It's a lot easier to hold together 100,000 lines of multi-threaded code if it's deterministic.


    I made this decision at around October 2008 or so (quite a while ago) - but still before the first version of y-cruncher. It's worked fine since then.

    For NUMA, this approach obviously backfires like crazy. It's still possible to make it work without giving up determinism or scalability - but it's messy. It's something I'm probably gonna put off for a few major releases while I redesign/rewrite the program.
    There's about 80,000 lines of (nearly) completed code that haven't been enabled yet because the new internal interface is incompatible with what y-cruncher currently uses.
    It's gonna take a while to migrate the program to that new interface.
    The NUMA layer will eventually be built on top of all the existing "normal" threading code - though I have yet to work out the details.
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  17. #667
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Ontario
    Posts
    349
    Quote Originally Posted by poke349 View Post
    You are correct, but there's an even bigger problem with the program.
    y-cruncher allocates all the memory it needs in a single gigantic malloc(). The OS will put that entire memory block on the node that called the malloc(). So the result is that most (if not all) of the memory will be concentrated in a single node.

    The 4 nodes will all try to access that same memory - all of which is in one node. So not only do you have contention on the interconnect, but the aggregate memory bandwidth is only equal to that of one socket. (since all the memory is concentrated there)
    If the memory is interleaved, you still don't get rid of the interconnect traffic, but at least now you have 4 sockets of memory bandwidth.

    This problem does not occur in "normal" applications that do lots of small allocations. For these normal apps, the OS will bias all the allocations on the nodes that called them - so memory accesses are mostly local.
    But since y-cruncher does one massive malloc() at the beginning, the OS is not able to do this.

    There were several major reasons why I made the program do a single malloc() and its own memory management as opposed to letting the OS handle it.
    1. Memory allocation is expensive - (hundreds/thousands of cycles?)
    2. malloc() is not thread-safe. Placing locks around it is not scalable. Even if it is thread-safe (with the right compiler options), it's still implemented with locks.
    3. Memory allocation is not deterministic. The address that malloc() returns differs between program instances even with the exact same settings.
      The memory manager I put into the program is deterministic - even with multi-threading.
      It's a lot easier to hold together 100,000 lines of multi-threaded code if it's deterministic.


    I made this decision at around October 2008 or so (quite a while ago) - but still before the first version of y-cruncher. It's worked fine since then.

    For NUMA, this approach obviously backfires like crazy. It's still possible to make it work without giving up determinism or scalability - but it's messy. It's something I'm probably gonna put off for a few major releases while I redesign/rewrite the program.
    There's about 80,000 lines of (nearly) completed code that haven't been enabled yet because the new internal interface is incompatible with what y-cruncher currently uses.
    It's gonna take a while to migrate the program to that new interface.
    The NUMA layer will eventually be built on top of all the existing "normal" threading code - though I have yet to work out the details.
    More stupid questions:

    will MPI or OMP help?

    How is it doing the multi-threading/multi-processing now anyways?

    Can threads do malloc()? (You might have actually said it already, but I'm a dummy non-programmer trying to wrap my head around programming concepts.)

    I've always been under the impression/belief that multi-processing can happen ONLY if it spawns slave processes. So the whole idea of multi-threading != multi-processing in my mind (or at least it doesn't seem to be as efficient or doesn't use the full resources available to it).
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  18. #668
    Xtreme Member
    Join Date
    Apr 2010
    Posts
    109
    Had some spare time to run this on one of my older rigs, just for fun

    It's an HP ML350G4, as shown in my sig.



    25M
    Code:
    Validation Version:    1.1
    
    Program:               y-cruncher - Gamma to the eXtReMe!!!     ( www.numberworld.org )
                           Copyright 2008-2011 Alexander J. Yee    ( a-yee@northwestern.edu )
    
    
    User:                  None Specified - You can edit this in "Username.txt".
    
    
    Processor(s):          Intel(R) Xeon(TM) CPU 3.00GHz
    Logical Cores:         4
    Physical Memory:       4,227,399,680 bytes  ( 3.93 GB )
    CPU Frequency:         3,000,194,895 Hz
    
    Program Version:       0.5.5 Build 9179 (fix 1) (x64 SSE3 - Windows)
    Constant:              Pi
    Algorithm:             Chudnovsky Formula
    Decimal Digits:        25,000,000
    Hexadecimal Digits:    Disabled
    Threading Mode:        4 threads
    Computation Mode:      Ram Only
    Swap Disks:            0
    Working Memory:        207 MB
    
    Start Date:            Wed Mar 30 14:03:49 2011
    End Date:              Wed Mar 30 14:04:32 2011
    
    Computation Time:      40.462 seconds
    Total Time:            43.022 seconds
    
    CPU Utilization:           373.11 %
    Multi-core Efficiency:     93.27 %
    
    Last Digits:
    3803750790 9491563108 2381689226 7224175329 0045253446  :  24,999,950
    0786411592 4597806944 2455112852 2554677483 6191884322  :  25,000,000
    
    Timer Sanity Check:        Passed
    Frequency Sanity Check:    Passed
    ECC Recovered Errors:      0
    Checkpoint From:           None
    
    ----
    
    Checksum:   67dec58e652f6b1d0fea1feafad483d2799e90c7c5db8b200c858b7f86d6f727
    50M
    Code:
    Validation Version:    1.1
    
    Program:               y-cruncher - Gamma to the eXtReMe!!!     ( www.numberworld.org )
                           Copyright 2008-2011 Alexander J. Yee    ( a-yee@northwestern.edu )
    
    
    User:                  None Specified - You can edit this in "Username.txt".
    
    
    Processor(s):          Intel(R) Xeon(TM) CPU 3.00GHz
    Logical Cores:         4
    Physical Memory:       4,227,399,680 bytes  ( 3.93 GB )
    CPU Frequency:         3,000,202,463 Hz
    
    Program Version:       0.5.5 Build 9179 (fix 1) (x64 SSE3 - Windows)
    Constant:              Pi
    Algorithm:             Chudnovsky Formula
    Decimal Digits:        50,000,000
    Hexadecimal Digits:    Disabled
    Threading Mode:        4 threads
    Computation Mode:      Ram Only
    Swap Disks:            0
    Working Memory:        316 MB
    
    Start Date:            Wed Mar 30 14:06:08 2011
    End Date:              Wed Mar 30 14:07:43 2011
    
    Computation Time:      89.224 seconds
    Total Time:            94.464 seconds
    
    CPU Utilization:           380.51 %
    Multi-core Efficiency:     95.12 %
    
    Last Digits:
    4127897300 0153683630 8346732220 0943329365 1632962502  :  49,999,950
    5130045796 0464561703 2424263071 4554183801 7945652654  :  50,000,000
    
    Timer Sanity Check:        Passed
    Frequency Sanity Check:    Passed
    ECC Recovered Errors:      0
    Checkpoint From:           None
    
    ----
    
    Checksum:   1db010eae8d71aad16d348835969f377f22f7419e93594ee04695171dcf2eb5e
    100M
    Code:
    Validation Version:    1.1
    
    Program:               y-cruncher - Gamma to the eXtReMe!!!     ( www.numberworld.org )
                           Copyright 2008-2011 Alexander J. Yee    ( a-yee@northwestern.edu )
    
    
    User:                  None Specified - You can edit this in "Username.txt".
    
    
    Processor(s):          Intel(R) Xeon(TM) CPU 3.00GHz
    Logical Cores:         4
    Physical Memory:       4,227,399,680 bytes  ( 3.93 GB )
    CPU Frequency:         3,000,202,223 Hz
    
    Program Version:       0.5.5 Build 9179 (fix 1) (x64 SSE3 - Windows)
    Constant:              Pi
    Algorithm:             Chudnovsky Formula
    Decimal Digits:        100,000,000
    Hexadecimal Digits:    Disabled
    Threading Mode:        4 threads
    Computation Mode:      Ram Only
    Swap Disks:            0
    Working Memory:        536 MB
    
    Start Date:            Wed Mar 30 14:08:34 2011
    End Date:              Wed Mar 30 14:12:04 2011
    
    Computation Time:      200.057 seconds
    Total Time:            210.020 seconds
    
    CPU Utilization:           386.74 %
    Multi-core Efficiency:     96.68 %
    
    Last Digits:
    9948682556 3967530560 3352869667 7734610718 4471868529  :  99,999,950
    7572203175 2074898161 1683139375 1497058112 0187751592  :  100,000,000
    
    Timer Sanity Check:        Passed
    Frequency Sanity Check:    Passed
    ECC Recovered Errors:      0
    Checkpoint From:           None
    
    ----
    
    Checksum:   8f72a3e29cc2c459101624ad135c899f8894f4d30000bad9123c253fa1ba71ee
    250M
    Code:
    Validation Version:    1.1
    
    Program:               y-cruncher - Gamma to the eXtReMe!!!     ( www.numberworld.org )
                           Copyright 2008-2011 Alexander J. Yee    ( a-yee@northwestern.edu )
    
    
    User:                  None Specified - You can edit this in "Username.txt".
    
    
    Processor(s):          Intel(R) Xeon(TM) CPU 3.00GHz
    Logical Cores:         4
    Physical Memory:       4,227,399,680 bytes  ( 3.93 GB )
    CPU Frequency:         3,000,188,655 Hz
    
    Program Version:       0.5.5 Build 9179 (fix 1) (x64 SSE3 - Windows)
    Constant:              Pi
    Algorithm:             Chudnovsky Formula
    Decimal Digits:        250,000,000
    Hexadecimal Digits:    Disabled
    Threading Mode:        4 threads
    Computation Mode:      Ram Only
    Swap Disks:            0
    Working Memory:        1.25 GB
    
    Start Date:            Wed Mar 30 14:13:07 2011
    End Date:              Wed Mar 30 14:23:09 2011
    
    Computation Time:      576.015 seconds
    Total Time:            602.238 seconds
    
    CPU Utilization:           390.76 %
    Multi-core Efficiency:     97.69 %
    
    Last Digits:
    3673748634 2742427296 0219667627 3141599893 4569474921  :  249,999,950
    9958866734 1705167068 8515785208 0067520395 3452027780  :  250,000,000
    
    Timer Sanity Check:        Passed
    Frequency Sanity Check:    Passed
    ECC Recovered Errors:      0
    Checkpoint From:           None
    
    ----
    
    Checksum:   2d5e582a64c81add039e51df177d4a69a1f84aa395364b6915776ddfb6e1841f
    500M
    Code:
    Validation Version:    1.1
    
    Program:               y-cruncher - Gamma to the eXtReMe!!!     ( www.numberworld.org )
                           Copyright 2008-2011 Alexander J. Yee    ( a-yee@northwestern.edu )
    
    
    User:                  None Specified - You can edit this in "Username.txt".
    
    
    Processor(s):          Intel(R) Xeon(TM) CPU 3.00GHz
    Logical Cores:         4
    Physical Memory:       4,227,399,680 bytes  ( 3.93 GB )
    CPU Frequency:         3,000,192,863 Hz
    
    Program Version:       0.5.5 Build 9179 (fix 1) (x64 SSE3 - Windows)
    Constant:              Pi
    Algorithm:             Chudnovsky Formula
    Decimal Digits:        500,000,000
    Hexadecimal Digits:    Disabled
    Threading Mode:        4 threads
    Computation Mode:      Ram Only
    Swap Disks:            0
    Working Memory:        2.42 GB
    
    Start Date:            Wed Mar 30 14:25:08 2011
    End Date:              Wed Mar 30 14:47:20 2011
    
    Computation Time:      1,275.159 seconds
    Total Time:            1,331.560 seconds
    
    CPU Utilization:           392.47 %
    Multi-core Efficiency:     98.11 %
    
    Last Digits:
    3896531789 0364496761 5664275325 5483742003 7847987772  :  499,999,950
    5002477883 0364214864 5906800532 7052368734 3293261427  :  500,000,000
    
    Timer Sanity Check:        Passed
    Frequency Sanity Check:    Passed
    ECC Recovered Errors:      0
    Checkpoint From:           None
    
    ----
    
    Checksum:   10ef6c30486f7b58013d61d24782c8cda6cc98b3c21e5dbc19074b403e5e0b98
    I also ran a few on my DL380, but it's powered down ATM so I can't get the results. I think I did up to 1B there with 3 disks as swap.
    I'll take some time at some point and do some more runs on the ML350 using swap.
    For the DL320, I need to take the RAM out of the 380, since it only has 1 GB, and its 2 SATA 80 GB disks are in RAID 1, so those won't help... much.
    ________________
    Main:
    Phenom II x6 1090T BE|Crosshair IV Formula|Corsair 4x2GB DDR3|Sapphire HD5870|Adaptec 2405 + Hitachi Ultrastar 15k 450GB SAS, Toshiba MBD2147RC 146GB 10k SAS, Samsung F3 1TB, Seagate Barracuda Green 2TB 5900RPM, WD Black 2TB, Seagate Barracuda ST2000M001 2TB|Asus Xonar Essence ST + HD600|Corsair HX850|HPZR24w|Fractal Define XL Black|Windows 7 X64 Pro
    Backup/Storage server:
    HP Proliant ML350 G4|2 x Xeon "Nocona" 3GHz|4GB DDR1 ECC|Storage (SCSI): 3x10k 72GB + 10k 300GB + 15k 300GB + Ultrium460 tape drive|Storage (SATA): Adaptec 2810SA + 2 x WD Caviar 250GB RAID0 + Seagate 250GB|Windows Server 2008r2 Datacenter
    Other:
    HP Proliant DL380 G5|Xeon 5150|4GB FB DDR2 ECC|HP Smart Array P400-256MB cache|3x10k 146GB SAS in RAID 0 + 10k 146GB SAS|2x800W|ATi FireGL V7700|Samsung 226BW|Windows Server 2008r2 Enterprise
    HP DL320 G5|Xeon 3150 2.13GHz|1GB DDR2 ECC|2x80GB RAID 0|Windows Server 2008r2 Standard
    Laptop:
    HP 8560w|i5-2540M|2x4GB DDR3|AMD FirePro M5950|Samsung 840 Pro 256GB|Windows 7 X64 Pro

  19. #669
    Xtreme Member
    Join Date
    Mar 2010
    Location
    Germany
    Posts
    118
    I saw that nobody had yet uploaded a result for an AMD X6, so I searched my old screenshots and found one - maybe it helps you.

    http://www.abload.de/img/ycrunchercuti7d1.png

    Soya

  20. #670
    Xtreme Cruncher
    Join Date
    Apr 2005
    Location
    TX, USA
    Posts
    898
    Quote Originally Posted by poke349 View Post
    I'm getting speedups of up to 30% using numactl interleave on Linux. The speedup depends on the size - for smaller sizes, it sometimes backfires.
    I'll accept any settings as valid benchmarks. Tuning the OS like this is part of the game, lol.

    As far as I can tell, only Windows Server has any sort of NUMA-awareness support. But you need Win Server anyway to get more than 2 sockets in the first place.
    There might be a setting somewhere in the OS that can be set to force interleaved memory allocation. If there isn't, maybe there's a special WinAPI malloc() function that does interleaved allocation. If I find it, I'll try it out.

    But in any case, interleaved memory allocation isn't a "solution" to NUMA.
    It lets the program get the full memory bandwidth, but it doesn't get rid of the interconnect contention and latency.

    The program will still need to be redesigned to run well on NUMA.
    I've been meaning to post in here for a while, as I've got access to an 8-socket Barcelona machine (w/ ~80GB ram) here in our research group at school that I've tried running things on. The NUMA nature of the machine definitely causes sub-optimal scaling for the smaller workloads (those below 1B).

    I tried numactl for 1B and performance goes from 648sec (without) to 466sec (interleave all), so it does make a noticeable difference. Almost a 1.4X speedup in this case.

    For comparison (without numactl), 10B takes 5,379sec and 100M takes 110sec.

    Quote Originally Posted by poke349 View Post
    You are correct, but there's an even bigger problem with the program.
    y-cruncher allocates all the memory it needs in a single gigantic malloc(). The OS will put that entire memory block on the node that called the malloc(). So the result is that most (if not all) of the memory will be concentrated in a single node.

    The 4 nodes will all try to access that same memory - all of which is in one node. So not only do you have contention on the interconnect, but the aggregate memory bandwidth is only equal to that of one socket. (since all the memory is concentrated there)
    If the memory is interleaved, you still don't get rid of the interconnect traffic, but at least now you have 4 sockets of memory bandwidth.

    This problem does not occur in "normal" applications that do lots of small allocations. For these normal apps, the OS will bias all the allocations on the nodes that called them - so memory accesses are mostly local.
    But since y-cruncher does one massive malloc() at the beginning, the OS is not able to do this.

    There were several major reasons why I made the program do a single malloc() and its own memory management as opposed to letting the OS handle it.
    1. Memory allocation is expensive - (hundreds/thousands of cycles?)
    2. malloc() is not thread-safe. Placing locks around it is not scalable. Even if it is thread-safe (with the right compiler options), it's still implemented with locks.
    3. Memory allocation is not deterministic. The address that malloc() returns differs between program instances even with the exact same settings.
      The memory manager I put into the program is deterministic - even with multi-threading.
      It's a lot easier to hold together 100,000 lines of multi-threaded code if it's deterministic.
    Yeah, tying all of the memory allocation to one thread/node doesn't do so well with NUMA I imagine. Beyond just the interconnect bandwidth impact, coherency traffic must become a significant overhead as well.

    Determinism definitely is desirable with parallel programs, though in what capacity are you using the deterministic malloc()? Is it mainly for debug, or structured memory/pointer accesses? As for thread-safe, are you saying just using barriers to properly synchronize isn't good enough?

    At the expense of some memory capacity, could you have each thread malloc() a (relatively) small buffer to work out of, in the interest of creating some locality and doing burst transfers?
    Quote Originally Posted by alpha754293 View Post
    More stupid questions:

    will MPI or OMP help?

    How is it doing the multi-threading/multi-processing now anyways?

    Can threads do malloc()? (You might have actually said it already, but I'm a dummy non-programmer trying to wrap my head around programming concepts.)

    I've always been under the impression/belief that multi-processing can happen ONLY if it spawns slave processes. So the whole idea of multi-threading != multi-processing in my mind (or at least it doesn't seem to be as efficient or doesn't use the full resources available to it).
    I'm curious about the current implementation as well.

    Is it all pthread-based? MPI would definitely be the way to go for ensuring proper locality on a NUMA system, at the expense of all the icky manual data sharing. I've only done some basic use of OpenMP, so I don't know too much about the advanced details like optimizations for NUMA systems, but is it really applicable to your current algorithm?

    I remember reading some time ago about your implementation spawning a large number of threads - do you limit the number of work threads to the available hardware threads, or go beyond? Having excessive threads would likely thrash around too much on larger NUMA systems and be counterproductive.

    How independent are the threads? Is there much synchronization/sharing of data between them, or does each basically work on its own independent chunk of the series and combine in a tree-reduce fashion?
    Last edited by rcofell; 03-30-2011 at 08:36 AM.



  21. #671
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by alpha754293 View Post
    More stupid questions:

    will MPI or OMP help?

    How is it doing the multi-threading/multi-processing now anyways?

    Can threads do malloc()? (You might have actually said it already, but I'm a dummy non-programmer trying to wrap my head around programming concepts.)

    I've always been under the impression/belief that multi-processing can happen ONLY if it spawns slave processes. So the whole idea of multi-threading != multi-processing in my mind (or at least it doesn't seem to be as efficient or doesn't use the full resources available to it).
    Yes, threads can malloc(), but malloc() isn't thread safe. Some compilers can make it thread-safe with the right options, but that's done using locks - which limits scalability. I'll get to that later.

    Quote Originally Posted by rcofell View Post
    I've been meaning to post in here for a while, as I've got access to an 8-socket Barcelona machine (w/ ~80GB ram) here in our research group at school that I've tried running things on. The NUMA nature of the machine definitely causes sub-optimal scaling for the smaller workloads (those below 1B).

    I tried numactl for 1B and performance goes from 648sec (without) to 466sec (interleave all), so it does make a noticeable difference. Almost a 1.4X speedup in this case.

    For comparison (without numactl), 10B takes 5,379sec and 100M takes 110sec.


    Yeah, tying all of the memory allocation to one thread/node doesn't do so well with NUMA I imagine. Beyond just the interconnect bandwidth impact, coherency traffic must become a significant overhead as well.

    Determinism definitely is desirable with parallel programs, though in what capacity are you using the deterministic malloc()? Is it mainly for debug, or structured memory/pointer accesses? As for thread-safe, are you saying just using barriers to properly synchronize isn't good enough?

    At the expense of some memory capacity, could you have each thread malloc() a (relatively) small buffer to work out of, in the interest of creating some locality and doing burst transfers?

    I'm curious about the current implementation as well.

    Is it all pthread-based? MPI would definitely be the way to go for ensuring proper locality on a NUMA system, at the expense of all the icky manual data sharing. I've only done some basic use of OpenMP, so I don't know too much about the advanced details like optimizations for NUMA systems, but is it really applicable to your current algorithm?

    I remember reading some time ago about your implementation spawning a large number of threads - do you limit the number of work threads to the available hardware threads, or go beyond? Having excessive threads would likely thrash around too much on larger NUMA systems and be counterproductive.

    How independent are the threads? Is there much synchronization/sharing of data between them, or does each basically work on its own independent chunk of the series and combine in a tree-reduce fashion?

    For computation threads, threading is done using recursive thread-forking.
    For I/O threads, they are done using iterative thread-spawning.

    I'll only focus on the computation threads.

    Basically, there's a ton of binary divide-and-conquer crap in the program. Since each of these divide-and-conquer algorithms typically involves some sort of data merging/splitting at each stage, you can't easily split them into N threads at the start. So instead, I integrate the recursive thread-splitting into these binary divide-and-conquer algorithms.
    At each stage, I split the current thread into two. In Windows, this is done by spawning another thread. In Linux, it's done using recursive OpenMP.
    There's no pthread version yet, but it would basically be a copy/paste of the Windows code with some renaming. So far, OpenMP works just fine, so there's no need for pthreads yet.
    By adjusting the depth of the thread-splitting, I can change the # of threads. This is the primary reason why y-cruncher is locked to power-of-two threads.
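    (A toy illustration of the recursive fork pattern - each level of the divide-and-conquer splits the current thread in two until a target depth is reached, giving 2^depth leaf threads. The summing workload is just a stand-in, obviously not what y-cruncher computes:)
    Code:
    #include <thread>
    #include <vector>
    #include <numeric>
    #include <cstdio>
    
    // Recursively split [lo, hi) in half, forking one new thread per split,
    // until depth reaches 0. Produces 2^depth leaf threads.
    long long sum_range(const std::vector<long long>& v,
                        size_t lo, size_t hi, int depth) {
        if (depth == 0)
            return std::accumulate(v.begin() + lo, v.begin() + hi, 0LL);
    
        size_t mid = lo + (hi - lo) / 2;
        long long right = 0;
        std::thread t([&] { right = sum_range(v, mid, hi, depth - 1); });  // fork
        long long left = sum_range(v, lo, mid, depth - 1);  // current thread
        t.join();  // thread-join is the only synchronization point
        return left + right;
    }
    
    int main() {
        std::vector<long long> v(1 << 20, 1);
        printf("%lld\n", sum_range(v, 0, v.size(), 3));  // depth 3 -> 8 threads
        return 0;
    }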

    Some of the new code that I'm working on right now is able to do non-power-of-two threads using the same recursive splitting method. So in the future, I might be able to lift the power-of-two restriction on the program.

    If the routine needs memory, then the memory heap is split in half (or some pre-determined ratio). This is also done recursively along with the thread-splitting.
    All memory allocations that a thread does are on its own part of the heap. They never touch another thread's memory - hence why it's deterministic. No locks, no synchronization, no overhead.
    The only non-deterministic bugs are the ones where one thread accidentally touches another thread's memory. But these are rare unless an out-of-bound array access happens to cross a thread boundary in the heap.
    (In fact, I have yet to encounter a bug like this since I started this project)

    There's a lot of fancy math needed to prove that a routine will never use more than XX memory - especially when you factor in memory fragmentation.
    Another fancy protocol is needed to join threads when a large value needs to be returned via the heap.

    So the memory allocation is deterministic, and completely free of locks/synchronization. A memory allocation is typically 20-100 cycles depending on how much stuff is on the heap.
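    (Roughly, the heap management looks like this toy region allocator - a thread owns a contiguous range, allocations are simple bumps within it, and a fork hands the child thread its own sub-range. The field names and sizes are mine, not the actual internals:)
    Code:
    #include <cstddef>
    #include <cstdio>
    
    // Toy deterministic region allocator. A thread owns [base, base + size)
    // and bumps a cursor; split() carves off the tail for a forked child.
    // No locks anywhere - each thread only ever touches its own region.
    struct Region {
        char* base;
        size_t size;
        size_t used;
    
        void* alloc(size_t bytes) {
            bytes = (bytes + 63) & ~size_t(63);       // 64-byte granularity
            if (used + bytes > size) return nullptr;  // "never happens" once the
            void* p = base + used;                    // memory bound is proven
            used += bytes;
            return p;
        }
    
        Region split(size_t child_bytes) {
            size -= child_bytes;                      // parent keeps the front
            return Region{base + size, child_bytes, 0};
        }
    };
    
    int main() {
        static char heap[1 << 20];                      // the one big block
        Region parent{heap, sizeof(heap), 0};
        Region child = parent.split(sizeof(heap) / 2);  // pairs with a thread fork
        printf("parent %p, child %p\n", parent.alloc(1024), child.alloc(1024));
        return 0;
    }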

    Locking/synchronization isn't scalable, so there's none of it in the program. Thread-joining is the only form of "synchronization" that's used.

    OS-managed memory allocation isn't scalable for this reason. All that page committing stuff needs to access global data - using locks...
    To top it off, the OS needs to zero newly allocated memory before the program can use it. (This is for security reasons: so you can't, say, steal a password by logging into something, closing the program, and allocating that physical memory into another program.)


    As for pre-allocating a buffer for each thread to get better locality in NUMA - that's essentially the same as MPI as you have to explicitly move data in and out of that buffer.

    At any given point in the program, all the threads are entirely independent.
    When all threads complete their given tasks, they are assigned new tasks on different parts of memory. So no single part of memory is accessed by only one thread.

    Implementing this in MPI will require inter-node matrix transposes - which is a huge research topic in my field.
    In shared memory, these transposes can be eliminated by directly accessing the memory that's needed - this is what y-cruncher does right now.


    In any case, an MPI implementation will pretty much require a rewrite of all the high-level code. I don't have immediate plans for this. And when I do, it'll be built on top of the NUMA/shared-memory layer.


    EDIT:
    Yes, it's a tree-reduction.

    EDIT 2:
    The program does in fact use more threads than hardware threads. This is to reduce the impact of load imbalance.
    For systems with a non-power-of-two number of hardware threads, the thread count needs to be rounded up to the next power of two anyway.
    Last edited by poke349; 03-30-2011 at 10:45 AM. Reason: typo
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  22. #672
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Ontario
    Posts
    349
    Very interesting. I gotta say, I'm learning a LOT from this thread, although honestly I still don't understand a lot of it, only because I'm not a programmer. BUT... I THINK that I sorta get the gist of it (and pity that you're so far away, otherwise I think that I could really learn programming from you), since you're one of the few people I've EVER met who can explain it, and explain it well!

    Here's something interesting that I've noticed -- on the 48-core system that I've got at work - it's four sockets. I THINK that I have it set up so that there are four NUMA nodes.

    And when I run y-cruncher on it, it doesn't really seem to quite take full advantage of the hardware and that's evident in the CPU utilization reported by Windows Task Manager and also by the program itself.

    Is that (^ the above aforementioned ^) why it's like that?

    I'm only used to commercially available software that's MPI and/or OpenMP - maybe it's because it's commercial that it has very high CPU utilization?
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  23. #673
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by alpha754293 View Post
    Very interesting. I gotta say, I'm learning a LOT from this thread, although honestly I still don't understand a lot of it, only because I'm not a programmer. BUT... I THINK that I sorta get the gist of it (and pity that you're so far away, otherwise I think that I could really learn programming from you), since you're one of the few people I've EVER met who can explain it, and explain it well!

    Here's something interesting that I've noticed -- on the 48-core system that I've got at work - it's four sockets. I THINK that I have it set up so that there are four NUMA nodes.

    And when I run y-cruncher on it, it doesn't really seem to quite take full advantage of the hardware and that's evident in the CPU utilization reported by Windows Task Manager and also by the program itself.

    Is that (^ the above aforementioned ^) why it's like that?

    I'm only used to commercially available software that's MPI and/or OpenMP - maybe it's because it's commercial that it has very high CPU utilization?
    lol... I thought I was always bad at explaining stuff.
    Also, I don't always know what I'm talking about. So take my posts with a grain of salt.

    What you're seeing is load imbalance due to NUMA. What's happening is that some nodes run faster than others depending on where the data is. So you'll find one or two nodes lagging far behind the others - and when the threads on the faster nodes finish first, the slower threads will hang around a lot longer afterwards. So you see low CPU usage.

    I was confused as well when I first noticed this on the 4 x 4 that Skycrane sent.
    From the beginning, I had already suspected this was the cause. But I couldn't confirm it until I wrote some mini-benchmarks to specifically test for this.

    There may be more reasons to it, but so far that's the only explanation I have.
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  24. #674
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Ontario
    Posts
    349
    Quote Originally Posted by poke349 View Post
    lol... I thought I was always bad at explaining stuff.
    Also, I don't always know what I'm talking about. So take my posts with a grain of salt.

    What you're seeing is load imbalance due to NUMA. What's happening is that some nodes run faster than others depending on where the data is. So you'll find one or two nodes lagging far behind the others - and when the threads on the faster nodes finish first, the slower threads will hang around a lot longer afterwards. So you see low CPU usage.

    I was confused as well when I first noticed this on the 4 x 4 that Skycrane sent.
    From the beginning, I had already suspected this was the cause. But I couldn't confirm it until I wrote some mini-benchmarks to specifically test for this.

    There may be more reasons to it, but so far that's the only explanation I have.
    Well, if you need another quad-socket system to test stuff with, my quad dual-core (8 cores total) at home is available. It's not the fastest system by any stretch of the imagination now, and I'll still have access to my 48-core system at work for about another week (I'm switching jobs).

    And you're actually pretty good at explaining stuff. Better than most profs and even some programmers that I've met before. I'm pretty sure that you have a pretty good idea what you're talking about because otherwise, you probably wouldn't be writing the programs and doing what it is that you're doing. (Last I recall, you're doing your grad school...so I'd presume that you gotta know SOMETHING.)
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  25. #675
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by alpha754293 View Post
    Well, if you need another quad-socket system to test stuff with, my quad dual-core (8 cores total) at home is available. It's not the fastest system by any stretch of the imagination now, and I'll still have access to my 48-core system at work for about another week (I'm switching jobs).

    And you're actually pretty good at explaining stuff. Better than most profs and even some programmers that I've met before. I'm pretty sure that you have a pretty good idea what you're talking about because otherwise, you probably wouldn't be writing the programs and doing what it is that you're doing. (Last I recall, you're doing your grad school...so I'd presume that you gotta know SOMETHING.)
    Well, I may be in grad school, but I'm still just a first year... There's still a lot I don't know - especially supercomputer related things.
    The only difference between me and other grad students is that I have a bunch of machines to toy with. Everyone else needs to share the school's resources.
    And there's a lot you can't do when you don't have root or physical access to a machine you're trying to study.

    Stuff like playing with the CPU clock, messing with memory timings, enabling/disabling cores... are some of the unconventional things I do to see what's going on in a program.

    If you try that on a school machine, you'll get expelled.


    And I have some good news for the AMD fans...

    Visual Studio 2010 SP1 fixes the bug that was miscompiling AVX.

    It looks like I have a Windows binary that will run on Bulldozer.
    Interestingly, Visual Studio 2010 SP1 compiles a faster AVX binary than the Intel Compiler (about 1% faster)...

    Also, VS 2010 SP1 has support for FMA4 and XOP instructions. But since I don't plan on building another rig for a while, I won't be able to add support for FMA4 and XOP anytime soon.
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

