# Thread: New Multi-Threaded Pi Program - Faster than SuperPi and PiFast

1. HTT is an interesting oddity. Sometimes it works really well. Otherwise, not so much. And there doesn't really seem to be any general trend one way or another.

Nice system. I'm still hoping to go up to 32 cores and 128 GB of RAM by June, but I don't know about that, as I'm also planning on finishing off my master's. (I don't really NEED 32 cores right now, but it would be a fun (and expensive) toy to play with.)

3. OMG... I just found a HUGE bug in the program... I feel like Intel now. lol
It affects Log(2), Log(10), and the Euler-Mascheroni Constant for very large computations...

It's probably been there since v0.4.1... One of the optimizations I did botched the Arc Hyperbolic Cotangent code, and it affects everything that depends on it.

Hardly anyone (including me) runs these 3 constants, which is why the bug has been allowed to linger for sooooooo long.

Damn... I'm gonna need to take my SB rig off WCG for a few days to fix it...

EDIT:
My Harpertown rig should finish verifying the fix in a couple of days... I'll probably release the patched version if it passes.
The more comprehensive tests will be done simultaneously on my Harpertown rig and my SB rig. But since my SB rig doesn't have as much RAM or disk bandwidth, I'm starting it now. My Harpertown rig will probably catch up in no time - hopefully they'll finish at the same time, 4-5 days from now.

Kicked an i980x's ass

St0ned - i2600K @ 5.1 GHz, 8 GB RAM @ 2133 MHz 10-12-10-28 1T

Code:
Validation Version:    1.1

Program:               y-cruncher - Gamma to the eXtReMe!!!     ( www.numberworld.org )
Copyright 2008-2011 Alexander J. Yee    ( a-yee@northwestern.edu )

User:                  None Specified - You can edit this in "Username.txt".

Processor(s):          Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
Logical Cores:         8
Physical Memory:       8,569,065,472 bytes  ( 8.00 GB )
CPU Frequency:         3,400,077,024 Hz

Program Version:       0.5.5 Build 9178 Alpha (x64 AVX - Windows ~ Hina)
Constant:              Pi
Algorithm:             Chudnovsky Formula
Decimal Digits:        1,000,000,000
Computation Mode:      Ram Only
Swap Disks:            0
Working Memory:        4.75 GB

Start Date:            Fri Feb 18 22:06:04 2011
End Date:              Fri Feb 18 22:11:04 2011

Computation Time:      284.704 seconds
Total Time:            300.343 seconds

CPU Utilization:           742.08 %
Multi-core Efficiency:     92.76 %

Last Digits:
6434543524 2766553567 4357021939 6394581990 5483278746  :  999,999,950
7139868209 3196353628 2046127557 1517139511 5275045519  :  1,000,000,000

Timer Sanity Check:        Passed
Frequency Sanity Check:    Passed
ECC Recovered Errors:      0
Checkpoint From:           None

----

Checksum:   890fdff76350813617e46f49a997a7ee37d4c700fd4c643b19a26237424fa2e7
Pic:

Ok, after the 1st nice result, I went for the kill. Yes, I was a bit greedy, but hey, can you blame me? Such a nice piece of silicon, I'm in love. Volts for both runs were the same.
Code:
Validation Version:    1.1

Program:               y-cruncher - Gamma to the eXtReMe!!!     ( www.numberworld.org )
Copyright 2008-2011 Alexander J. Yee    ( a-yee@northwestern.edu )

User:                  None Specified - You can edit this in "Username.txt".

Processor(s):          Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
Logical Cores:         8
Physical Memory:       8,569,065,472 bytes  ( 8.00 GB )
CPU Frequency:         3,400,035,360 Hz

Program Version:       0.5.5 Build 9178 Alpha (x64 AVX - Windows ~ Hina)
Constant:              Pi
Algorithm:             Chudnovsky Formula
Decimal Digits:        1,000,000,000
Computation Mode:      Ram Only
Swap Disks:            0
Working Memory:        4.75 GB

Start Date:            Fri Feb 18 22:20:59 2011
End Date:              Fri Feb 18 22:25:53 2011

Computation Time:      279.048 seconds
Total Time:            294.340 seconds

CPU Utilization:           740.41 %
Multi-core Efficiency:     92.55 %

Last Digits:
6434543524 2766553567 4357021939 6394581990 5483278746  :  999,999,950
7139868209 3196353628 2046127557 1517139511 5275045519  :  1,000,000,000

Timer Sanity Check:        Passed
Frequency Sanity Check:    Passed
ECC Recovered Errors:      0
Checkpoint From:           None

----

Checksum:   c496b0ffc62ffcdf077409de852166e56c13134b07ed842710e229eb0c670dfc
Pics!!:

2% improvement... oh well, could have been worse :P

5. Originally Posted by st0ned
Kicked an i980x's ass

St0ned - i2600K @ 5.1 GHz, 8 GB RAM @ 2133 MHz 10-12-10-28 1T

Ok, after the 1st nice result, I went for the kill. Yes, I was a bit greedy, but hey, can you blame me? Such a nice piece of silicon, I'm in love. Volts for both runs were the same.

2% improvement... oh well, could have been worse :P

Nice!!! 5+ GHz

Well... what did you expect? 5.1 GHz -> 5.2 GHz is less than 2%

I'll update the charts later.

6. hey poke, it's been a few days, any updates ??????

7. Originally Posted by skycrane
hey poke, it's been a few days, any updates ??????
Nothing big so far. Been using it to run little code snippets to see why certain things run so well (or poorly) on AMDs.

Currently, there's one critical subroutine in y-cruncher (and two more in development) that runs horrifically slowly clock-for-clock on K10 compared to Core 2/ix. (This is all in cache, so NUMA isn't a factor.)
Been trying to figure out the cause of this and improve it if possible.

For example, I'm finding out that it's nearly impossible to sustain the theoretical 1 SSE add + 1 SSE multiply per cycle, even when I write meaningless code designed specifically to sustain that throughput.
This is something I can easily achieve on all my post-Core 2 machines, but I'm having a hard time even reaching 70% of the theoretical throughput - perhaps I misread the AMD K10 docs or something.
(And I'm getting consistent results across 3 different compilers - so it's not ICC being "unfair". lol)

In the meantime, I've been doing some NUMA ray-tracing on it with 2 other people. So it's my "learning process" for NUMA.
Ray-tracing is obviously easier to parallelize and less memory intensive than something like computing Pi...

One thing is: NUMA programming is HARD - a lot harder and a lot trickier than shared-memory or MPI...
At first I thought it was as simple as duplicating resources into all the nodes, but that doesn't really work when it won't fit on a node.

8. Looks like I finally broke the SATA on my SB rig...
That was fast - 2 months only.

One of the 4 HDs dropped out on me. Then it refuses to get past the splash screen on boot.
None of the HDs show up in the BIOS. Once I unplug all of them, everything is fine...

Still waiting for that newegg RMA email...

9. Pity that the i2600K only has 4 cores and that most of the motherboards don't support (I think) > 32 GB.

10. Originally Posted by alpha754293
Pity that the i2600K only has 4 cores and that most of the motherboards don't support (I think) > 32 GB.
I'd be all over a Sandy Bridge machine with more than 32GB of ram...

if it was affordable that is...

The dual-socket Sandies don't come out till later this year - but I don't plan on building another rig for a long time.

A 2 x 8-core Sandy with 128GB or 256GB of ram would be a pretty amazing code-testing machine... (more so than my Harpertown rig back in 2009)
But that would probably cost more than my car...

11. Been playing with Linux and NUMA for some time now on the rig Skycrane sent me.

If anyone (with a NUMA machine) is interested, you might be able to get a speedup if you install the NUMA control package (numactl) and play with the settings.

Try running y-cruncher like this:

numactl --interleave=all "./x64 SSE3.out"

or whatever version is compatible.
I haven't checked if there is a Windows equivalent.

It doesn't "make" the program NUMA-aware, but it does seem to distribute the memory traffic more evenly across the cores.
y-cruncher has a problem of having a huge amount of memory/interconnect traffic concentrated on the "primary" NUMA node. Interleaving the memory seems to spread it out a bit.

12. That's my upload, Validation - Pi - 500,000,000
main computer, not OCed

13. Originally Posted by poke349
Been playing with Linux and NUMA for some time now on the rig Skycrane sent me.

If anyone (with a NUMA machine) is interested, you might be able to get a speedup if you install the NUMA control package (numactl) and play with the settings.

Try running y-cruncher like this:

numactl --interleave=all "./x64 SSE3.out"

or whatever version is compatible.
I haven't checked if there is a Windows equivalent.

It doesn't "make" the program NUMA-aware, but it does seem to distribute the memory traffic more evenly across the cores.
y-cruncher has a problem of having a huge amount of memory/interconnect traffic concentrated on the "primary" NUMA node. Interleaving the memory seems to spread it out a bit.
I don't think that there's numactl for Cygwin. I tried looking for it, and I also tried to compile it from source and get the following error:

Code:
$ make
Makefile:179: .depend: No such file or directory
cc -MM -DDEPS_RUN -I. bitops.c libnuma.c distance.c memhog.c numactl.c numademo.
c numamon.c shm.c stream_lib.c stream_main.c syscall.c util.c mt.c clearcache.c
test/*.c > .depend.X && mv .depend.X .depend
cc -g -Wall -O2  -I.   -c -o numactl.o numactl.c
cc -g -Wall -O2  -I.   -c -o util.o util.c
util.c: In function `memsize':
util.c:80: warning: array subscript has type `char'
cc -g -Wall -O2  -I.   -c -o shm.o shm.c
shm.c: In function `attach_shared':
shm.c:129: error: storage size of `st' isn't known
shm.c:140: warning: implicit declaration of function `fstat64'
shm.c:143: warning: implicit declaration of function `ftruncate64'
shm.c:157: warning: implicit declaration of function `mmap64'
shm.c:157: warning: assignment makes pointer from integer without a cast
shm.c:129: warning: unused variable `st'
make: *** [shm.o] Error 1
Source from http://oss.sgi.com/projects/libnuma/

14. Originally Posted by alpha754293
I don't think that there's numactl for Cygwin. I tried looking for it, and I also tried to compile it from source and get the following error:

Code:
$ make
Makefile:179: .depend: No such file or directory
cc -MM -DDEPS_RUN -I. bitops.c libnuma.c distance.c memhog.c numactl.c numademo.
c numamon.c shm.c stream_lib.c stream_main.c syscall.c util.c mt.c clearcache.c
test/*.c > .depend.X && mv .depend.X .depend
cc -g -Wall -O2  -I.   -c -o numactl.o numactl.c
cc -g -Wall -O2  -I.   -c -o util.o util.c
util.c: In function `memsize':
util.c:80: warning: array subscript has type `char'
cc -g -Wall -O2  -I.   -c -o shm.o shm.c
shm.c: In function `attach_shared':
shm.c:129: error: storage size of `st' isn't known
shm.c:140: warning: implicit declaration of function `fstat64'
shm.c:143: warning: implicit declaration of function `ftruncate64'
shm.c:157: warning: implicit declaration of function `mmap64'
shm.c:157: warning: assignment makes pointer from integer without a cast
shm.c:129: warning: unused variable `st'
make: *** [shm.o] Error 1
Source from http://oss.sgi.com/projects/libnuma/
I'm getting speedups of up to 30% using numactl interleave on Linux. The speedup depends on the size - for smaller sizes, it sometimes backfires.
I'll accept any settings as valid benchmarks. Tuning the OS like this is part of the game, lol.

As far as I can tell, only Windows Server has any sort of NUMA-awareness support. But you need Win Server anyway to get more than 2 sockets in the first place.
There might be a setting somewhere in the OS that can be set to force interleaved memory allocation. If there isn't, maybe there's a special WinAPI malloc() function that does interleaved allocation. If I find it, I'll try it out.

But in any case, interleaved memory allocation isn't a "solution" to NUMA.
It lets the program get the full memory bandwidth, but it doesn't get rid of the interconnect contention and latency.

The program will still need to be redesigned to run well on NUMA.

EDIT:
Got my B3 stepping today. I finally get my SATA ports back. (I still don't get how I managed to kill them in only 2 months...)
Gonna have to re-test my OC tomorrow after the TIM sets. That thermal pad on the H50 was great, but it's only good for the first use - afterwards it's too uneven, so I had to scrape it all off and use some of my leftover Arctic Silver.

15. Originally Posted by poke349
I'm getting speedups of up to 30% using numactl interleave on Linux. The speedup depends on the size - for smaller sizes, it sometimes backfires.
I'll accept any settings as valid benchmarks. Tuning the OS like this is part of the game, lol.

As far as I can tell, only Windows Server has any sort of NUMA-awareness support. But you need Win Server anyway to get more than 2 sockets in the first place.
There might be a setting somewhere in the OS that can be set to force interleaved memory allocation. If there isn't, maybe there's a special WinAPI malloc() function that does interleaved allocation. If I find it, I'll try it out.

But in any case, interleaved memory allocation isn't a "solution" to NUMA.
It lets the program get the full memory bandwidth, but it doesn't get rid of the interconnect contention and latency.

The program will still need to be redesigned to run well on NUMA.

EDIT:
Got my B3 stepping today. I finally get my SATA ports back. (I still don't get how I managed to kill them in only 2 months...)
Gonna have to re-test my OC tomorrow after the TIM sets. That thermal pad on the H50 was great, but it's only good for the first use - afterwards it's too uneven, so I had to scrape it all off and use some of my leftover Arctic Silver.
Okay - correct me if I'm wrong, but I thought that NUMA meant that a CPU is fastest when it works on data that's closest to it?

I know that I had to play around with NUMA settings for Folding@Home (at the behest of the folks at the forum over there), so that when I ran "coreinfo", it showed that each socket (on a quad-socket system) was on its own NUMA node.

And I suppose you're right too - with fewer than four sockets it probably doesn't matter as much (unless you're going in between nodes).

Hmmm....*confused*

16. Originally Posted by alpha754293
Okay - correct me if I'm wrong, but I thought that NUMA meant that a CPU is fastest when it works on data that's closest to it?

I know that I had to play around with NUMA settings for Folding@Home (at the behest of the folks at the forum over there), so that when I ran "coreinfo", it showed that each socket (on a quad-socket system) was on its own NUMA node.

And I suppose you're right too - with fewer than four sockets it probably doesn't matter as much (unless you're going in between nodes).

Hmmm....*confused*
You are correct, but there's an even bigger problem with the program.
y-cruncher allocates all the memory it needs in a single gigantic malloc(). The OS will put that entire memory block on the node that called the malloc(). So the result is that most (if not all) of the memory will be concentrated in a single node.

The 4 nodes will all try to access that same memory - all of which is in one node. So not only do you have contention on the interconnect, but the aggregate memory bandwidth is only equal to that of one socket. (since all the memory is concentrated there)
If the memory is interleaved, you still don't get rid of the interconnect traffic, but at least now you have 4 sockets of memory bandwidth.

This problem does not occur in "normal" applications that do lots of small allocations. For these normal apps, the OS will bias all the allocations on the nodes that called them - so memory accesses are mostly local.
But since y-cruncher does one massive malloc() at the beginning, the OS is not able to do this.

There were several major reasons why I made the program do a single malloc() and its own memory management, as opposed to letting the OS handle it.
1. Memory allocation is expensive - (hundreds/thousands of cycles?)
2. malloc() is not thread-safe. Placing locks around it is not scalable. Even if it is thread-safe (with the right compiler options), it's still implemented with locks.
3. Memory allocation is not deterministic. The address that malloc() returns differs between program instances even with the exact same settings.
The memory manager I put into the program is deterministic - even with multi-threading.
It's a lot easier to hold together 100,000 lines of multi-threaded code if it's deterministic.

I made this decision at around October 2008 or so (quite a while ago) - but still before the first version of y-cruncher. It's worked fine since then.

For NUMA, this approach obviously backfires like crazy. It's still possible to make it work without giving up determinism or scalability - but it's messy. It's something I'm probably gonna put off for a few major releases while I redesign/rewrite the program.
There's about 80,000 lines of (nearly) completed code that haven't been enabled yet because the new internal interface is incompatible with what y-cruncher currently uses.
It's gonna take a while to migrate the program to that new interface.
The NUMA layer will eventually be built on top of all the existing "normal" threading code - though I have yet to work out the details.

17. Originally Posted by poke349
You are correct, but there's an even bigger problem with the program.
y-cruncher allocates all the memory it needs in a single gigantic malloc(). The OS will put that entire memory block on the node that called the malloc(). So the result is that most (if not all) of the memory will be concentrated in a single node.

The 4 nodes will all try to access that same memory - all of which is in one node. So not only do you have contention on the interconnect, but the aggregate memory bandwidth is only equal to that of one socket. (since all the memory is concentrated there)
If the memory is interleaved, you still don't get rid of the interconnect traffic, but at least now you have 4 sockets of memory bandwidth.

This problem does not occur in "normal" applications that do lots of small allocations. For these normal apps, the OS will bias all the allocations on the nodes that called them - so memory accesses are mostly local.
But since y-cruncher does one massive malloc() at the beginning, the OS is not able to do this.

There were several major reasons why I made the program do a single malloc() and its own memory management, as opposed to letting the OS handle it.
1. Memory allocation is expensive - (hundreds/thousands of cycles?)
2. malloc() is not thread-safe. Placing locks around it is not scalable. Even if it is thread-safe (with the right compiler options), it's still implemented with locks.
3. Memory allocation is not deterministic. The address that malloc() returns differs between program instances even with the exact same settings.
The memory manager I put into the program is deterministic - even with multi-threading.
It's a lot easier to hold together 100,000 lines of multi-threaded code if it's deterministic.

I made this decision at around October 2008 or so (quite a while ago) - but still before the first version of y-cruncher. It's worked fine since then.

For NUMA, this approach obviously backfires like crazy. It's still possible to make it work without giving up determinism or scalability - but it's messy. It's something I'm probably gonna put off for a few major releases while I redesign/rewrite the program.
There's about 80,000 lines of (nearly) completed code that haven't been enabled yet because the new internal interface is incompatible with what y-cruncher currently uses.
It's gonna take a while to migrate the program to that new interface.
The NUMA layer will eventually be built on top of all the existing "normal" threading code - though I have yet to work out the details.
More stupid questions:

will MPI or OMP help?

How is it doing the multi-threading/multi-processing now anyways?

Can threads do malloc()? (You might have actually said it already, but I'm a dummy non-programmer trying to wrap my head around programming concepts.)

I've always been under the impression/belief that multi-processing can happen ONLY if it spawns slave processes. So the whole idea of multi-threading != multi-processing in my mind (or at least it doesn't seem to be as efficient or doesn't use the full resources available to it).

18. Had some spare time to run this on one of my older rigs, just for fun

It's a HP ML350G4 as shown in my sig.

25M
Code:
Validation Version:    1.1

Program:               y-cruncher - Gamma to the eXtReMe!!!     ( www.numberworld.org )
Copyright 2008-2011 Alexander J. Yee    ( a-yee@northwestern.edu )

User:                  None Specified - You can edit this in "Username.txt".

Processor(s):          Intel(R) Xeon(TM) CPU 3.00GHz
Logical Cores:         4
Physical Memory:       4,227,399,680 bytes  ( 3.93 GB )
CPU Frequency:         3,000,194,895 Hz

Program Version:       0.5.5 Build 9179 (fix 1) (x64 SSE3 - Windows)
Constant:              Pi
Algorithm:             Chudnovsky Formula
Decimal Digits:        25,000,000
Computation Mode:      Ram Only
Swap Disks:            0
Working Memory:        207 MB

Start Date:            Wed Mar 30 14:03:49 2011
End Date:              Wed Mar 30 14:04:32 2011

Computation Time:      40.462 seconds
Total Time:            43.022 seconds

CPU Utilization:           373.11 %
Multi-core Efficiency:     93.27 %

Last Digits:
3803750790 9491563108 2381689226 7224175329 0045253446  :  24,999,950
0786411592 4597806944 2455112852 2554677483 6191884322  :  25,000,000

Timer Sanity Check:        Passed
Frequency Sanity Check:    Passed
ECC Recovered Errors:      0
Checkpoint From:           None

----

Checksum:   67dec58e652f6b1d0fea1feafad483d2799e90c7c5db8b200c858b7f86d6f727
50M
Code:
Validation Version:    1.1

Program:               y-cruncher - Gamma to the eXtReMe!!!     ( www.numberworld.org )
Copyright 2008-2011 Alexander J. Yee    ( a-yee@northwestern.edu )

User:                  None Specified - You can edit this in "Username.txt".

Processor(s):          Intel(R) Xeon(TM) CPU 3.00GHz
Logical Cores:         4
Physical Memory:       4,227,399,680 bytes  ( 3.93 GB )
CPU Frequency:         3,000,202,463 Hz

Program Version:       0.5.5 Build 9179 (fix 1) (x64 SSE3 - Windows)
Constant:              Pi
Algorithm:             Chudnovsky Formula
Decimal Digits:        50,000,000
Computation Mode:      Ram Only
Swap Disks:            0
Working Memory:        316 MB

Start Date:            Wed Mar 30 14:06:08 2011
End Date:              Wed Mar 30 14:07:43 2011

Computation Time:      89.224 seconds
Total Time:            94.464 seconds

CPU Utilization:           380.51 %
Multi-core Efficiency:     95.12 %

Last Digits:
4127897300 0153683630 8346732220 0943329365 1632962502  :  49,999,950
5130045796 0464561703 2424263071 4554183801 7945652654  :  50,000,000

Timer Sanity Check:        Passed
Frequency Sanity Check:    Passed
ECC Recovered Errors:      0
Checkpoint From:           None

----

Checksum:   1db010eae8d71aad16d348835969f377f22f7419e93594ee04695171dcf2eb5e
100M
Code:
Validation Version:    1.1

Program:               y-cruncher - Gamma to the eXtReMe!!!     ( www.numberworld.org )
Copyright 2008-2011 Alexander J. Yee    ( a-yee@northwestern.edu )

User:                  None Specified - You can edit this in "Username.txt".

Processor(s):          Intel(R) Xeon(TM) CPU 3.00GHz
Logical Cores:         4
Physical Memory:       4,227,399,680 bytes  ( 3.93 GB )
CPU Frequency:         3,000,202,223 Hz

Program Version:       0.5.5 Build 9179 (fix 1) (x64 SSE3 - Windows)
Constant:              Pi
Algorithm:             Chudnovsky Formula
Decimal Digits:        100,000,000
Computation Mode:      Ram Only
Swap Disks:            0
Working Memory:        536 MB

Start Date:            Wed Mar 30 14:08:34 2011
End Date:              Wed Mar 30 14:12:04 2011

Computation Time:      200.057 seconds
Total Time:            210.020 seconds

CPU Utilization:           386.74 %
Multi-core Efficiency:     96.68 %

Last Digits:
9948682556 3967530560 3352869667 7734610718 4471868529  :  99,999,950
7572203175 2074898161 1683139375 1497058112 0187751592  :  100,000,000

Timer Sanity Check:        Passed
Frequency Sanity Check:    Passed
ECC Recovered Errors:      0
Checkpoint From:           None

----

Checksum:   8f72a3e29cc2c459101624ad135c899f8894f4d30000bad9123c253fa1ba71ee
250M
Code:
Validation Version:    1.1

Program:               y-cruncher - Gamma to the eXtReMe!!!     ( www.numberworld.org )
Copyright 2008-2011 Alexander J. Yee    ( a-yee@northwestern.edu )

User:                  None Specified - You can edit this in "Username.txt".

Processor(s):          Intel(R) Xeon(TM) CPU 3.00GHz
Logical Cores:         4
Physical Memory:       4,227,399,680 bytes  ( 3.93 GB )
CPU Frequency:         3,000,188,655 Hz

Program Version:       0.5.5 Build 9179 (fix 1) (x64 SSE3 - Windows)
Constant:              Pi
Algorithm:             Chudnovsky Formula
Decimal Digits:        250,000,000
Computation Mode:      Ram Only
Swap Disks:            0
Working Memory:        1.25 GB

Start Date:            Wed Mar 30 14:13:07 2011
End Date:              Wed Mar 30 14:23:09 2011

Computation Time:      576.015 seconds
Total Time:            602.238 seconds

CPU Utilization:           390.76 %
Multi-core Efficiency:     97.69 %

Last Digits:
3673748634 2742427296 0219667627 3141599893 4569474921  :  249,999,950
9958866734 1705167068 8515785208 0067520395 3452027780  :  250,000,000

Timer Sanity Check:        Passed
Frequency Sanity Check:    Passed
ECC Recovered Errors:      0
Checkpoint From:           None

----

Checksum:   2d5e582a64c81add039e51df177d4a69a1f84aa395364b6915776ddfb6e1841f
500M
Code:
Validation Version:    1.1

Program:               y-cruncher - Gamma to the eXtReMe!!!     ( www.numberworld.org )
Copyright 2008-2011 Alexander J. Yee    ( a-yee@northwestern.edu )

User:                  None Specified - You can edit this in "Username.txt".

Processor(s):          Intel(R) Xeon(TM) CPU 3.00GHz
Logical Cores:         4
Physical Memory:       4,227,399,680 bytes  ( 3.93 GB )
CPU Frequency:         3,000,192,863 Hz

Program Version:       0.5.5 Build 9179 (fix 1) (x64 SSE3 - Windows)
Constant:              Pi
Algorithm:             Chudnovsky Formula
Decimal Digits:        500,000,000
Computation Mode:      Ram Only
Swap Disks:            0
Working Memory:        2.42 GB

Start Date:            Wed Mar 30 14:25:08 2011
End Date:              Wed Mar 30 14:47:20 2011

Computation Time:      1,275.159 seconds
Total Time:            1,331.560 seconds

CPU Utilization:           392.47 %
Multi-core Efficiency:     98.11 %

Last Digits:
3896531789 0364496761 5664275325 5483742003 7847987772  :  499,999,950
5002477883 0364214864 5906800532 7052368734 3293261427  :  500,000,000

Timer Sanity Check:        Passed
Frequency Sanity Check:    Passed
ECC Recovered Errors:      0
Checkpoint From:           None

----

Checksum:   10ef6c30486f7b58013d61d24782c8cda6cc98b3c21e5dbc19074b403e5e0b98
I also ran a few on my DL380, but it's powered down ATM so I can't get the results. I think I did up to 1B there with 3 disks as swap.
I'll take some time at some point and do some more on the ML350 using swap.
For the DL320, I need to take the RAM out of the 380, since it only has 1 GB and 2 SATA 80 GB disks which are in RAID 1, so those won't help...much

19. I saw that nobody had yet uploaded a result from an AMD X6, so I searched my old screens and found one, maybe it helps you.

Soya

20. Originally Posted by poke349
I'm getting speedups of up to 30% using numactl interleave on Linux. The speedup depends on the size - for smaller sizes, it sometimes backfires.
I'll accept any settings as valid benchmarks. Tuning the OS like this is part of the game, lol.

As far as I can tell, only Windows Server has any sort of NUMA-awareness support. But you need Win Server anyway to get more than 2 sockets in the first place.
There might be a setting somewhere in the OS that can be set to force interleaved memory allocation. If there isn't, maybe there's a special WinAPI malloc() function that does interleaved allocation. If I find it, I'll try it out.

But in any case, interleaved memory allocation isn't a "solution" to NUMA.
It lets the program get the full memory bandwidth, but it doesn't get rid of the interconnect contention and latency.

The program will still need to be redesigned to run well on NUMA.
I've been meaning to post in here for a while, as I've got access to an 8-socket Barcelona machine (w/ ~80 GB RAM) here in our research group at school that I've tried running things on. The NUMA nature of the machine definitely causes sub-optimal scaling for the smaller workloads (those below 1B).

I tried numactl for 1B and performance goes from 648sec (without) to 466sec (interleave all), so it does make a noticeable difference. Almost a 1.4X speedup in this case.

For comparison (without numactl), 10B takes 5,379sec and 100M takes 110sec.

Originally Posted by poke349
You are correct, but there's an even bigger problem with the program.
y-cruncher allocates all the memory it needs in a single gigantic malloc(). The OS will put that entire memory block on the node that called the malloc(). So the result is that most (if not all) of the memory will be concentrated in a single node.

The 4 nodes will all try to access that same memory - all of which is in one node. So not only do you have contention on the interconnect, but the aggregate memory bandwidth is only equal to that of one socket. (since all the memory is concentrated there)
If the memory is interleaved, you still don't get rid of the interconnect traffic, but at least now you have 4 sockets of memory bandwidth.

This problem does not occur in "normal" applications that do lots of small allocations. For these normal apps, the OS will bias all the allocations on the nodes that called them - so memory accesses are mostly local.
But since y-cruncher does one massive malloc() at the beginning, the OS is not able to do this.

There were several major reasons why I made the program do a single malloc() and its own memory management, as opposed to letting the OS handle it.
1. Memory allocation is expensive - (hundreds/thousands of cycles?)
2. malloc() is not thread-safe. Placing locks around it is not scalable. Even if it is thread-safe (with the right compiler options), it's still implemented with locks.
3. Memory allocation is not deterministic. The address that malloc() returns differs between program instances even with the exact same settings.
The memory manager I put into the program is deterministic - even with multi-threading.
It's a lot easier to hold together 100,000 lines of multi-threaded code if it's deterministic.
Yeah, tying all of the memory allocation to one thread/node doesn't do so well with NUMA, I imagine. Beyond just the interconnect bandwidth impact, coherency traffic must become a significant overhead as well.

Determinism is definitely desirable in parallel programs, though in what capacity are you using the deterministic malloc()? Is it mainly for debugging, or for structured memory/pointer accesses? As for thread safety, are you saying that just using barriers to properly synchronize isn't good enough?

At the expense of some memory capacity, could you have each thread malloc() a (relatively) small buffer to work out of, in the interest of creating some locality and doing burst transfers?
Originally Posted by alpha754293
More stupid questions:

will MPI or OMP help?

How is it doing the multi-threading/multi-processing now anyways?

Can threads do malloc()? (You might have actually said it already, but I'm a dummy non-programmer trying to wrap my head around programming concepts.)

I've always been under the impression/belief that multi-processing can happen ONLY if it spawns slave processes. So the whole idea of multi-threading != multi-processing in my mind (or at least it doesn't seem to be as efficient or doesn't use the full resources available to it).
I'm curious about the current implementation as well

Is it all pthread-based? MPI would definitely be the way to go for ensuring proper locality on a NUMA system, at the expense of all the icky manual data sharing. I've only done some basic use of OpenMP, so I don't know too much of the advanced details like optimizations for NUMA systems, but is it really applicable to your current algorithm?

I remember reading sometime ago about your implementation spawning a large amount of threads, do you limit the number of work threads to the available hardware threads, or go beyond? Having excessive threads would likely thrash around too much with larger NUMA systems and be counter productive.

How independent are the threads? Is there much synchronization/sharing of data between them, or does each basically work on its own independent chunk of the series and combine in a tree-reduce fashion?

21. Originally Posted by alpha754293
More stupid questions:

will MPI or OMP help?

How is it doing the multi-threading/multi-processing now anyways?

Can threads do malloc()? (You might have actually said it already, but I'm a dummy non-programmer trying to wrap my head around programming concepts.)

I've always been under the impression/belief that multi-processing can happen ONLY if it spawns slave processes. So the whole idea of multi-threading != multi-processing in my mind (or at least it doesn't seem to be as efficient or doesn't use the full resources available to it).
Yes, threads can call malloc(), but malloc() isn't thread-safe. Some compilers can make it thread-safe with the right options, but that's done using locks - which limits scalability. I'll get to that later.

Originally Posted by rcofell
I've been meaning to post in here for a while, as I've got access to an 8-socket Barcelona machine (w/ ~80GB ram) here in our research group at school that I've tried running things on. The NUMA nature of the machine definitely causes sub-optimal scaling for the smaller workloads (those below 1B).

I tried numactl for 1B and performance goes from 648sec (without) to 466sec (interleave all), so it does make a noticeable difference. Almost a 1.4X speedup in this case.

For comparison (without numactl), 10B takes 5,379sec and 100M takes 110sec.

Yeah, tying all of the memory allocation to one thread/node doesn't do so well with NUMA I imagine. Beyond just the interconnect bandwidth impact, coherency traffic must become a significant overhead as well.

Determinism definitely is desirable with parallel programs, though in what capacity are you using the deterministic malloc()? Is it mainly for debug, or structured memory/pointer accesses? As for thread-safe, are you saying just using barriers to properly synchronize isn't good enough?

At the expense of memory capacity trade-off, could you have each thread malloc() a (relatively) small buffer to work out of? In the interest of creating some locality and doing burst transfers.

I'm curious about the current implementation as well

Is it all pthread-based? MPI would definitely be the way to go for ensuring proper locality on a NUMA system, at the expense of all the icky manual data sharing. I've only done some basic use of OpenMP, so I don't know too much of the advanced details like optimizations for NUMA systems, but is it really applicable to your current algorithm?

I remember reading sometime ago about your implementation spawning a large amount of threads, do you limit the number of work threads to the available hardware threads, or go beyond? Having excessive threads would likely thrash around too much with larger NUMA systems and be counter productive.

How independent are the threads? Is there much synchronization/sharing of data between them, or does each basically work on its own independent chunk of the series and combine in a tree-reduce fashion?

I'll only focus on the computation threads.

Basically, there's a ton of binary divide-and-conquer crap in the program. Since each of these divide-and-conquer algorithms typically involve some sort of data merging/splitting at each stage, you can't easily split them into N threads at the start. So instead, I integrate the recursive thread-splitting into these binary divide-and-conquer algorithms.
At each stage, I split the current thread into two. In Windows, this is done by spawning another thread. In Linux, it's done using recursive OpenMP.
There's no pthread version yet, but it would basically be a copy/paste of the Windows code with some renaming. So far, OpenMP works just fine, so there's no need for pthreads yet.
By adjusting the depth of the thread-splitting, I can change the # of threads. This is the primary reason why y-cruncher is locked to power-of-two threads.

Some of the new code that I'm working on right now is able to do non-power-of-two threads using the same recursive splitting method. So in the future, I might be able to lift the power-of-two restriction on the program.

If the routine needs memory, then the memory heap is split in half (or some pre-determined ratio). This is also done recursively along with the thread-splitting.
All memory allocations that a thread does are on its own part of the heap. Threads never touch each other's memory - hence why it's deterministic. No locks, no synchronization, no overhead.
The only non-deterministic bugs are the ones where one thread accidentally touches another thread's memory. But these are rare unless an out-of-bounds array access happens to cross a thread boundary in the heap.
(In fact, I have yet to encounter a bug like this since I started this project)

There's a lot of fancy math needed to prove that a routine will never use more than XX memory - especially when you factor in memory fragmentation.
Another fancy protocol is needed to join threads when a large value needs to be returned via the heap.

So the memory allocation is deterministic, and completely free of locks/synchronization. A memory allocation is typically 20-100 cycles depending on how much stuff is on the heap.

Locking/synchronization isn't scalable, so there's none of it in the program. Thread-joining is the only form of "synchronization" that's used.

OS-managed memory allocation isn't scalable for this reason. All that page-committing stuff needs to access global data - using locks...
To top it off, the OS needs to zero newly allocated memory before the program can use it. (This is for security: otherwise you could steal a password by logging into something, closing the program, and then getting that physical memory allocated into another program.)

As for pre-allocating a buffer for each thread to get better locality in NUMA - that's essentially the same as MPI as you have to explicitly move data in and out of that buffer.

At any given point in the program, all the threads are entirely independent.
When all threads complete their given tasks, they are assigned new tasks on different parts of memory. So over the whole run, no single part of memory is accessed by only one thread.

Implementing this in MPI will require inter-node matrix transposes - which is a huge research topic in my field.
In shared memory, these transposes can be eliminated by directly accessing the memory that's needed - this is what y-cruncher does right now.

In any case, an MPI implementation will pretty much require a rewrite of all the high-level code. I don't have immediate plans for this. And when I do, it'll be built on top of the NUMA/shared-memory layer.

EDIT:
Yes, it's a tree-reduction.

EDIT 2:
The program does in fact use more threads than hardware threads. This is to reduce the impact of load imbalance.
For systems with a non-power-of-two number of hardware threads, the thread count needs to be rounded up to the next power of two anyway.

22. Very interesting. I gotta say, I'm learning a LOT from this thread, although honestly I still don't understand a lot of it, only because I'm not a programmer. BUT... I THINK that I sorta get the gist of it (and it's a pity that you're so far away, otherwise I think I could really learn programming from you), since you're one of the few people I've EVER met who can explain it and explain it well!

Here's something interesting that I've noticed -- on the 48-core system that I've got at work - it's four sockets. I THINK that I have it set up so that there are four NUMA nodes.

And when I run y-cruncher on it, it doesn't really seem to quite take full advantage of the hardware and that's evident in the CPU utilization reported by Windows Task Manager and also by the program itself.

Is that (^ the above aforementioned ^) why it's like that?

I'm only used to commercially available software that's MPI- and/or OpenMP-based, and maybe because it's commercial, it has very high CPU utilization?

23. Originally Posted by alpha754293
Very interesting. I gotta say, I'm learning a LOT from this thread, although honestly I still don't understand a lot of it, only because I'm not a programmer. BUT... I THINK that I sorta get the gist of it (and it's a pity that you're so far away, otherwise I think I could really learn programming from you), since you're one of the few people I've EVER met who can explain it and explain it well!

Here's something interesting that I've noticed -- on the 48-core system that I've got at work - it's four sockets. I THINK that I have it set up so that there are four NUMA nodes.

And when I run y-cruncher on it, it doesn't really seem to quite take full advantage of the hardware and that's evident in the CPU utilization reported by Windows Task Manager and also by the program itself.

Is that (^ the above aforementioned ^) why it's like that?

I'm only used to commercially available software that's MPI- and/or OpenMP-based, and maybe because it's commercial, it has very high CPU utilization?
lol... I thought I was always bad at explaining stuff.
Also, I don't always know what I'm talking about. So take my posts with a grain of salt.

What you're seeing is load imbalance due to NUMA. What's happening is that some nodes run faster than others depending on where the data is. So you'll find one or two nodes lagging far behind the others - and when the threads on the faster nodes finish first, the slower threads will hang around a lot longer afterwards. So you see low CPU usage.

I was confused as well when I first noticed this on the 4 x 4 that Skycrane sent.
From the beginning, I had already suspected this was the cause. But I couldn't confirm it until I wrote some mini-benchmarks to specifically test for this.

There may be more reasons to it, but so far that's the only explanation I have.

24. Originally Posted by poke349
lol... I thought I was always bad at explaining stuff.
Also, I don't always know what I'm talking about. So take my posts with a grain of salt.

What you're seeing is load imbalance due to NUMA. What's happening is that some nodes run faster than others depending on where the data is. So you'll find one or two nodes lagging far behind the others - and when the threads on the faster nodes finish first, the slower threads will hang around a lot longer afterwards. So you see low CPU usage.

I was confused as well when I first noticed this on the 4 x 4 that Skycrane sent.
From the beginning, I had already suspected this was the cause. But I couldn't confirm it until I wrote some mini-benchmarks to specifically test for this.

There may be more reasons to it, but so far that's the only explanation I have.
Well, if you need another quad-socket system to test stuff with, my quad-socket dual-core (8 cores total) at home is available. It's not the fastest system by any stretch of the imagination now, and I'll still have access to my 48-core system at work for about another week (I'm switching jobs).

And you're actually pretty good at explaining stuff. Better than most profs and even some programmers that I've met before. I'm pretty sure that you have a pretty good idea what you're talking about because otherwise, you probably wouldn't be writing the programs and doing what it is that you're doing. (Last I recall, you're doing your grad school...so I'd presume that you gotta know SOMETHING.)

25. Originally Posted by alpha754293
Well, if you need another quad-socket system to test stuff with, my quad-socket dual-core (8 cores total) at home is available. It's not the fastest system by any stretch of the imagination now, and I'll still have access to my 48-core system at work for about another week (I'm switching jobs).

And you're actually pretty good at explaining stuff. Better than most profs and even some programmers that I've met before. I'm pretty sure that you have a pretty good idea what you're talking about because otherwise, you probably wouldn't be writing the programs and doing what it is that you're doing. (Last I recall, you're doing your grad school...so I'd presume that you gotta know SOMETHING.)
Well, I may be in grad school, but I'm still just a first year... There's still a lot I don't know - especially supercomputer related things.
The only difference between me and other grad students is that I have a bunch of machines to toy with. Everyone else needs to share the school's resources.
And there's a lot you can't do when you don't have root or physical access to a machine you're trying to study.

Stuff like playing with the CPU clock, messing with memory timings, enabling/disabling cores... are some of the unconventional things I do to see what's going on in a program.

If you try that on a school machine, you'll get expelled.

And I have some good news for the AMD fans...

Visual Studio 2010 SP1 fixes the bug that was miscompiling AVX.

It looks like I have a Windows binary that will run on Bulldozer.
Interestingly, Visual Studio 2010 SP1 compiles a faster AVX binary than the Intel Compiler (about 1% faster)...

Also, VS 2010 SP1 has support for FMA4 and XOP instructions. But since I don't plan on building another rig for a while, I won't be able to add support for FMA4 and XOP anytime soon.