New Multi-Threaded Pi Program - Faster than SuperPi and PiFast

**poke349** · 07-27-2009, 11:06 AM

Originally Posted by Chumbucket843

i turned off the cores in msconfig so no background tasks would be placed on other cores. i noticed there was no 3 thread mode b/c when i had 3 threads it said i had 4. i will have a speed up graph soon but i am a newb at openoffice.

Yes, the program rounds the thread count up to the next power of two if it isn't one already.

The reason for limiting it to powers of two is ease of implementation and efficiency of code.

There's a crap-load of binary divide-and-conquering in virtually all the algorithms that are used. Stuff like that just doesn't work well with a non-power-of-two number of threads...

I've also found that the penalty of running extra threads is relatively small. Assuming that most computers now (and in the future) will have either a power-of-two # of cores or a "clean multiple" of one, this restriction is worth the ease of implementation.
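For anyone wondering what that rounding looks like, here's a minimal sketch in Python. It's only an illustration of "round up to the next power of two", not the program's actual code; the function name is made up.

```python
def round_up_to_power_of_two(threads: int) -> int:
    """Return the smallest power of two that is >= threads."""
    if threads < 1:
        raise ValueError("thread count must be at least 1")
    # (threads - 1).bit_length() is the number of bits needed to hold
    # threads - 1, so shifting 1 by that amount gives the next power of two.
    return 1 << (threads - 1).bit_length()


for requested in (1, 2, 3, 4, 6, 12):
    print(requested, "->", round_up_to_power_of_two(requested))
```

So asking for 3 threads gives 4, and a 12-core box ends up in 16-thread mode.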

**Hoss331** · 07-27-2009, 05:17 PM

Originally Posted by poke349

Anyways...
Anyone got a Core 2 @ 3.2 GHz? My workstation is, but it's down and it's not coming back online for a few more weeks.

Here's a Yorkfield at 3.2 even though it says 3.8 in the program, I guess it calculates MHz using only the stock multi.

**poke349** · 07-27-2009, 05:42 PM

Originally Posted by Hoss331

Here's a Yorkfield at 3.2 even though it says 3.8 in the program, I guess it calculates MHz using only the stock multi.

Could you do that run single-threaded? We were trying to compare the two in single-threaded mode (no bandwidth bottleneck) to see which (Core 2 or K10) has faster arithmetic for this program.

And I like how you set the FSB and mult to match my workstation.

It seems like on Core 2 it uses the stock multiplier. On i7, it uses the actual maximum multiplier (as set in BIOS) before Turbo Boost - but it never uses more than the stock multiplier.

Anyhow... I've some insane results coming in from someone in Japan with a very well tuned Dual Xeon W5580 rig with 72GB of ram... Benchmark sizes going all the way up to 32G with the help of Swap Mode...

I'll post those later. But they all lack verification checksums.

**Hoss331** · 07-27-2009, 06:50 PM

here you go, affinity set to 1 core

**poke349** · 07-27-2009, 06:58 PM

Originally Posted by Hoss331

here you go, affinity set to 1 core

Awesome

So the summary for Core 2 vs. Phenom II. (for y-cruncher)

Single-threaded (arithmetic speed test):

Phenom II @ 3.2 GHz - 55.1057
Core 2 Quad (12MB cache) @ 3.2 GHz - 49.5219

Multi-threaded (arithmetic + bandwidth test):

Phenom II @ 3.2 GHz - 15.455
Core 2 Quad (12MB cache) @ 3.2 GHz - 13.9467

If we did these runs on a Q6600 @ 3.2 GHz, that'll also settle the issue of cache size.

The two Q6600s that are already on the list are from v0.3.2.
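To put numbers on the scaling, here's a quick sketch that works out the multi-threaded speedup from the times above. The times are copied from the summary; the assumption that both multi-threaded runs used 4 threads is mine, so treat the efficiency figure as a rough guide.

```python
# Times in seconds, copied from the summary above: (single-threaded, multi-threaded).
results = {
    "Phenom II @ 3.2 GHz": (55.1057, 15.455),
    "Core 2 Quad (12MB cache) @ 3.2 GHz": (49.5219, 13.9467),
}

ASSUMED_THREADS = 4  # assumption: both multi-threaded runs used 4 threads


def speedup(single_s: float, multi_s: float) -> float:
    """How many times faster the multi-threaded run finished."""
    return single_s / multi_s


for cpu, (single_s, multi_s) in results.items():
    s = speedup(single_s, multi_s)
    print(f"{cpu}: {s:.2f}x speedup, {s / ASSUMED_THREADS:.0%} of ideal")
```

Both chips land at roughly 3.5x to 3.6x, and the Core 2 is about 11% faster in both the single-threaded and the multi-threaded run.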

**fasterklander** · 07-28-2009, 01:12 AM

I'm running a P4 3.15GHz

25M -->152.961s

50M -->344.768s

100M -->780.277s

**Hoss331** · 07-28-2009, 05:56 AM

Originally Posted by poke349

If we did these runs on a Q6600 @ 3.2 GHz, that'll also settle the issue of cache size.

The two Q6600s that are already on the list are from v0.3.2.

I'd also like to see an i7 do a single core single thread run, turbo off.

**spdycpu** · 07-28-2009, 07:39 AM

Originally Posted by Hoss331

I'd also like to see an i7 do a single core single thread run, turbo off.

I can do a 3.2GHz run on my i7 when I get home. It is 10:33am CST now, I should be able to get it run by 5:30pm.

Poke349: I finally got a new waterblock, the Heatkiller 3.0 CU. I can't believe the thing, 4.4GHz is 100% stable (Linx w/ 8 threads for 24 hours). That block with regular water is better than my old Apogee with ice water, no kidding. 65C full load at 4.2GHz, 1.3v. 75C full load at 4.4GHz, 1.38v. I'll give ice water a shot at some point, I really want to get a 4.6GHz run done. It would be nice if you could include some batch benchmarking.

For example you could set 3 loops then specify a range from X to Y. This way I could run 3 loops of each & save the fastest time and test times of 1m, 2m, 4m, 8m, 16m, etc digits as well as the 25, 50, 100, etc. Also outputting the fastest result to a file would be nice as not to need to copy/paste so much text. What do you think?

**poke349** · 07-28-2009, 08:03 AM

Originally Posted by Hoss331

I'd also like to see an i7 do a single core single thread run, turbo off.

I can do them, but my rig is tied up for a few more days. Looks like spdy beat me to it.

Originally Posted by spdycpu

I can do a 3.2GHz run on my i7 when I get home. It is 10:33am CST now, I should be able to get it run by 5:30pm.

Poke349: I finally got a new waterblock, the Heatkiller 3.0 CU. I can't believe the thing, 4.4GHz is 100% stable (Linx w/ 8 threads for 24 hours). That block with regular water is better than my old Apogee with ice water, no kidding. 65C full load at 4.2GHz, 1.3v. 75C full load at 4.4GHz, 1.38v. I'll give ice water a shot at some point, I really want to get a 4.6GHz run done. It would be nice if you could include some batch benchmarking.

For example you could set 3 loops then specify a range from X to Y. This way I could run 3 loops of each & save the fastest time and test times of 1m, 2m, 4m, 8m, 16m, etc digits as well as the 25, 50, 100, etc. Also outputting the fastest result to a file would be nice as not to need to copy/paste so much text. What do you think?

I completely agree with you. I just need to find the time to polish up my bulk compute add-on and release it.

3 runs of each - Good idea. I'll probably set that as a default with an option to override it. And I'll add a size-limit to looped runs - say 10 min. Otherwise those massive single-threaded 10 and 12b runs on my workstation will take days.

I can have it output the benchmarks to a separate text file.

Something like 3 categories:

Standard Sizes: 25m, 100m, 250m, etc... all validated - print the best times (with their validation) into a text file.

SuperPi Sizes: 1M, 2M, 4M, etc... all validated, same as above

Multi-core Scaling: 1m, 1.2m, 1.5m, 2m, 2.5m, etc*...
- Manually select threading mode
- No validation
*These are the sizes I used to generate those fancy multi-core scaling graphs.
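Roughly what I have in mind for the looped runs, as a Python sketch. This is not the bulk compute add-on itself: `best_times`, `format_report`, and `placeholder_run` are made-up names, and the placeholder workload stands in for launching the real benchmark.

```python
import time
from typing import Callable, Dict, Iterable


def best_times(run: Callable[[int], None], sizes: Iterable[int],
               loops: int = 3, loop_limit_s: float = 600.0) -> Dict[int, float]:
    """Time run(digits) up to `loops` times per size and keep the fastest."""
    best: Dict[int, float] = {}
    for digits in sizes:
        fastest = float("inf")
        for _ in range(loops):
            start = time.perf_counter()
            run(digits)
            elapsed = time.perf_counter() - start
            fastest = min(fastest, elapsed)
            if elapsed > loop_limit_s:
                break  # run took longer than the limit: don't repeat it
        best[digits] = fastest
    return best


def format_report(best: Dict[int, float]) -> str:
    """One line per size, smallest size first."""
    lines = [f"{digits:>14,} digits  {seconds:12.4f} s"
             for digits, seconds in sorted(best.items())]
    return "\n".join(lines) + "\n"


def placeholder_run(digits: int) -> None:
    """Stand-in workload; a real harness would launch the benchmark here."""
    sum(range(digits // 100))


report = format_report(best_times(placeholder_run, [1_000_000, 2_000_000, 4_000_000]))
print(report, end="")
# To save the results instead of copy/pasting them:
# with open("best_times.txt", "w") as f:
#     f.write(report)
```

The 600-second default is the "say 10 min" limit: a run that takes longer than that is timed once and not looped.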

I'd love to see a multi-core scaling graph from a pair of Gainestowns...

But I honestly doubt anyone will be patient enough to sit through single-threaded runs of 1b+.

For me, I just let it run while I'm at work, run overnight...

I also need a way to enforce processor affinity. I can't manually force it because I wouldn't know which cores are real and which are virtual from HT.

As for that... Time for some insaneness....

Results from Japan: http://ja0hxv.calico.jp/pai/pietc.html
Google translate it if you can't read Japanese. (I can't either...)

2 x Intel Xeon W5580 Gainestown @ 3.2 GHz
72 GB (18 x 4 GB) DDR3
Windows Server 2008

25m - 6.92
50m - 13.31
100m - 28.14
250m - 76.34
500m - 166.07
1b - 365.20
2.5b - 1,025.05
5b - 2,307.18
10b - 4,961 (1 hour, 22 min, 41 secs)
25b - 19,415 (5 hours, 23 min, 35 secs) - Done using Swap Mode*

1M - 0.37
2M - 0.67
4M - 1.21
8M - 2.31
16M - 4.47
32M - 8.75
64M - 18.02
128M - 38.18
256M - 82.63
512M - 185.41
1G - 398.09
2G - 868.54
4G - 1,928.29
8G - 4,235 (1 hour, 10 min, 35 secs)
16G - 11,892 (3 hours, 18 min, 12 secs) - Done using Swap Mode*
32G - 31,061 (8 hours, 37 min, 41 secs) - Done using Swap Mode*

One thing I have to say... This guy is NUTs...
He gets new workstations like this about once every half a year.

The last few he had are:

2 x Intel Xeon X5470
128 GB (16 x 8 GB) DDR2 FB-DIMM

2 x Intel Xeon X5460
64 GB (16 x 4 GB) DDR2 FB-DIMM

Not only that... He ACTUALLY ran this program for 8+ hours just for a benchmark. That's a pretty good stress test...

I've done longer runs than that (200+ hours), but that's because they were either tests, or were for size records. Not benchmarks...

*Swap Mode requires less memory but is significantly slower.
There's no validation for it, and it's available under the Custom Compute option.

Lastly... Dave, if you're here, you've got some SERIOUS competition.
This guy knows how to tune these things... enough to make his W5580s faster than your W5590s.

**Chumbucket843** · 07-28-2009, 12:50 PM

i was just on google trends and japan is the #1 country to search core i7.

**Chumbucket843** · 07-28-2009, 01:51 PM

deleted

**spdycpu** · 07-28-2009, 02:31 PM

Originally Posted by poke349

Awesome

So the summary for Core 2 vs. Phenom II. (for y-cruncher)

Single-threaded (arithmetic speed test):

Phenom II @ 3.2 GHz - 55.1057
Core 2 Quad (12MB cache) @ 3.2 GHz - 49.5219

Multi-threaded (arithmetic + bandwidth test):

Phenom II @ 3.2 GHz - 15.455
Core 2 Quad (12MB cache) @ 3.2 GHz - 13.9467

If we did these runs on a Q6600 @ 3.2 GHz, that'll also settle the issue of cache size.

The two Q6600s that are already on the list are from v0.3.2.

i7 @ 3.2GHz, 3.6GHz Uncore, Memory @ 1600 7-7-6-16.

Single:

Code:

Benchmark Successful. The digits appear to be OK.

Program Version:    0.4.1 Build 7408 (x64 SSE3)
Processor(s):       Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
CPU Frequency:      3,192,005,951 Hz  (frequency may be inaccurate)
Thread(s):          1
Digits:             25,000,000
Total Time:         44.5555 seconds
Checksum:           506bd9db81dfe73a07ae66fb5da8af7e

Multi (with HT):

Code:

Benchmark Successful. The digits appear to be OK.

Program Version:    0.4.1 Build 7408 (x64 SSE3)
Processor(s):       Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
CPU Frequency:      3,192,005,119 Hz  (frequency may be inaccurate)
Thread(s):          8
Digits:             25,000,000
Total Time:         11.4947 seconds
Checksum:           d2ec2f25569fffbd04422301296a783b

**poke349** · 07-28-2009, 06:24 PM

Originally Posted by Chumbucket843

i was just on google trends and japan is the #1 country to search core i7.

lolz...

They always have the newest gadgets for just about everything except for maybe processors... since both Intel and AMD are US-based...

Originally Posted by spdycpu

i7 @ 3.2GHz, 3.6GHz Uncore, Memory @ 1600 7-7-6-16.

Nice... At some point, I'm gonna need to make a database on my site. But I don't have the time for all that... argh...

And you're hitting 4.6 on plain water? That's just insane...

Because 5GHz is already LN2 territory. Those benches will be interesting and hard to beat.

Today, I got a nice look at a 96-core 16 x Dunnington machine with 512 GB ram at a fair... Too bad it was too busy for me to try any benches...

**Ket** · 07-30-2009, 06:05 AM

Not bad considering I'm running SETI atm as well

The reported CPU frequency is wrong, it's 3.6GHz. I'll be back with some proper results when my PC9200 turns up.

**poke349** · 08-01-2009, 08:04 PM

I got some new fans.

A pair of these:
http://www.newegg.com/Product/Produc...82E16835213009
(I didn't get them from newegg though.)

Speed controlled, they are just as quiet as my old ones with slightly more airflow. I run it at this speed normally...

At full power, I can't hear myself talk...

Now I can safely hit 4.2GHz on air - with more room to spare.

This was a stress test more than a benchmark. I intentionally left RealTemp and CPUz on to monitor it.

The temps peaked at 84C. They hit 90C when I benched 4GHz with my old fans.

**Particle** · 08-04-2009, 09:41 AM

Greetings, poke. I like your benchmark program--it's quite nice. Have you by chance experienced an issue where it doesn't seem to hit all cores very effectively? I've got a 12-core machine where it seems to stay in the 40-60% CPU range for smaller benchmarks (under 32M) and 60-80% for larger ones. It never actually "pegs" so to speak.

**poke349** · 08-04-2009, 10:10 AM

Originally Posted by Particle

Greetings, poke. I like your benchmark program--it's quite nice. Have you by chance experienced an issue where it doesn't seem to hit all cores very effectively? I've got a 12-core machine where it seems to stay in the 40-60% CPU range for smaller benchmarks (under 32M) and 60-80% for larger ones. It never actually "pegs" so to speak.

Yes, it's a fundamental issue with this type of task. Hence why it's taken a while...

Pi, by its very nature, doesn't parallelize as well as wprime or any other "artificially made" task.

There are several reasons why your cores aren't kept busy 100% of the time:

- Load imbalance. Most types of scientific computing like this don't split evenly (or at least it's not easy to do so). So some threads will finish before others. When this happens, the threads that are done need to wait for the others.
- Not every part of the computation is parallelized. Fast operations like additions and subtractions are limited by memory bandwidth so they will not benefit from multi-threading.
- Thread creation and destruction have a lot of overhead. When the working size for a particular operation is small enough, the overhead of thread creation becomes greater than the benefit of threading. At this point, the program doesn't parallelize it - hence less than 100% cpu.
- Refresh rate of Task Manager. Task Manager and other monitors average cpu usage over a period of time. If the computation is small, there won't be any period of sustained 100% cpu long enough to average 100%.

The larger the computation, the smaller the effect of these inefficiencies, and the higher the cpu usage.
With 12 cores, you're probably gonna need to go above 1 billion digits to get cpu usage averaging > 90%.
You WILL need to go up to several billion digits to achieve sustained 100% cpu that can last a few minutes. Most people don't have that kind of ram so it isn't suitable as a stress-test unless you run multiple instances.
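A rough way to see why even a small non-parallel part hurts on 12 cores is Amdahl's law. The sketch below is the textbook model only. The serial fractions are made-up illustration values, not measurements from the program, and the model ignores load imbalance and threading overhead.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Amdahl's law: speedup when only the parallel part scales with cores."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def average_utilization(serial_fraction: float, cores: int) -> float:
    """Fraction of total core-time doing useful work under that model."""
    return amdahl_speedup(serial_fraction, cores) / cores


CORES = 12
for serial in (0.10, 0.05, 0.01, 0.001):  # hypothetical serial fractions
    print(f"serial {serial:6.1%}: speedup {amdahl_speedup(serial, CORES):5.2f}x, "
          f"avg cpu {average_utilization(serial, CORES):.0%}")
```

Under this model, 12 cores only average above 90% when about 1% or less of the work is serial. That is the same direction as the point above: the bigger the computation, the smaller those inefficient parts become relative to the whole.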

CPU usage can be improved if I allow multi-threading to increase memory usage... But that gets prohibitive after a while. This type of computing is already enough of a memory hog as it is. So I prefer the ability to hit larger sizes.

So in some sense, computing Pi is a benchmark that "more closely" resembles real-life scientific computing.

EDIT:
So yes. Have fun with it.

Tell all those Pi fanatics... they'll need to move away from those C2D's to stay competitive.

jk

**Mechromancer** · 08-04-2009, 01:50 PM

Are these guys with 16-thread Xeon systems not hitting 100% CPU usage either?

**Particle** · 08-04-2009, 01:58 PM

Is it possible to specify a manual thread count? Inspired by HT people, I'd like to try a run at 24 threads to see if things stay busier.

**poke349** · 08-04-2009, 02:14 PM

Originally Posted by Mechromancer

Are these guys with 16-thread Xeon systems not hitting 100% CPU usage either?

I definitely wouldn't expect them to... But CPU usage as a whole can be fairly deceiving. Even though it "doesn't look" to be efficient, you're still getting massive speed up.

Here's the 8-core @ 1b screenie on my website:
I consider this a "good" graph. Mostly @ 100% but with dips every 10 - 20 seconds...

In most cases it won't be as efficient as this.

The only time where I've gotten near sustained 100% cpu is during one of the world size-records I set back in April:

Same computer: 8 cores @ 31 billion digits of a different constant

This kind of efficiency... is only achievable if you have either a REALLY SLOW computer, or if you have a completely stupid amount of ram...

Originally Posted by Particle

Is it possible to specify a manual thread count? Inspired by HT people, I'd like to try a run at 24 threads to see if things stay busier.

Good thinking.

The program actually already does that.

When you run N threads, it will usually run 2N and occasionally 4N threads.

In any case, if you have a non-power of 2 cores, it rounds up. So on your rig, the program is running in 16-core mode which uses anywhere from 16 - 64 threads. (There's an option in task manager that shows how many threads a process is using.)

You can manually set your settings in the "Custom Compute a Constant" option. But there's no validation.

Just remember that higher % cpu usage doesn't always mean faster time.

**Particle** · 08-04-2009, 02:40 PM

Oh wow...I see you're creating and destroying threads during the computation cycle itself. In my own programming I've found it to be a good idea to create x number of threads and then use them all to process pieces of work dispatched from a synchronous controller. That may or may not be practical or applicable to your particular algorithm of course--I won't pretend to be familiar with your project.

In any case, that does make sense now. I saw as few as 8 threads and as many as 60-some.
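For reference, the fixed-pool pattern I'm describing looks roughly like this in Python. It's a generic sketch of "create x threads once and feed them pieces of work", not a proposal for this program's internals; the squaring workload and the function names are made up, and in CPython a pool like this shows the structure rather than a real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def process_chunk(chunk: Sequence[int]) -> int:
    """Placeholder piece of work: sum of squares of one chunk."""
    return sum(x * x for x in chunk)


def run_with_pool(data: List[int], workers: int = 4, chunk_size: int = 1000) -> int:
    """Create `workers` threads once, then dispatch every chunk to them."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands chunks to idle workers; no thread is created per chunk.
        return sum(pool.map(process_chunk, chunks))


print(run_with_pool(list(range(10_000))))
```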

**poke349** · 08-04-2009, 03:06 PM

Originally Posted by Particle

Oh wow...I see you're creating and destroying threads during the computation cycle itself. In my own programming I've found it to be a good idea to create x number of threads and then use them all to process pieces of work dispatched from a synchronous controller. That may or may not be practical or applicable to your particular algorithm of course--I won't pretend to be familiar with your project.

In any case, that does make sense now. I saw as few as 8 threads and as many as 60-some.

I've thought about using thread pools, but I decided against it for a few reasons:

- I couldn't figure out how to use that API.
- The program was written for extremely large computations (for breaking size-records). And on large computations, threading overhead is negligible.
- My intuition told me that a synchronous work-dispatcher might have problems scaling into "many" cores... by many, I mean tens or hundreds... (Specifically, it would take linear time to dispatch N-loads of work for N cores, whereas recursive thread-creation would take only log(N) provided that the memory allocator was efficient.)
- Lastly, I was just plain lazy...
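To show what I mean by the log(N) part, here's a toy sketch of recursive thread-creation in Python. It's an illustration of the idea only, not the program's code: `parallel_sum` and `spawn_depth` are made-up names, and summing a list is just a placeholder workload.

```python
import threading
from typing import List


def parallel_sum(data: List[int], lo: int, hi: int, threads: int) -> int:
    """Sum data[lo:hi], splitting in half and spawning one thread per split."""
    if threads <= 1 or hi - lo <= 1:
        return sum(data[lo:hi])
    mid = (lo + hi) // 2
    left = [0]

    def work() -> None:
        left[0] = parallel_sum(data, lo, mid, threads // 2)

    helper = threading.Thread(target=work)
    helper.start()  # the new thread goes on to spawn its own helpers
    right = parallel_sum(data, mid, hi, threads - threads // 2)
    helper.join()
    return left[0] + right


def spawn_depth(threads: int) -> int:
    """Spawns any single thread does back-to-back before all are running."""
    depth = 0
    while threads > 1:
        threads = (threads + 1) // 2  # the larger half after each split
        depth += 1
    return depth


print(parallel_sum(list(range(1000)), 0, 1000, 8))
for n in (2, 16, 64, 256):
    print(f"{n} threads: {n - 1} spawns from one dispatcher vs {spawn_depth(n)} levels recursively")
```

A single dispatcher has to start the N - 1 other threads one after another, while with recursive splitting no thread does more than about log2(N) spawns in a row.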

**Calmatory** · 08-05-2009, 02:11 PM

If only there were Linux binaries...

**poke349** · 08-05-2009, 06:53 PM

If I had more experience with Linux, then there would be a binary for it...

"Eventually", I'll have Linux binaries... whenever I get the time...

The entire program has been written to be easily ported to Linux... So I don't expect it to be too hard to do so when the time comes.

Have you tried running it under Wine?

**Chumbucket843** · 08-06-2009, 06:28 AM

if you compiled for linux and the cell i could run this thing on my ps3.

actually the cell has been beaten in flops by x86 by now though and its a PITA to work with.
