Ok, so how many records did I break? :sofa:
First, fix the screenie for your 32M run. :p:
I'd like to see what the 32M SuperPi record of 6 and half minutes on LN2 turns into... :rofl::rofl::rofl:
Otherwise every single run except for the 1M run is a new record. :ROTF:
On that note:
For the SuperPi-sized records on the download page, you should be able to clean-sweep everything from 2M all the way to 1G with 6GB of ram.
The 1M time of .405 seconds is hard to beat without OCing because it's too small of a computation to get much benefit from multi-threading.
This better?:D
http://img505.imageshack.us/img505/2135/ycrunch32mu.jpg
What? No updates yet? It's been 5 mins! :stick:
http://www.numberworld.org/y-cruncher/#Benchmarks
I'll do it tonight. I don't have access to my webserver right now.:rolleyes:
Interesting, your 32M time (9.45s) is slower than the W5580s (9.30s)... even though your memory is probably faster. Seems like that person did some serious tweaking.
You might have to do the same to beat those numbers. But for something larger like 256M, 512M, or 1G, your clock and memory speed advantage should beat any tweak.
For something as small as 32M, there's a lot of thread-creation/destruction overhead. So you might want to disable HT or use the Custom Compute mode to override the thread settings and use fewer threads. At these sizes, the program probably spends a significant amount of time creating and destroying threads... bleh... Then again, I never optimized the program for small computations.
Another possible reason is that since your memory is faster, your timings are more relaxed. I found that Nehalems have more memory bandwidth than the program needs. So tighter timings and slower memory might be better.
@ El Greco
I think we have a winner here! The first person to show up with a non-power-of-two core count! :D:D:D
You didn't unlock the 4th right?
Here's a 25m and 50m
25m on the left, 50m on the right.
http://img8.imageshack.us/img8/3496/...00mhzi7.th.jpg
100m (no cpu-z in this one)
http://img525.imageshack.us/img525/1230/100m.th.jpg
100m is weird. I've run it about 5 times so far, and I've had about 4 failures and 1 pass. Apparently I'm right on the edge of stability for that test.
1m then 32m
http://img299.imageshack.us/img299/8...m4200i7.th.jpg
edit2: and yes, dave is defeated in 1m @ least :p:
take that old man
Updated the list (on the first post of this thread) with some of the SuperPi-sized benchmarks.
I also "think" I've fixed "Sanity Check Error" for the smaller SuperPi-sized benchmarks.
Here's a little something interesting...
I ran some automated benchmarks on my Lan-Box to see how the new version scales with multiple cores.
Here are the results: Main Page
http://www.numberworld.org/y-crunche...raph_small.jpg
The graph shows how many times faster a multi-threaded run is than a single-threaded run.
Obviously, 2 threads cannot do any better than 2x improvement and 4 threads cannot do better than 4x.
When the computation is small, the amount of time spawning threads dominates actual computation time. Therefore it scales very poorly for small computations. The bigger you go, the better it scales.
Notice that below ~5 million digits, Hyper-Threading with 8 threads is slower than 4 threads without HT. This is because the benefits of HT are outweighed by the overhead of spawning double the threads.
So if you have an i7 and you want to get the fastest 1M, 2M, or 4M times, try disabling HT...
This wasn't the case in the older versions. Because of a number of optimizations and bug-fixes, v0.4.1 doesn't scale as well as the older versions for small computations. (But it scales slightly better for large computations.)
Interesting results with SMT on i7. I think it's hilarious your program can calculate 10 trillion digits of pi. Is it stable after using more than 46 gigs?
That's 10 billion. Trillion has another set of zeros... (I lose count around that point too... :D)
Yes, the program is 64-bit. So it has no trouble using as much memory as it wants. As for the computer, it isn't even OC'ed, so yes, it's perfectly stable with 64GB, no problems.
I know of one person who benched this program with 128GB of ram... (namely dual Xeon X5470 with 16 x 8GB DDR2 FB-DIMM)
The program will do a LOT more than just 10 billion.
The current version allows up to 200 billion digits. Though it's only been tested up to 31 billion.
As for what the "true" limit of the program is... I actually have no idea...
I don't see any major "wall" in the implementation until at least 10^16 digits...
But no computer in the world will have the ram or the computational power to test it - not even Road Runner.
My Phenom II wants a spot on the list right ahead of those two Q6600s @ 3.2GHz :)
Code:
Program Version: 0.4.1 Build 7412 (fix 1) (x64 SSE3)
Processor(s): AMD Phenom(tm) II X4 940 Processor
CPU Frequency: 3210978905
Thread(s): 2^2
Digits: 25000000
Total Time: 15.455
Checksum: 006b916ff8c0d1b40b57fc77474331ca
http://i145.photobucket.com/albums/r...y-Cruncher.jpg
My laptop is utterly WEAKSAUCE! I need a Phenom II X4 laptop (If you ask why, you're not xtreme).
For this run I had to clock my memory to 800mhz as I'm running 4-Dimms of mismatched RAM. One set of Corsair Dominators and one set of OCZ Reapers. I haven't figured out timings that allow all four to work at 1066mhz yet. I suppose it's still not a bad score.
http://i145.photobucket.com/albums/r...uncher1Bil.jpg
Yeah, I'm surprised at how evenly Phenom II and Core 2 Quad are matched. The program was written and tuned on two machines: Pentium D and Harpertown (the only two I had at the time). So I'd expect Intels to run faster. But obviously that isn't really the case.
C2Q is still a tiny bit faster though, but not by much. (Your run is a bit faster because of the newer version.)
Mine isn't much better... 1.6 GHz Core Duo...:D I think the 25m time is like 120s or so... I'm waiting for some cheap quad-cores...
Mismatched? What were you running before?
EDIT: Nevermind, I see it in your siggy. I suppose you could try some really conservative timings to see what happens...
And no, it isn't a bad score... The fact that you have enough ram to do 1b automatically makes it a good score... :D since full ram configurations are harder to OC.
EDIT: There isn't really such a thing as a good or bad score... since the range of hardware on that list is massive... (From Atom to Gainestown...) So it's only fair to compare with hardware similar to yours.
That new version is alot faster, same setup as before.
http://i131.photobucket.com/albums/p...mapImage12.jpg
I was thinking the exact opposite. Core 2 has better SSE throughput than K10, but Core 2 is limited by memory bandwidth.
Now for a quick disclaimer on yet-another "sensitive" issue of Intel vs. AMD:
Before anybody yells at me for drawing a conclusion that Intel has faster arithmetic than AMD, this is merely my guesstimate based on the benchmarks. In no way does it indicate that Intel or AMD is better.
Since the vast majority of the program was written and tuned on Pentium and Harpertown (which is a Core 2), I'd expect there to be some favoring towards Intel.
As for the memory bandwidth issue, I've noticed that the program scales pretty poorly on Core 2 Quads... But the only ones I've played with are the Q6600 and Q9400 - both of which have significantly smaller caches than Harpertown.
If we want to throw out the bandwidth factor to determine which (Core 2 or K10) has better arithmetic throughput for this program, we'll need to do a single-threaded benchmark comparison between a Core 2 and a K10 at the same frequency.
My guess is that Core 2 will win (simply because I tuned for it), but I unfortunately don't have access to any K10s to try it.
Anyone have both and care enough to try that? :)
Don't worry about saying one performs better than the other. The truth is both AMD and Intel uarchs have their strengths and weaknesses. If they were both good and bad at the same tasks and performed exactly the same it would make for a pretty boring conversation. Differences are good for this reason.
@Poke and Chumbucket: You two come to different conclusions as to where the performance and bottlenecks are on each platform. I suggest that you may both be correct. AMD has a great HyperTransport platform to work with, and needs to be pushed at full load to shine. Core 2 gets choked up under load because of Chumbucket's explanation and FSB bandwidth anemia.
PS: Dropping my memory from 1066 to 800 only increased my 25M score by less than 0.3 seconds. Not as bad of a hit as I was expecting.
EDIT: Single Threaded Phenom II 25M Test
http://i145.photobucket.com/albums/r...leThreaded.jpg
But then... I'm comparing Core 2 and K10. I've completely left Core i7 out of the equation.
Anyways...
Anyone got a Core 2 @ 3.2 GHz? My workstation is, but it's down and it's not coming back online for a few more weeks.
.3 seconds ~ 2%. That isn't much. Going from triple to dual channel @ 1600 MHz on my friend's i7 only made about 2% difference @ 100m... (the larger you go, the bigger the impact of memory bandwidth)
I know on dual-Harpertown, the difference between 667MHz and 800 MHz is huge - like 10% @ 500m... and it gets bigger and bigger as you scale up the size.
My rig,
Dual Xeon X5482 @ 3.2 GHz + 64 GB (16 x 4GB) @ 800 MHz
beats,
Dual Xeon X5470 @ 3.33 GHz + 128 GB (16 x 8GB) @ 667 MHz
by around 5 - 10% at 250m - 1b.
These aren't your normal desktops, so ignore the sheer quantity of ram.:rofl: It's the speed that matters. :yepp:
There's probably a threshold somewhere where extra bandwidth isn't going to help much. Though I haven't bothered to try to find it.
P.S.
Toradora = WIN!!!!! (sorry for looking at your desktop icons :ROTF:)
Yes, I like me some manga and anime :yepp:. I have no prob with anybody looking at my icons. If I did, I would have edited it out like the rest of you pr0n addicts. Ok jk, I really just wanted everybody to see how leet I am for having old games like Wing Commander Prophecy, Freespace 2, and Lock On which I still play with my Suncom F-15 E Talon+SFS Throttle HOTAS.
Also, I'm JohnnyNismo from www.Houston240sx.com if anyone cares. My drift b***h is once again my daily driver so no fun for me anymore.
Here's a Core 2 Quad at stock. One of the reasons for poor scaling is Core 2's cache: it has no shared L3, so the cores only get 2MB of L2 each.
http://i25.tinypic.com/2rp5oj9.jpg
Those were done using processor affinity right? Because there's no 3-thread mode, and 2-thread mode will use up to 4 threads.
Also I think a big reason is that the program was tuned with 3MB of cache per thread. So massive cache spilling on a bandwidth-limited system will have major penalties.
Here's Q9400 scaling... Also very bad. I also don't know what was causing all of the variation in the benchmarks.
http://www.numberworld.org/y-crunche...raph_small.jpg
Main Page
Here's some very old results with version 0.2.1 on my workstation:
Scaling seems to hit a wall at 7x. I'm almost certain it's the memory bandwidth.
http://www.numberworld.org/y-crunche...raph_small.jpg
Main Page
I need to integrate a bulk-bench option that will generate the data for these graphs without having to do each benchmark by hand.
I already have an automated benchmark add-on (hence how I did all these runs), but it has no interface yet - all options are set in the source code and I have to recompile it every time I change a setting.
I turned off the cores in msconfig so no background tasks would be placed on other cores. I noticed there was no 3-thread mode b/c when I had 3 threads it said I had 4. I will have a speed-up graph soon, but I'm a newb at OpenOffice.
Yes, the program will round the thread count up to the next power of two if it isn't one already.
The reason for limiting it to powers of two is ease of implementation and efficiency of code.
There's a crap-load of binary divide-and-conquering in virtually all the algorithms that are used. Stuff like that just doesn't work well with a non-power-of-two thread count...
I've also found that the penalty of running extra threads is relatively small. Assuming that most computers now (and future) will have either a power-of-two # of cores or a "clean multiple" of one, this restriction is worth the ease of implementation.
Here's a Yorkfield at 3.2GHz, even though it says 3.8 in the program; I guess it calculates MHz using only the stock multi.
http://i131.photobucket.com/albums/p...apImage12m.jpg
Could you do that run single-threaded? We were trying to compare the two in single-threaded mode (no bandwidth bottleneck) to see which (Core 2 or K10) has faster arithmetic for this program.
And I like how you set the FSB and mult to match my workstation. :D
It seems like on Core 2 it uses the stock multiplier. On i7, it uses the actual maximum multiplier (as set in BIOS) before Turbo Boost - but it never uses more than the stock multiplier.
Anyhow... I've got some insane results coming in from someone in Japan with a very well tuned Dual Xeon W5580 rig with 72GB of ram... Benchmark sizes going all the way up to 32G with the help of Swap Mode...
I'll post those later. But they all lack verification checksums.
here you go, affinity set to 1 core
http://i131.photobucket.com/albums/p...pImage15pi.jpg
Awesome :D
So the summary for Core 2 vs. Phenom II. (for y-cruncher)
Single-threaded (arithmetic speed test):
Phenom II @ 3.2 GHz - 55.1057
Core 2 Quad (12MB cache) @ 3.2 GHz - 49.5219
Multi-threaded (arithmetic + bandwidth test):
Phenom II @ 3.2 GHz - 15.455
Core 2 Quad (12MB cache) @ 3.2 GHz - 13.9467
If we did these runs on a Q6600 @ 3.2 GHz, that'll also settle the issue of cache size. :D
The two Q6600s that are already on the list are from v0.3.2.
I'm running a P4 @ 3.15GHz
25M -->152.961s
http://img34.imageshack.us/img34/4516/25mb.th.jpg
50M -->344.768s
http://img199.imageshack.us/img199/7016/50mb.th.jpg
100M -->780.277s
http://img145.imageshack.us/img145/9734/100mb.th.jpg
:up:
I can do a 3.2GHz run on my i7 when I get home. It is 10:33am CST now, I should be able to get it run by 5:30pm.
Poke349: I finally got a new waterblock, the Heatkiller 3.0 CU. I can't believe the thing, 4.4GHz is 100% stable (Linx w/ 8 threads for 24 hours). That block with regular water is better than my old Apogee with ice water, no kidding. 65C full load at 4.2GHz, 1.3v. 75C full load at 4.4GHz, 1.38v. I'll give ice water a shot at some point, I really want to get a 4.6GHz run done. It would be nice if you could include some batch benchmarking.
For example you could set 3 loops then specify a range from X to Y. This way I could run 3 loops of each & save the fastest time and test times of 1m, 2m, 4m, 8m, 16m, etc digits as well as the 25, 50, 100, etc. Also outputting the fastest result to a file would be nice as not to need to copy/paste so much text. What do you think?
I can do them, but my rig is tied up for a few more days. Looks like spdy beat me to it. :D
I completely agree with you. I just need to find the time to polish up my bulk compute add-on and release it.
3 runs of each - Good idea. I'll probably set that as a default with an option to override it. And I'll add a size-limit to looped runs - say 10 min. Otherwise those massive single-threaded 10 and 12b runs on my workstation will take days. :rolleyes:
I can have it output the benchmarks to a separate text file.
Something like 3 categories:
Standard Sizes: 25m, 100m, 250m, etc... all validated - print the best times (with its validation) into a text file.
SuperPi Sizes: 1M, 2M, 4M, etc... all validated, same as above
Multi-core Scaling: 1m, 1.2m, 1.5m, 2m, 2.5m, etc*...
- Manually select threading mode
- No validation
*These are the sizes I used to generate those fancy multi-core scaling graphs.
I'd love to see a multi-core scaling graph from a pair of Gainestowns... :D:D:D But I honestly doubt anyone will be patient enough to sit through single-threaded runs of 1b+.:shrug: For me, I just let it run while I'm at work, run overnight... :rolleyes:
I also need a way to enforce processor affinity. I can't manually force it because I wouldn't know which cores are real and which are virtual from HT.
As for that... Time for some insaneness....
Results from Japan: http://ja0hxv.calico.jp/pai/pietc.html
Google translate it if you can't read Japanese. (I can't either...)
2 x Intel Xeon W5580 Gainestown @ 3.2 GHz
72 GB (18 x 4 GB) DDR3
Windows Server 2008
25m - 6.92
50m - 13.31
100m - 28.14
250m - 76.34
500m - 166.07
1b - 365.20
2.5b - 1,025.05
5b - 2,307.18
10b - 4,961 (1 hour, 22 min, 41 secs)
25b - 19,415 (5 hours, 23 min, 35 secs) - Done using Swap Mode*
1M - 0.37
2M - 0.67
4M - 1.21
8M - 2.31
16M - 4.47
32M - 8.75
64M - 18.02
128M - 38.18
256M - 82.63
512M - 185.41
1G - 398.09
2G - 868.54
4G - 1,928.29
8G - 4,235 (1 hour, 10 min, 35 secs)
16G - 11,892 (3 hours, 18 min, 12 secs) - Done using Swap Mode*
32G - 31,061 (8 hours, 37 min, 41 secs) - Done using Swap Mode*
One thing I have to say... This guy is NUTs...
He gets new workstations like this about once every half a year.
The last few he had are:
2 x Intel Xeon X5470
128 GB (16 x 8 GB) DDR2 FB-DIMM
2 x Intel Xeon X5460
64 GB (16 x 4 GB) DDR2 FB-DIMM
Not only that... He ACTUALLY ran this program for 8+ hours just for a benchmark. That's a pretty good stress test... :rofl::rofl::rofl:
I've done longer runs than that (200+ hours), but that's because they were either tests, or were for size records. Not benchmarks... :shakes:
*Swap Mode requires less memory but is significantly slower.
There's no validation for it, and it's available under the Custom Compute option.
Lastly... Dave, if you're here, you've got some SERIOUS competition.
This guy knows how to tune these things... enough to make his W5580s faster than your W5590s.
i was just on google trends and japan is the #1 country to search core i7.
i7 @ 3.2GHz, 3.6GHz Uncore, Memory @ 1600 7-7-6-16.
Single:
Code:
Benchmark Successful. The digits appear to be OK.
Program Version: 0.4.1 Build 7408 (x64 SSE3)
Processor(s): Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
CPU Frequency: 3,192,005,951 Hz (frequency may be inaccurate)
Thread(s): 1
Digits: 25,000,000
Total Time: 44.5555 seconds
Checksum: 506bd9db81dfe73a07ae66fb5da8af7e
Multi (with HT):
Code:
Benchmark Successful. The digits appear to be OK.
Program Version: 0.4.1 Build 7408 (x64 SSE3)
Processor(s): Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
CPU Frequency: 3,192,005,119 Hz (frequency may be inaccurate)
Thread(s): 8
Digits: 25,000,000
Total Time: 11.4947 seconds
Checksum: d2ec2f25569fffbd04422301296a783b
lolz... :rofl::rofl::rofl:
They always have the newest gadgets for just about everything except for maybe processors... since both Intel and AMD are US-based...
Nice... At some point, I'm gonna need to make a database on my site. But I don't have the time for all that... argh...
And you're hitting 4.6 on plain water? That's just insane... :eek: Because 5GHz is already LN2 territory. Those benches will be interesting and hard to beat. :D
Today, I got a nice look at a 96-core 16 x Dunnington machine with 512 GB ram at a fair... Too bad it was too busy for me to try any benches... :(:(:(:(:(
Not bad considering I'm running SETI atm as well :D The reported CPU frequency is wrong, it's 3.6GHz. I'll be back with some proper results when my PC9200 turns up.
http://img.photobucket.com/albums/v1...pboard02-5.jpg
I got some new fans. :)
A pair of these:
http://www.newegg.com/Product/Produc...82E16835213009
(I didn't get them from newegg though.)
Speed controlled, they are just as quiet as my old ones with slightly more airflow. I run it at this speed normally...
At full power, I can't hear myself talk... :rofl::rofl::rofl:
Now I can safely hit 4.2GHz on air - with more room to spare. :)
This was a stress test more than a benchmark. I intentionally left RealTemp and CPUz on to monitor it.
http://www.numberworld.org/y-crunche...2009_small.jpg
The temps peaked at 84C. They hit 90C when I benched 4GHz with my old fans.
Greetings, poke. I like your benchmark program--it's quite nice. Have you by chance experienced an issue where it doesn't seem to hit all cores very effectively? I've got a 12-core machine where it seems to stay in the 40-60% CPU range for smaller benchmarks (under 32M) and 60-80% for larger ones. It never actually "pegs" so to speak.
Yes, it's a fundamental issue with this type of task. Hence why it's taken a while...
Pi - by its very nature - doesn't parallelize as well as wprime, or any other "artificially made" task.
Why your cores aren't kept busy 100% of the time can be due to several reasons:
- Load imbalance. Most types of scientific computing like this don't split evenly (or at least it's not easy to do so). So some threads will finish before others. When this happens, the threads that are done need to wait for the rest.
- Not every part of the computation is parallelized. Fast operations like additions and subtractions are limited by memory bandwidth, so they will not benefit from multi-threading.
- Thread creation and destruction have a lot of overhead. When the working size for a particular operation is small enough, the overhead of thread creation becomes greater than the benefit of threading. At that point, the program doesn't parallelize it - hence less than 100% cpu.
- Refresh rate of Task Manager. Task Manager and other monitors average cpu usage over a period of time. If the computation is small, there won't be any period of sustained 100% cpu long enough to average out to 100%.
The larger the computation, the smaller the effect of these inefficiencies, and the higher the cpu usage.
With 12 cores, you're probably gonna need to go above 1 billion digits to get cpu usage averaging > 90%.
You WILL need to go up to several billion digits to achieve sustained 100% cpu that can last a few minutes. Most people don't have that kind of ram so it isn't suitable as a stress-test unless you run multiple instances.
CPU usage can be improved if I allow multi-threading to increase memory usage... But that gets prohibitive after a while. This type of computing is already enough of a memory hog as it is. So I prefer the ability to hit larger sizes.
So in some sense, computing Pi is a benchmark that "more closely" resembles real-life scientific computing.
EDIT:
So yes. Have fun with it. :) Tell all those Pi fanatics... they'll need to move away from those C2D's to stay competitive. :rofl::rofl::rofl: jk
Are these guys with 16-thread Xeon systems not hitting 100% CPU usage either?
Is it possible to specify a manual thread count? Inspired by HT people, I'd like to try a run at 24 threads to see if things stay busier.
I definitely wouldn't expect them to... But CPU usage as a whole can be fairly deceiving. Even though it "doesn't look" to be efficient, you're still getting massive speed up.
Here's the 8-core @ 1b screenie on my website:
I consider this a "good" graph. Mostly @ 100% but with dips every 10 - 20 seconds...
http://www.numberworld.org/y-crunche.../cpu_usage.jpg
In most cases it won't be as efficient as this. :(
The only time where I've gotten near sustained 100% cpu is during one of the world size-records I set back in April:
Same computer: 8 cores @ 31 billion digits of a different constant
(click to enlarge)
http://www.numberworld.org/nagisa_ru...2009_small.jpg
This kind of efficiency... is only achievable if you have either a REALLY SLOW computer :rolleyes:, or if you have a completely stupid amount of ram...:rofl:
Good thinking :clap::
The program actually already does that.:D When you run N threads, it will usually run 2N and occasionally 4N threads.
In any case, if you have a non-power-of-2 core count, it rounds up. So on your rig, the program is running in 16-core mode, which uses anywhere from 16 - 64 threads. (There's an option in Task Manager that shows how many threads a process is using.)
You can manually set your settings in the "Custom Compute a Constant" option. But there's no validation.
Just remember that higher % cpu usage doesn't always mean faster time.
Oh wow...I see you're creating and destroying threads during the computation cycle itself. In my own programming I've found it to be a good idea to create x number of threads and then use them all to process pieces of work dispatched from a synchronous controller. That may or may not be practical or applicable to your particular algorithm of course--I won't pretend to be familiar with your project. :) In any case, that does make sense now. I saw as few as 8 threads and as many as 60-some.
I've thought about using thread pools, but I decided against it for a few reasons:
- I couldn't figure out how to use that API. :ROTF:
- The program was written for extremely large computations (for breaking size-records). And on large computations, threading overhead is negligible.
- My intuition told me that a synchronous work-dispatcher might have problems scaling into "many" cores... by many, I mean tens or hundreds...
(Specifically, it would take linear time to dispatch N loads of work for N cores, whereas recursive thread-creation would take only log(N), provided that the memory allocator was efficient.)
- Lastly, I was just plain lazy... :rofl:
If only there were Linux binaries...
If I had more experience with Linux, there would be binaries for it... :(
"Eventually", I'll have Linux binaries... whenever I get the time... :rolleyes:
The entire program has been written to be easily ported to Linux... So I don't expect it to be too hard to do so when the time comes.
Have you tried running it under Wine?
If you compiled for Linux and the Cell, I could run this thing on my PS3. :D Actually, the Cell has been beaten in flops by x86 by now though, and it's a PITA to work with.
I can't "just" compile it. :(
I would need to learn how to use the linux threading libraries first.
To compile for Linux without any code modification, I'd have to disable multi-threading... :mad: which pretty much defeats the purpose of the program.
Judging by the examples I found on how the linux "pthread" library is used, it should be a simple drop-in replacement for the Windows threading library...
But I don't have a machine with linux to try it, nor do I have the time.
As for cell processors... It's a different type of processor so much of the program would have to be rewritten and re-optimized to be efficient.
Have you looked at the boost library?
Thanks for the input.:cool: No, I haven't heard about it.
Seems interesting and it appears to be a "semi-standard" that VS supports...
I didn't know VS actually supported a thread-library other than itself. :eek:
If VS supported pthreads, I would've used pthreads from the start since it's more portable... but no... it just HAS to force you to use the Windows one... :down::down::down:
The thing I don't like about Boost right now is that it's C++. The source code for the program is 99% C - and I kinda want to keep it that way.
It also doesn't look like it can be a drop-in replacement for the WinAPI because of all that object-management stuff... :mad:
Whereas WinAPI and pthreads have almost the same usage format.
Also, v0.4.2 should be out in a week or so...:D I added a batch mode as requested by a number of people. Right now, I'm still testing it...
The only thing is that the batch benchmarks aren't validated. It'd be kind of a mess if I generated a checksum for every single benchmark.
I might add validation for them in the future. But as of right now, I'll leave it out.
Got a test cpu, mobo & ram to goof around with. When I get bored with it I'll throw in a Q9550 or similar and give it to the wife. :)
Pentium 4 661 in Windows XP 64 bit running the SSE3 x64 executable @ 4.7GHz with and without HT:
25m no HT
Code:
Benchmark Successful. The digits appear to be OK.
Program Version: 0.4.1 Build 7412 (fix 1) (x64 SSE3)
Processor(s): Intel(R) Pentium(R) 4 CPU 3.60GHz
CPU Frequency: 4,717,549,367 Hz (frequency may be inaccurate)
Thread(s): 1
Digits: 25,000,000
Total Time: 73.1281 seconds
Checksum: 65cb672d996bc3240db0fda7909acc3b
25m+HT
Code:
Benchmark Successful. The digits appear to be OK.
Program Version: 0.4.1 Build 7412 (fix 1) (x64 SSE3)
Processor(s): Intel(R) Pentium(R) 4 CPU 3.60GHz
CPU Frequency: 4,717,524,888 Hz (frequency may be inaccurate)
Thread(s): 2
Digits: 25,000,000
Total Time: 65.6558 seconds
Checksum: 085dd47af97911d30c9fa6d03a9ce1e0
50m no HT
Code:
Benchmark Successful. The digits appear to be OK.
Program Version: 0.4.1 Build 7412 (fix 1) (x64 SSE3)
Processor(s): Intel(R) Pentium(R) 4 CPU 3.60GHz
CPU Frequency: 4,717,561,175 Hz (frequency may be inaccurate)
Thread(s): 1
Digits: 50,000,000
Total Time: 168.288 seconds
Checksum: cd31559715e07e3619136cc16cd2cf2b
50m+HT
Code:
Benchmark Successful. The digits appear to be OK.
Program Version: 0.4.1 Build 7412 (fix 1) (x64 SSE3)
Processor(s): Intel(R) Pentium(R) 4 CPU 3.60GHz
CPU Frequency: 4,717,547,001 Hz (frequency may be inaccurate)
Thread(s): 2
Digits: 50,000,000
Total Time: 153.715 seconds
Checksum: 7b0843fca01a0b691b42d7d33fe11537
100m
Code:
Benchmark Successful. The digits appear to be OK.
Program Version: 0.4.1 Build 7412 (fix 1) (x64 SSE3)
Processor(s): Intel(R) Pentium(R) 4 CPU 3.60GHz
CPU Frequency: 4,717,545,210 Hz (frequency may be inaccurate)
Thread(s): 1
Digits: 100,000,000
Total Time: 377.512 seconds
Checksum: ae6656d43ce984db08c62a84afc1ec02
100m+HT
Code:
Benchmark Successful. The digits appear to be OK.
Program Version: 0.4.1 Build 7412 (fix 1) (x64 SSE3)
Processor(s): Intel(R) Pentium(R) 4 CPU 3.60GHz
CPU Frequency: 4,717,527,780 Hz (frequency may be inaccurate)
Thread(s): 2
Digits: 100,000,000
Total Time: 343.165 seconds
Checksum: 1f7061c8731d2da31a774b0d16de8145
Yep, even at 4.7GHz a P4 is still garbage. :yepp:
Ooh...
64-bit P4... :clap:
Wouldn't shoving a C2Q into an old 775 require more than a BIOS update?
I thought about putting a Q6600 into my Pentium D machine... but then I checked that the chipset doesn't support C2Q even with a bios update. :(
Here's a screenie of v0.4.2:
It should be out in a couple days. It isn't quite ready for release right now - there's (yet another) bug in the VS compiler that I'm wrestling with... :down:
http://www.numberworld.org/y-crunche...0.4.2_peak.jpg
The text file can be copy and pasted into excel. :)
I've found 3 bugs in the VS compiler in the last 8 months... :yepp: 2 of them related to high memory usage in x64... I think it's safe to say that not too many people use VS for high-memory programming.
I was going to build her a quad setup anyway, so I got the DFI DK P35-T2RS which supports the duo/quad chips. I saw a cheapo P4 661 so I figured I'd get it and test how far it'll go and how it compares with my old Opteron @ 3GHz. So far, even at 4.7GHz and with some low timings on some Crucial BallistiX Tracer memory, it isn't looking good vs the Opty. :D
Good news about the release of v0.4.2, I'm glad to see the batch mode added. :) Have you tried testing the Intel compiler? I had some decent luck with it even on AMD machines. The profile-guided optimization can really give some programs a boost. I have a feeling most of your cpu time is spent in hand-optimized assembly so PGO likely wouldn't be of any benefit, but I figured I'd mention it just in case you haven't messed with it yet.
Originally Posted by poke349
Actually... There is no assembly whatsoever in the entire program. :p:
As an undergraduate, my programming knowledge is too narrow to write good assembly. So that right there is a major potential for improvement.
Although I haven't run a profiler on it yet, my guess is that it spends most of its time pipeline-stalling and waiting for memory (cache misses, or simply insufficient memory bandwidth...)
Specifically, I have a feeling that unprefetched cache misses are significant, but I haven't played with my memory timings nor have I run a profiler to determine their impact.
Version v0.4.2 is out!
Here's a screenie of that stress tester...
Temperature-wise, it's on par with Prime95.
But I don't know if it can detect stability issues as well as Prime95 - let alone Linpack.
http://www.numberworld.org/y-crunche...test_small.jpg
hey poke, there is a 16-core quad-socket system in the WCG section by jcool and I think it would pwn on this benchmark. It's only got 4 gigs right now but it supports a junkload of memory. I wonder how much RAM this would use idling in Windows?
I think he already knows.
But the fact that there's so little RAM is gonna be severely limiting... With THAT many cores, you're gonna need a massive computation size to keep them busy. 4 GB won't be enough to keep those cores fed... But... it might be enough to break some of the current speed records anyway. :D
25,000,000 digits:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Intel(R) Xeon(R) CPU W3520 @ 2.67GHz
CPU Frequency: 2660015200
Thread(s): 2^3
Digits: 25000000
Total Time: 12.924
Checksum: c90910e6b1387d740fe4352132ee4855
2,500,000,000 digits:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Intel(R) Xeon(R) CPU W3520 @ 2.67GHz
CPU Frequency: 2660034352
Thread(s): 2^3
Digits: 2500000000
Total Time: 2404.996
Checksum: c2abaca2a09340b91922847dd0ffc278
Here is my Dual Shanghai System.
Code:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 2900166909
Thread(s): 2^3
Digits: 25000000
Total Time: 11.406
Checksum: e195c35a015ca7c4079e7e2c4c727037
Code:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 2900164958
Thread(s): 2^3
Digits: 50000000
Total Time: 23.549
Checksum: 8d2ea0a8ccbda4511edad184dd0c405f
Code:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 2900166047
Thread(s): 2^3
Digits: 100000000
Total Time: 49.112
Checksum: 77fec20038094164021df20c23143c6c
Code:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 2900172767
Thread(s): 2^3
Digits: 250000000
Total Time: 139.788
Checksum: e0befcb870941172171bfc300fa992bd
Edit: I will run more over the next couple of days. I have 8GB of RAM currently, and when I get to 16GB I will run the larger ones.
Code:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 2900162991
Thread(s): 2^3
Digits: 500000000
Total Time: 301.514
Checksum: cf5090c8ca14560219b83cc1de7f90e2
@ SamHughe
Nice 2.5b run. Haven't been seeing many of those. :yepp::D;)
Did you pretty much have to close everything to get it to fit into 12GB?
Or did you just let it page out?
Woah... the first AMD dualie. :cool:
Quick question: Is that running at 2.9 GHz as the program says or is it @ 3.3 GHz (in your siggy)?
Since it's the first time I've seen this program run on an AMD dualie, would you be able to do benchmarks for:
1,000,000
10,000,000
1,000,000,000
I'd love to add it to the comparison chart that's here: :D
http://www.numberworld.org/y-cruncher/#Benchmarks
The 1m and 10m aren't in the benchmark options, so you'll need to use either the batch-mode or the custom compute option.
As above, here is my dual Shanghai 2389 @ 3.33GHz.
Code:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 3336182273
Thread(s): 2^3
Digits: 25000000
Total Time: 9.720
Checksum: 409d8961151453331ebd8fd5b49ffab9
Code:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 3336181988
Thread(s): 2^3
Digits: 50000000
Total Time: 20.084
Checksum: f18969bf2caaadc01bd3a3a05bdca6c0
Code:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 3336180534
Thread(s): 2^3
Digits: 100000000
Total Time: 42.306
Checksum: 9af038030bc9e0370ee84f1c8ad8babf
Quote:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 3336181279
Thread(s): 2^3
Digits: 250000000
Total Time: 113.583
Checksum: c7a217871dbd8aac1fd1a0a86c0c6b23
Quote:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 3336180470
Thread(s): 2^3
Digits: 500000000
Total Time: 249.244
Checksum: 330ec95f5a05c4971e373ab94be5664d
Quote:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 3336183313
Thread(s): 2^3
Digits: 1000000000
Total Time: 552.905
Checksum: 52c37453ac4eb2ba0abec8f7913ea03b
Code:
1,000,000 Digits
Writing Decimal Digits: 1,000,001 digits written
Total Computation Time: 0.617 seconds ( 0.000 hours )
Total Time (including writing digits): 0.709 seconds ( 0.000 hours )
Code:
10,000,000 Digits
Writing Decimal Digits: 10,000,001 digits written
Total Computation Time: 4.288 seconds ( 0.001 hours )
Total Time (including writing digits): 4.659 seconds ( 0.001 hours )
Here's mine, can do more later. This is 32m.
4.51ghz, 3.87ghz Uncore. 32m 11.297s
http://i25.tinypic.com/2a5dt2p.png
Woah, nice. New single socket record! :D
Was turbo-throttling in effect during the run? Your temps are hitting 88C, but 75-80C is usually the point where turbo throttling starts.
Are you on air?
As for all the suggestions about the Intel Compiler.
I've tried it, and the results are interesting.
Initially, after messing with the compiler options, it couldn't do any better than 5% slower than the current build (with the Visual Studio Compiler).
After tweaking the source code a bit, I've gotten it to about 1% faster than Visual Studio. :D:D:D And I don't think I'm done yet. :yepp:
I'll need to finish my tweaks (and hopefully do better than 1%) and do a ton of other tests on my workstation to make sure it's consistently faster.
On the other hand... The binary (when compiled with ICC) is HUGE... :shocked:
I think I ran Prime earlier, that's why it said it was so hot. But my chip is a high-leakage chip, so it runs hot. My chip starts throttling at 100C. Oh, and this is on water too. I'll try 4.6GHz later, but my brand new DFI X58 UT just died, went pop!
You killed your DFI? :( Sorry to hear that... RMA?
The "main" throttling occurs at 100C, but Turbo mode gets disabled when you hit 80C or so. So the multiplier drops from 21 to 20.
It will actually bounce back and forth between 20 and 21 many times per second. Once you get high enough (85+ C), then it pretty much stays at 20.
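If it helps to visualize what that bouncing does to the average clock, here's a tiny sketch. The 133.33 MHz base clock is the stock i7 BCLK; the duty-cycle fraction is purely a made-up input for illustration:

```python
# Time-weighted average clock when the CPU bounces between the 21x
# turbo multiplier and the 20x non-turbo multiplier. BCLK is the stock
# i7 value; the fraction of time spent at 21x is hypothetical.
BCLK_MHZ = 133.33

def effective_mhz(frac_at_21x):
    """Average frequency for a given turbo duty cycle (0.0 to 1.0)."""
    return BCLK_MHZ * (21 * frac_at_21x + 20 * (1 - frac_at_21x))
```

So a chip that spends half its time knocked down to 20x only loses about 67 MHz on average, which is why turbo-bouncing barely shows up in benchmark times.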
Core i7 920 @ 3.927GHz
SuperPi 32M - 9m31.187s
y-cruncher:
250M - 127.051s
500M - 280.013s
1 Billion - 616.635s
2.5 Billion - 4129.34s
http://i431.photobucket.com/albums/q...rsuperPi-1.png
It was thrashing like crazy for 2.5b. I wasn't running anything else in the background other than my normal apps. I expected 2.5b to be hard on my system though.
Interesting. At least on my machine, even when I have more than 500MB of stuff open, Windows will page it out enough to let the program have the full 11.5 GB for the entire run.
Win7 change in memory manager?
Anyone want to try and confirm this?
thx :)
Did you verify the 350m? Only the benchmarks and two of the batch-mode options verify that the digits are correct. The "Custom Compute" option does NOT, since there are no checksums for them.
Just because it finishes without warnings doesn't mean it finished without any errors.
The stress test, on the other hand, DOES verify that everything is correct.
The way it works is that it cycles through all the major constants that the program can compute. For each constant, it runs two computations using different algorithms. Then it matches them to see if they are correct.
It uses two threads that run completely independently of each other. Because of "dips" in the CPU usage for each computation, one isn't quite enough to keep everything busy. But two does pretty well - at least up through 8 cores. ;)
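The compute-twice-and-compare idea is easy to sketch. This is not y-cruncher's actual code (it uses far faster algorithms and its own arithmetic); it's just a toy Python illustration that computes pi with Machin's formula and with the Gauss-Legendre (AGM) iteration, then checks that the truncated digit strings agree:

```python
from decimal import Decimal, getcontext

def arctan_inv(x, digits):
    """arctan(1/x) by its Taylor series, good to ~digits decimal places."""
    eps = Decimal(10) ** -(digits + 5)
    x2 = Decimal(x) * x
    term = Decimal(1) / x
    total = term
    n, sign = 3, -1
    while abs(term) > eps:
        term /= x2
        total += sign * term / n
        n += 2
        sign = -sign
    return total

def pi_machin(digits):
    """Machin's formula: pi = 16*arctan(1/5) - 4*arctan(1/239)."""
    getcontext().prec = digits + 10          # guard digits
    pi = 16 * arctan_inv(5, digits) - 4 * arctan_inv(239, digits)
    return str(pi)[:digits]

def pi_agm(digits):
    """Gauss-Legendre (AGM) iteration; digit count doubles per step."""
    getcontext().prec = digits + 10
    a, b = Decimal(1), 1 / Decimal(2).sqrt()
    t, p = Decimal(1) / 4, 1
    for _ in range(digits.bit_length() + 2):
        a, b, t, p = (a + b) / 2, (a * b).sqrt(), t - p * ((a - b) / 2) ** 2, 2 * p
    return str((a + b) ** 2 / (4 * t))[:digits]
```

If a hardware error flips a bit during either computation, the two digit strings will almost certainly disagree, which is exactly the property the stress test exploits.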
http://img297.imageshack.us/img297/9...uncher500m.jpg
lame result but still...
nah... That looks about right for your processor/frequency. :cool:
Something interesting here:
http://www.itocp.com/bbs/thread-35879-1-1.html
The guy did a string table hack on the binaries. It'll print correctly only if you have your regional settings set to read ascii in some specially encoded way that works for Chinese characters.
And it isn't complete. Some of the other "secondary features" aren't translated at all.
Which gives me an idea. For the next release (v0.4.3), how about I link to an external .ini settings file that has the full string table in Unicode?
That way it can be easily translated to any language by simply editing the .ini string table. :yepp::yepp::yepp:
It's kinda drastic. But, I plan on implementing a very aggressive anti-tamper protection into v0.4.3 that will pretty much block any changes to the binary.
Which would make translation (via string-table hack) impossible unless the protection itself is broken. So the only way to allow any sort of language support is to move the strings out of the binary. :D
Any other ideas?
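For what it's worth, the external string-table idea is simple to prototype. A hedged sketch in Python (the section name and keys are made up for illustration; they are not y-cruncher's real string IDs):

```python
import configparser

# Built-in English strings used as the fallback. Keys are invented
# for illustration only.
DEFAULT_STRINGS = {
    "menu_benchmark": "Benchmark Pi",
    "menu_stress": "Stress Test",
}

def load_strings(path):
    """Load a translated [strings] table from a UTF-8 .ini file.

    Any key missing from the file falls back to the built-in English
    text, so a partial translation still produces a usable UI.
    """
    strings = dict(DEFAULT_STRINGS)
    parser = configparser.ConfigParser()
    if parser.read(path, encoding="utf-8") and parser.has_section("strings"):
        for key in DEFAULT_STRINGS:
            if parser.has_option("strings", key):
                strings[key] = parser.get("strings", key)
    return strings
```

The nice property is that the binary itself never changes: translators only edit the .ini file, which sidesteps the anti-tamper protection entirely.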
Here's my score with Q8400 @ 3.6GHz, 485FSB. RAM @ 1165MHz
http://img.photobucket.com/albums/v1...pboard02-7.jpg
Little bit faster 4.6ghz, 32m
http://i30.tinypic.com/o7iv55.png
and 25k
http://i32.tinypic.com/4sm9gm.png
and 50k
http://i28.tinypic.com/sp7uw4.png
dude your records are on wikipedia.:clap:
Mine are? Link me :D I know I can go faster
You probably set a single-socket speed record, but poke set the record for most digits calculated for several constants. I couldn't find some of the other ones, but they need to be updated.
http://en.wikipedia.org/wiki/Euler-Mascheroni_constant
http://en.wikipedia.org/wiki/Catalan's_constant
looks like kondo beat you on e!
http://en.wikipedia.org/wiki/E_(mathematical_constant)
Where do you see the times at? BTW cool links, wish I was smart enough to understand math like that!
If anyone is still wondering what all that ram was for... ;)
Here's the one you missed:
http://en.wikipedia.org/wiki/Ap%C3%A9ry%27s_constant
There aren't any pages for the two Natural Logs.
As for e... well... :(
Here's my excuse: :rofl::rofl::rofl:
Using the Basic Swap Mode, all computations of N digits require a little more than 2N memory.
So with 64GB of ram, the largest computation I can do is 31 billion digits.
So until I write an Advanced Swap Mode (the thing that PiFast and QuickPi have), I'm stuck at 31 billion digits with my "puny" 64 GB of ram. :ROTF::(:rofl:
32GB would've been enough to break the records. But at the time we built the rig (September 2008), the memory requirement was expected to be 4N (I hadn't implemented it yet).
So 64GB was needed to break 10 billion. (which was what the records were at the time.)
Then in December, I realized a 2N algorithm - which effectively allowed us to hit 31 billion.
Any higher, and multiplication will no longer fit in ram. So it will require a completely different scheme for operating on HDs.
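The arithmetic behind that ceiling is straightforward. A rough sketch (the 3% overhead factor here is a guess; the real constant depends on the program's internals, which is why the actual limit lands near 31 billion rather than the naive 33):

```python
def max_digits(ram_bytes, bytes_per_digit, overhead=1.03):
    """Rough ceiling on computation size: N digits need a little more
    than bytes_per_digit * N bytes of memory. The overhead is a guess."""
    return int(ram_bytes / (bytes_per_digit * overhead))

ram = 64 * 1024**3            # 64 GB
big = max_digits(ram, 2)      # the ~2N requirement: roughly 33 billion here
old = max_digits(ram, 4)      # the original 4N plan would have halved that
```

The same arithmetic explains the September 2008 sizing decision: under the 4N assumption, 32GB would have capped out well short of the 10-billion-digit records of the time, so 64GB was the safe choice.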
I've updated the "Fastest Times" section on my site with your new numbers... :up::up::up: Insane 4.6 GHz...
Woah :cool:
New single-socket records again!!! :D
EDIT: Is it safe to assume that's on water?
Single-socket is closing in on Dave's Gainestowns for 25m...:rolleyes:
Just curious, how stable is that OC? I noticed that turbo is disabled.
And if anyone has screenies of a failed benchmark. I'm curious to see them. :p:
Whether it fails because the digits don't match at the end, or if it catches and corrects an error....
Be kind of interesting to see what the most common non-crash/BSOD failure is. :):D;)
Yep water, pic is in sig. Turbo on GB board is just 24 multi all the time, 23 multi OC's better, so turbo is off.
4.6ghz is stable enough to run prime blend 20 mins without crashing, some with bigger cojones primed 4.6ghz for 8hrs posted in the i7 database intel thread.
4.65 stable to run chess computer stress, posted there.
4.69, ran 25 and 50 several times, never crashed. 100 I tried to run once and it crashed (BSOD) about 3/4 of the way through. Did not try upping vcore more, probably won't until winter when I can set the rad in cooler temps. It is between 4.6 and 4.7 where stability markedly changes.
hmmm looks like I'll have to try 4.75ghz now hehe. Kinda board limited though but haven't really pushed too hard
did batch run on rig 1 and rig 2 on winxpx64
hope this is valid enough
rig2
http://img10.imageshack.us/img10/1484/79440791.jpg
rig1
http://img186.imageshack.us/img186/1314/x41.jpg
Downstairs with ~18-19C ambients I got 100 to run, and a little higher: 4.76GHz, 1.55 vcore.
http://img44.imageshack.us/img44/3193/ycrunch476125.jpg
http://img154.imageshack.us/img154/6...unch476150.jpg
http://img156.imageshack.us/img156/3...nch4761100.jpg
Run with my Core i7 920 @ 4210MHz
http://img30.imageshack.us/img30/2724/96969833.th.jpg
:up:
Woah... :eek: Is that voltage for real? Is that one of those 3845/3849 batches?
With 1.15v I can only get up to 3.9 GHz... and only marginally stable.
I need 1.225 to bench @ 4.2, 1.25 to "barely" pass 2G Pi, and 1.275 to get prime/linpack stable.
And just wondering, did you play with memory timings? I'm curious to see how sensitive the program is to timings.
Ah... Something like different timings for 25m and something larger like 250m. And maybe with and without HT. But I shouldn't be asking for anything.
Threading overhead and randomness makes the small benchmarks somewhat unreliable as an indicator of program efficiency.
HT has an effect of "hiding" latencies because when one thread stalls for memory access, the other thread gets the whole core to itself...
I haven't gotten around to trying it myself because my rig is usually busy with stuff that I can't pause.
Obviously this program is very memory intensive. And I know for sure that Core 2 bandwidth is bottlenecking. But I've never been able to gauge the effect of latencies.
Timing sensitivity, I would guess, is a good indicator of whether the program will scale well on EX servers.
you could buy vtune for like 1000 dollars.:D
http://software.intel.com/en-us/intel-vtune/
Was about to do 4.7ghz+ but then my board died again! damn u dfi
http://i599.photobucket.com/albums/t...1Sep062144.jpg
thought i'd join in..