New Multi-Threaded Pi Program - Faster than SuperPi and PiFast

Printable View

Show 100 post(s) from this thread on one page

02-08-2010, 08:02 PM
Alpha

Quote:

Originally Posted by bonis62

difference of six seconds , not bad, my 32-bit code is good, but
your code is truly remarkable......... :cussing:

:up:

ya, he has me beat by about 100 seconds too. guess i need to work on my code...

can't wait to see your new versions!
02-09-2010, 02:22 AM
bonis62

Quote:

Originally Posted by poke349

Well... I compiled it for x64... :wasntme:

To me.... this smells like a compliment :D:D:D:D
02-09-2010, 06:46 AM
poke349

Quote:

Originally Posted by bonis62

To me.... this smells like a compliment :D:D:D:D

I recompiled it for x86 and got 17 seconds. So as expected, x64 is a bit faster.
02-09-2010, 09:02 AM
bonis62

Quote:

Originally Posted by poke349

I recompiled it for x86 and got 17 seconds. So as expected, x64 is a bit faster.

:clap:
Indeed,
the difference is small, (bit)
i often wonder whether it is worth making two versions (64 / 32)
for this applications type ,
since the difference is minimal... :shrug::confused:
02-09-2010, 09:30 AM
poke349

Quote:

Originally Posted by bonis62

:clap:
Indeed,
the difference is small, (bit)
i often wonder whether it is worth making two versions (64 / 32)
for this applications type ,
since the difference is minimal... :shrug::confused:

As of our current implementations, the difference is small.
But a more optimized version that stays in cache and uses hard-coded loop-unrolling will benefit a lot from the x64 registers.

x64 also lets you go over 2GB of ram.
02-09-2010, 12:24 PM
mattkosem

1 Attachment(s)

Here's what the N270 in my EEE 1000HE does (win 7 - completely stock).

--Matt
02-09-2010, 01:30 PM
Particle

1 Attachment(s)

I may not set any world records, but at least I'm on the board. Was a bit weird having 16 threads spawned for 12 physical cores.

2x AMD Opteron 2427 @ 2.2GHz
8GB ECC/Reg DDR2-800
Windows Vista SP2 x64

Awaiting the new swap mode. :)
02-09-2010, 02:05 PM
poke349

Quote:

Originally Posted by mattkosem

Here's what the N270 in my EEE 1000HE does (win 7 - completely stock).

--Matt

Nice, another Atom to add to the list. :D

Quote:

Originally Posted by Particle

I may not set any world records, but at least I'm on the board. Was a bit weird having 16 threads spawned for 12 physical cores.

2x AMD Opteron 2427 @ 2.2GHz
8GB ECC/Reg DDR2-800
Windows Vista SP2 x64

Awaiting the new swap mode. :)

Nice. That's the first 12-core that I can add to the list. :D

If you haven't taken a look yet:
I've added 3 entries of the new swap mode to the table.
Each of them showing a computation that is MUCH too large than would fit in ram.
And if it isn't obvious enough already, disk bandwidth is pretty much the only thing that matters.

Even on my workstation with 4 x 1TB (~400 - 500 MB/s), it is still highly bottlenecked by disk.

So if you have those velociraptors ready by then...
Also, if bandwidth doesn't scale linearly in raid 0, than you'll wanna undo the raid. The program is able to manage multiple HDs separately and get perfect linear scaling.

If it goes well enough upon release, I might start a new thread for it:

"Pi-based Hard Drive Benchmark for the EXTREMELY Patient"

:rofl::rofl::rofl:

EDIT:
About the 16 threads.
There's a number of algorithms that simply don't work with non-powers of two, so to get around it, I simply round up and let the scheduler take care of it.
02-09-2010, 02:49 PM
Particle

Is it more latency sensitive or bandwidth sensitive? The hard drives I've got aren't velociraptors--they're Fujitsu MBA3147RC 147GB 15K SAS drives. They do about 115MB/s each but offer an impressive 5.2ms (in real life) average random access time. I'll have eight of them at the end of the week. Would it be possible to help you test a beta? :D
02-09-2010, 03:02 PM
poke349

Quote:

Originally Posted by Particle

Is it more latency sensitive or bandwidth sensitive? The hard drives I've got aren't velociraptors--they're Fujitsu MBA3147RC 147GB 15K SAS drives. They do about 115MB/s each but offer an impressive 5.2ms (in real life) average random access time. I'll have eight of them at the end of the week. Would it be possible to help you test a beta? :D

Bandwidth sensitive. I've pretty much optimized out all the extremely non-sequential stuff...

With 8GB of ram, latency isn't gonna matter until you push above 150 - 300 billion digits.
(With 12GB of ram, I start getting latency issues at about 300 - 600 billion digits.
On my workstation with 64GB, 4TB of disk space it isn't enough for me reach the sizes that are large enough to hit latency slowdowns...)

EDIT:
It scales quadratically. Doubling your memory will quadruple this limit...
Should someone be crazy enough to push higher, the program may automatically do an algorithm switch that trades latency for bandwidth. (So less sensitive to latency, but uses more bandwidth.)
So it's likely that you'll never feel the latency unless you override the buffering settings to something that's completely messed up... (which the program lets you do)

Which at those sizes... I doubt anyone is gonna want to tie down their machines for too long.
(I would estimate a 100 billion digit computation to take at least 100 hours with a single Core i7 and 250 MB/s disk bandwidth.)

So for everyone else, this isn't gonna be a test for SSDs. With the sheer amount of writes it will do, it'll probably kill an SSD with a couple of runs... :rofl:

If everything goes well... I should have a working beta ready in less than 10 days.
02-09-2010, 08:51 PM
poke349

I've added some new details on the new version:
http://www.numberworld.org/y-crunche...n_history.html

And with that,
Here's the first two fully validated Advanced Swap computations with v0.5.2: (click to enlarge)

10 billion digits of Pi on my File Server - Pentium D @ 2.8 GHz + 3 GB DDR2 + 160 GB WD
x86 SSE3 - memory limit set to 1.84 GB.
http://www.numberworld.org/y-crunche...2010_small.jpg

10 billion digits of Pi on my Laptop - Core i7 720QM @ 1.6 GHz (stock) + 6 GB DDR3 + 500 GB Seagate
x64 SSE4.1 ~ Ushio - memory limit set to 3.00 GB.
http://www.numberworld.org/y-crunche...2010_small.jpg

Only the old benchmark mode will be able to verify if the digits are correct. (I obviously can't cache the last few digits of every computation size... lol)
For all other computations, (including these swap computations), the digits that it prints out will have to match the accepted values in order to complete the validation.
(the accepted values can be easily found online)

These two 10 billion digit computations took quite a long time (21 and 38 hours).
But that's because neither of the machines were "Xtreme" in any way.
In particular, my laptop's 500 HD is almost full so it only sustained 50 MB/s bandwidth.

I would expect that any "real" system (say a desktop C2Q/Ci7/PII X4) with a decent SATA II 7200 RPM drive will make 10 billion digits an "overnight job", or a "start in the morning, done before getting back from work" job...
02-10-2010, 02:07 AM
bonis62

Quote:

Originally Posted by bonis62

:clap:
Indeed,
the difference is small, (bit)
i often wonder whether it is worth making two versions (64 / 32)
for this applications type ,
since the difference is minimal... :shrug::confused:

I want to compile my code in 64 bit,
i want to know what compiler you use for 64-bit :)

ty

http://www.xstreme.it/p64.jpg

http://www.xstreme.it/primes64.zip

my first attempt...
02-10-2010, 10:06 AM
poke349

Quote:

Originally Posted by bonis62

I want to compile my code in 64 bit,
i want to know what compiler you use for 64-bit :)

ty

http://www.xstreme.it/p64.jpg

http://www.xstreme.it/primes64.zip

my first attempt...

I use Visual Studio for general coding and compiling because it compiles very quickly.
But all final versions that use SSE are compiled using the Intel Compiler. It optimizes better than Visual Studio.

Both support x64.

Now you can try scaling up the size to more than 4GB of ram.
Though you won't know that you're bug-free until you test sizes that are large enough to overflow all 32-bit indexing in your program.
Which is where a lot of ram becomes useful... An array of 32-bit integers won't overflow a 32-bit index until it's larger than 16GB. And for 64-bit double, you need 32GB of ram...

But that's just supersizing... way more than what's needed for most applications.
02-10-2010, 10:53 AM
bonis62

Quote:

Originally Posted by poke349

I use Visual Studio for general coding and compiling because it compiles very quickly.
But all final versions that use SSE are compiled using the Intel Compiler. It optimizes better than Visual Studio.

Both support x64.

Now you can try scaling up the size to more than 4GB of ram.
Though you won't know that you're bug-free until you test sizes that are large enough to overflow all 32-bit indexing in your program.
Which is where a lot of ram becomes useful... An array of 32-bit integers won't overflow a 32-bit index until it's larger than 16GB. And for 64-bit double, you need 32GB of ram...

But that's just supersizing... way more than what's needed for most applications.

T Y V M :up:

the FLOAT instead of DOUBLE saves bits and gains efficiency,
but i do not know if YOUR APP can work with FLOAT...
02-10-2010, 12:10 PM
poke349

Quote:

Originally Posted by bonis62

T Y V M :up:

the FLOAT instead of DOUBLE saves bits and gains efficiency,
but i do not know if YOUR APP can work with FLOAT...

I'm already pushing "double" to its limit of precision.
"float" has less than half the precision of "double" so that would require more than 4x the work.

Actually, because of the way the algorithm works, using type "float" would require MUCH more than 4x the work. It would actually fail above a certain (small) size.

The run-time complexity is this:
http://www.numberworld.org/y-cruncher/images/FFT.jpg
where "n" is the # of digits.
and "w" is the # of bits of precision in the floating-point.

When the denominator goes to zero, the run-time (and memory) blows up to infinity - in other words, the algorithm fails.

This is the reason why I can't use GPU. :(

If there was a 128-bit floating-point type that was supported by hardware, the program would actually be MUCH faster.

EDIT: That complexity is just a reasonable approximation to the true complexity.
The true complexity (ignoring round-off error), has special functions in it... so it's unreadable to normal people. (even myself)
02-10-2010, 01:40 PM
bonis62

Quote:

Originally Posted by poke349

I'm already pushing "double" to its limit of precision.
"float" has less than half the precision of "double" so that would require more than 4x the work.

Actually, because of the way the algorithm works, using type "float" would require MUCH more than 4x the work. It would actually fail above a certain (small) size.

The run-time complexity is this:
http://www.numberworld.org/y-cruncher/images/FFT.jpg
where "n" is the # of digits.
and "w" is the # of bits of precision in the floating-point.

When the denominator goes to zero, the run-time (and memory) blows up to infinity - in other words, the algorithm fails.

This is the reason why I can't use GPU. :(

If there was a 128-bit floating-point type that was supported by hardware, the program would actually be MUCH faster.

EDIT: That complexity is just a reasonable approximation to the true complexity.
The true complexity (ignoring round-off error), has special functions in it... so it's unreadable to normal people. (even myself)

yeah

if
n=100.5
w=64.0

o= n/(w-Log(n))Log(n/(w-Log(n))) = 1.6922085689143893

it's your formula ?

but if you make this :

o= n/(w-Log(n))Log(n/(w-Log(n))) = 1.6922085689143893
o2= sqrt(o) = 1.3008491722388069

you have a large control predictor...

this is only my theory to bypass error :eek:

read this :

http://en.wikipedia.org/wiki/Floating_point
02-10-2010, 03:33 PM
poke349

Quote:

Originally Posted by bonis62

yeah

if
n=100.5
w=64.0

o= n/(w-Log(n))Log(n/(w-Log(n))) = 1.6922085689143893

it's your formula ?

but if you make this :

o= n/(w-Log(n))Log(n/(w-Log(n))) = 1.6922085689143893
o2= sqrt(o) = 1.3008491722388069

you have a large control predictor...

this is only my theory to bypass error :eek:

read this :

http://en.wikipedia.org/wiki/Floating_point

It's just a Big-O... I didn't put any of the constants in there.

Basically:
With "double", w = 53. The algorithm becomes impractical when n goes above a billion. The asymptote is reached when n reaches ~25 trillion digits. (give or take a factor of 10 or so...)

With "float", w = 24. The algorithm becomes becomes impractical at just a few thousand digits. The asymptote is reached when n goes to a mere 60,000 digits... (again, give or take a factor of 10 or so...)

This is just one of several major algorithms in the program.
The program doesn't rely on it solely. So 25 trillion digits isn't the limit of the program.

EDIT:
Basically, when "w" is much larger than "log(n)", the complexity is roughly O(n log(n)). That's where the algorithm rules since it's quasi-linear.
So as long as you stay in that range, the algorithm remains efficient.
But as you push higher and higher, the "log(n)" starts creeping up. Eventually, it becomes inefficient. Then impractical...
And when n is large enough such that w = log(n), the algorithm fails completely.

This algorithm is called FFT multiplication.
Virtually all fast pi-programs use it, SuperPi, PiFast, QuickPi, etc...
All implementations of it stay in the "efficient" sizes where "log(n)" is much smaller than "w".
02-10-2010, 04:08 PM
bonis62

Quote:

Originally Posted by poke349

It's just a Big-O... I didn't put any of the constants in there.

Basically:
With "double", w = 53. The algorithm becomes impractical when n goes above a billion. The asymptote is reached when n reaches ~25 trillion digits. (give or take a factor of 10 or so...)

With "float", w = 24. The algorithm becomes becomes impractical at just a few thousand digits. The asymptote is reached when n goes to a mere 60,000 digits... (again, give or take a factor of 10 or so...)

This is just one of several major algorithms in the program.
The program doesn't rely on it solely. So 25 trillion digits isn't the limit of the program.

EDIT:
Basically, when "w" is much larger than "log(n)", the complexity is roughly O(n log(n)). That's where the algorithm rules since it's quasi-linear.
So as long as you stay in that range, the algorithm remains efficient.
But as you push higher and higher, the "log(n)" starts creeping up. Eventually, it becomes inefficient. Then impractical...
And when n is large enough such that w = log(n), the algorithm fails completely.

This algorithm is called FFT multiplication.
Virtually all fast pi-programs use it, SuperPi, PiFast, QuickPi, etc...
All implementations of it stay in the "efficient" sizes where "log(n)" is much smaller than "w".

last question :)

with the binary splitting method

"the impractical" resolves anything ?

http://numbers.computation.free.fr/C.../programs.html
02-10-2010, 04:14 PM
poke349

Quote:

Originally Posted by bonis62

last question :)

with the binary splitting method

"the impractical" resolves anything ?

http://numbers.computation.free.fr/C.../programs.html

I don't think I understand what you mean... :confused:
02-11-2010, 05:29 AM
bonis62

Quote:

Originally Posted by poke349

I don't think I understand what you mean... :confused:

see this

http://numbers.computation.free.fr/C...splitting.html

and this

http://algolist.ru/download.php?path...rc.zip&pspdf=1

is useful ?
02-11-2010, 06:08 AM
poke349

Quote:

Originally Posted by bonis62

see this

http://numbers.computation.free.fr/C...splitting.html

and this

http://algolist.ru/download.php?path...rc.zip&pspdf=1

is useful ?

And your point?

EDIT:
If you're wondering is any of that is gonna help. I've already been using most of those algorithms since the very beginning.
02-11-2010, 09:39 AM
bonis62

Quote:

Originally Posted by poke349

And your point?

EDIT:
If you're wondering is any of that is gonna help. I've already been using most of those algorithms since the very beginning.

I really wanted to know your point of view....:)
I am new to this type of algorithm,
i are documented on this subject,
i thought that theory to solve the problem of zero value,
with a NAN routine,
if the data is zero, replace with NaN,
but if you say that is impossible,
then it is just impossible
:)
02-11-2010, 09:58 AM
poke349

Quote:

Originally Posted by bonis62

I really wanted to know your point of view....:)
I am new to this type of algorithm,
i are documented on this subject,
i thought that theory to solve the problem of zero value,
with a NAN routine,
if the data is zero, replace with NaN,
but if you say that is impossible,
then it is just impossible
:)

k, sorry.

Now I'm REALLY confused. :confused:

I never mentioned anything about zero or NaN.
I only mentioned that it was impossible to use "float" to any degree of efficiency. (impossible is kinda strong of a word, I'll just say "extremely difficult")
02-11-2010, 11:21 AM
bonis62

Quote:

Originally Posted by poke349

k, sorry.

Now I'm REALLY confused. :confused:

I never mentioned anything about zero or NaN.
I only mentioned that it was impossible to use "float" to any degree of efficiency. (impossible is kinda strong of a word, I'll just say "extremely difficult")

you says
When the denominator goes to zero, the run-time (and memory) blows up to infinity - in other words, the algorithm fails.

i reply
use NaN and Float

sorry, i am not a native speaker, might, indeed, surely i expressed myself badly :D
02-11-2010, 01:43 PM
poke349

Quote:

Originally Posted by bonis62

you says
When the denominator goes to zero, the run-time (and memory) blows up to infinity - in other words, the algorithm fails.

i reply
use NaN and Float

sorry, i am not a native speaker, might, indeed, surely i expressed myself badly :D

I understand you're not a native speaker. ;)

Either you're being sarcastic, or you're completely missing the point... :ROTF:
That complexity is just part of the analysis of the algorithm.
It has nothing to do with 0 or NaN or anything hardware...

Basically I'm saying that (now, I'm just making these numbers up):

at 10,000,000 digits, it needs 100MB and 1 second
at 100,000,000 digits, it needs 2GB and 20 seconds
at 1 billion digits, it needs 50GB and 10 minutes
at 10 billion digits, it needs 5TB and 10 hours
at 100 billion digits, it needs 1000TB and 10 years
at 1 trillion digits, it needs infinite memory, and infinite time.

That's what I mean by blows up.

Basically, you get to the point where you have so many operations that no matter what you do, 53 bits of precision isn't enough, because roundoff error alone takes up 53 bits of precision.
02-11-2010, 03:40 PM
jcool

1 Attachment(s)

Did some runs on your latest version mate... not the slowest CPU out there ;)
It doesn't quite beat a Dual W5590, but it gets damned close for a single CPU :D
Just wish I could select 12 Threads, would probably net a better score.
02-11-2010, 04:47 PM
poke349

Quote:

Originally Posted by jcool

Did some runs on your latest version mate... not the slowest CPU out there ;)
It doesn't quite beat a Dual W5590, but it gets damned close for a single CPU :D
Just wish I could select 12 Threads, would probably net a better score.

Oh hey! A Gulfie! :D

There's no 12-thread option because a lot of the internal algorithms simply don't allow non-powers of two. It's inherent in the math. :(
So I'm forced to just round it up.
02-11-2010, 04:53 PM
jcool

Quote:

Originally Posted by poke349

Oh hey! A Gulfie! :D

Nope.. it's a Westmere-EP (--> DP CPU) ;)
Gulftowns are single QPI chips, like the upcoming Core i7 980 XE.

Quote:

There's no 12-thread option because a lot of the internal algorithms simply don't allow non-powers of two. It's inherent in the math. :(
So I'm forced to just round it up.

I see.. oh well, I really hate math, so I can live with that :rofl:
02-11-2010, 05:10 PM
poke349

Quote:

Originally Posted by jcool

Nope.. it's a Westmere-EP (--> DP CPU) ;)
Gulftowns are single QPI chips, like the upcoming Core i7 980 XE.

hmm... I thought all the dual-sockets ended in "-town".

Clovertown
Harpertown
Gainestown
Gulftown?

Also, since these dominate the single-socket categories, is it safe for me to update the records list on my webpage? They're ES and not yet retail, so... your call. :)

Quote:

Originally Posted by jcool

I see.. oh well, I really hate math, so I can live with that :rofl:

haha :rofl:
02-11-2010, 05:25 PM
jcool

ES Samples have been circulating since September 2009 or so, I think I got my first one in October, so it's been a while. Retail launch of this particular CPU will be March 16th, so it's not that far off.
Include them if you like :)

You are right, Intel broke their consistent naming scheme for this one. While the whole LGA1366 32nm Lineup (Quads and Hexas) with Hardware AES Support falls under the "Westmere" family (Tick - Tock, Nehalem was Tick, Westmere is Tock), Gulftown is the codename for single QPI desktop parts, like Bloonmfield was for the i7's.
Westmere-EP is the successor of Gainestown, they were probably looking to put that in line with the later-to-arrive Westmere-EX (<-> Nehalem-EX/Beckton successor).
02-11-2010, 05:29 PM
poke349

Quote:

Originally Posted by jcool

ES Samples have been circulating since September 2009 or so, I think I got my first one in October, so it's been a while. Retail launch of this particular CPU will be March 16th, so it's not that far off.
Include them if you like :)

You are right, Intel broke their consistent naming scheme for this one. While the whole LGA1366 32nm Lineup (Quads and Hexas) with Hardware AES Support falls under the "Westmere" family (Tick - Tock, Nehalem was Tick, Westmere is Tock), Gulftown is the codename for single QPI desktop parts, like Bloonmfield was for the i7's.
Westmere-EP is the successor of Gainestown, they were probably looking to put that in line with the later-to-arrive Westmere-EX (<-> Nehalem-EX/Beckton successor).

Funny, lol
Wiki says "Gulftown" will apply to both Core i7 9XX and Xeon 56xx.
But that's just wiki... lol
02-12-2010, 02:17 AM
jcmarfilph

This Turion M520 should be as fast as T6600 or P8400 ^_^

http://inlinethumb52.webshots.com/46...600x600Q85.jpg
02-12-2010, 07:22 AM
bonis62

Quote:

Originally Posted by poke349

I understand you're not a native speaker. ;)

Either you're being sarcastic, or you're completely missing the point... :ROTF:
That complexity is just part of the analysis of the algorithm.
It has nothing to do with 0 or NaN or anything hardware...

Basically I'm saying that (now, I'm just making these numbers up):

at 10,000,000 digits, it needs 100MB and 1 second
at 100,000,000 digits, it needs 2GB and 20 seconds
at 1 billion digits, it needs 50GB and 10 minutes
at 10 billion digits, it needs 5TB and 10 hours
at 100 billion digits, it needs 1000TB and 10 years
at 1 trillion digits, it needs infinite memory, and infinite time.

That's what I mean by blows up.

Basically, you get to the point where you have so many operations that no matter what you do, 53 bits of precision isn't enough, because roundoff error alone takes up 53 bits of precision.

option two:
completely missing the point :ROTF:
02-12-2010, 07:29 AM
Particle

Looks like the swap bench will have to wait. They shipped me MAX3147RCs instead of MBA3147RCs. :(
02-12-2010, 02:36 PM
poke349

Quote:

Originally Posted by Particle

Looks like the swap bench will have to wait. They shipped me MAX3147RCs instead of MBA3147RCs. :(

Sorry to hear that. :(

I still have a number of different things that need to be fixed/redone before the program is stable enough for beta-testing.

I'm probably just gonna release it as a public alpha... Alpha instead of Beta since it'll probably be the first in a series of successive optimizations - so there's gonna be a lot of changes to come... even after I release it.
02-13-2010, 07:49 AM
bonis62

Quote:

Originally Posted by bonis62

option two:
completely missing the point :ROTF:

hi, i dont know intel compiler.....

1) if you've tried the GCC compiler , it is much slower than the Intel compiler ?

2) for porting my code into Linux , Intel compiler work on linux ?

TY
02-13-2010, 09:21 AM
Qkjhfhaiguihfma

Quote:

Originally Posted by bonis62

hi, i dont know intel compiler.....

1) if you've tried the GCC compiler , it is much slower than the Intel compiler ?

2) for porting my code into Linux , Intel compiler work on linux ?

TY

1) yes!
2) yes, i think it's free for non-commercial use.
02-14-2010, 12:02 AM
mattkosem

1 Attachment(s)

Here's what I get with HT on at 4.4.

--Matt
02-14-2010, 12:23 PM
poke349

Quote:

Originally Posted by mattkosem

Here's what I get with HT on at 4.4.

--Matt

Wow, a whole second with a lower clock... ;)
02-15-2010, 06:52 PM
poke349

Looks like I've been beat pretty well now... :rolleyes:

500m Pi:
TPi v0.9.2: 554 seconds
y-cruncher v0.5.2: 588 seconds

http://www.numberworld.org/y-crunche..._2_15_2010.jpg

Process Explorer shows TPi as not using a lot of CPU, especially towards the end of the computation... and yet it is still so fast... lol

TPi:
Series + Division: 410.28
Square Root: 23.861
Final Multiply: 20.467
Base Conversion: 100.068
Total: 554 seconds

y-cruncher:
Series + Division: 435.572
Square Root: 15.695
Final Multiply: 9.008
Base Conversion: 128.166
Total: 588 seconds

y-cruncher may have faster arithmetic, but it doesn't mean anything if I'm not using the formula properly... :rofl:

If only I'm a lot better at math...
02-17-2010, 08:18 AM
mk_dir

Here's a dual Gainestown EP config; dual Xeon E5520 @ Asus Z8NR-D12 with 24GB DDR3-1066Mhz ECC RAM (stock clocks).
I'm using the y-cruncher v.0.4.4.7762b version. Some Large Benchmarks:
(All RAM)

500,000,000 digits (and CPU-z):

http://img100.imageshack.us/img100/2204/500a.jpg

1,000,000,000 digits:

http://img130.imageshack.us/img130/6165/1000l.jpg

2,500,000,000 digits:

http://img442.imageshack.us/img442/8441/2500w.jpg

And here a 5,000,000,000 digits resluts:

http://img694.imageshack.us/img694/5223/50000n.jpg

A quick run too:

http://img94.imageshack.us/img94/954/88462521.jpg

Regards
02-17-2010, 06:46 PM
poke349

Quote:

Originally Posted by mk_dir

Here's a dual Gainestown EP config; dual Xeon E5520 @ Asus Z8NR-D12 with 24GB DDR3-1066Mhz ECC RAM (stock clocks).
I'm using the y-cruncher v.0.4.4.7762b version. Some Large Benchmarks:
(All RAM)

500,000,000 digits (and CPU-z):

1,000,000,000 digits:

2,500,000,000 digits:

And here a 5,000,000,000 digits resluts:

A quick run too:

Regards

WOW :eek::eek::eek:
5b...
The first of the memory monsters... :clap: (excluding myself...)
02-18-2010, 01:12 AM
mk_dir

Quote:

Originally Posted by poke349

WOW :eek::eek::eek:
5b...
The first of the memory monsters... :clap: (excluding myself...)

Thanks, I'm testing some new server hardware and this benchmark got my attention. Full multi-threaded and a masive RAM eater.
02-19-2010, 02:58 PM
poke349

Quote:

Originally Posted by mk_dir

Thanks, I'm testing some new server hardware and this benchmark got my attention. Full multi-threaded and a masive RAM eater.

;)

Another update on v0.5.2:

The code is done. And I've compiled the build that will "most likely" be the one that I release.
It's taken quite a while because Advanced Swap Mode required a fairly large program-design restructuring at the top level - which more or less broke half the existing features.
I didn't get everything working again until yesterday. Not to mention that I had a ton of midterms over the past 2 weeks...

I have one test I wanna do before I release it, but I can't start it until my Core i7 rig finishes a VERY LARGE test/task that it's doing right now. (it's almost done...)

In the meantime, here's a small screenie of the option selection menu for Advanced Swap Mode:

http://www.numberworld.org/y-crunche....2_preview.jpg

This will be the final test that I wanna run before I release the program.
ETA: 2 - 5 days on my Core i7 rig - which has gotten a bit of an upgrade for this purpose... :wasntme:
02-23-2010, 07:04 AM
poke349

Version 0.5.2 is out!!!

The greatest feature is of course: Advanced Swap Mode.

This feature is accessible in the "Custom Compute a Constant" option.
It's under "Computation Mode". But you must select at least 100,000,000 digits for the option to appear.

Now for starters:

Let's see who can beat this? :rofl::rofl::rofl:

http://www.numberworld.org/y-crunche...10_options.jpg

(click to enlarge)
http://www.numberworld.org/y-crunche...2010_small.jpg

Aside from a couple of large computations (including a new world record of 500 billion digits of e), this version is largely untested.
So please let me know of any bugs or errors you find.

I may start a new thread for this - this time, for hard drive benchmarking... ;)
02-23-2010, 08:06 AM
jcmarfilph

umm where is the download link for version 0.5.2? :)
02-23-2010, 08:11 AM
poke349

Quote:

Originally Posted by jcmarfilph

umm where is the download link for version 0.5.2? :)

The same place as the old link. Did you refresh?
I literally added it like a hour ago... So it might still be cached.
02-23-2010, 09:08 AM
jcmarfilph

Very nice improvement :D

http://inlinethumb52.webshots.com/46...600x600Q85.jpg

http://inlinethumb08.webshots.com/43...600x600Q85.jpg
02-23-2010, 09:36 AM
Movieman

:wasntme:
02-23-2010, 09:49 AM
poke349

Quote:

Originally Posted by jcmarfilph

Very nice improvement :D

Only AMD gets that much improvement. Core i7 doesn't even come close.
I haven't done any small Core 2 benches yet, so I don't know how much it gets. But I don't think it gets as much as K10 either...

Quote:

Originally Posted by Movieman

:wasntme:

Are you telling me you've got as many drives as cores? :yawn:
02-23-2010, 09:55 AM
Movieman

Quote:

Originally Posted by poke349

Are you telling me you've got as many drives as cores? :yawn:

I'm telling you that come next month your going to be very busy changing numbers on your records page!:rofl:
03-05-2010, 01:25 PM
onex

lol ^^:),

/post

that X5650 is a monster :cool:.
03-06-2010, 11:47 AM
poke349

Quote:

Originally Posted by onex

lol ^^:),

/post

that X5650 is a monster :cool:.

Wait for Dave when the NDA lifts on the Gulftowns... :yepp:
03-07-2010, 03:35 AM
onex

yeah, this is one curiosity teaser :)...
03-07-2010, 08:39 AM
Particle

I tried advanced swap mode, but when I set it to use G:\ the program went into an infinite loop (still in the menu, hadn't even started yet).
03-07-2010, 10:10 AM
poke349

Quote:

Originally Posted by Particle

I tried advanced swap mode, but when I set it to use G:\ the program went into an infinite loop (still in the menu, hadn't even started yet).

Where in the menu?

It asks you for the # of drives first. Then it asks for the paths.

http://www.numberworld.org/y-crunche...s_3_7_2010.jpg

I should probably write my own prompting function instead of using cin >> c;
Since cin >> x; does that infinite loop when it gets a letter instead of an integer.
03-07-2010, 10:26 AM
Movieman

Quote:

Originally Posted by onex

lol ^^:),

/post

that X5650 is a monster :cool:.

Nope, not that slow!:rofl:
03-07-2010, 04:42 PM
Particle

Ah, I see. I didn't see the first line and thus thought it wanted a comma-separated list or something.
03-10-2010, 03:31 PM
skycrane

5 Attachment(s)

can you put my runs in?
03-10-2010, 03:33 PM
skycrane

3 Attachment(s)

ive got a few more

im working on the 10 and 25 billion right now, hopefully it works :)
03-10-2010, 06:10 PM
poke349

HOLY @#&%^*&!!! :slobber::slobber::slobber:
WOW!!!

Quad socket + 32 GB of ram! (Though you were switching between 16GB and 32GB?)

Quick question: Do you actually have enough ram for the 10b and 25b runs? Or are you using the swap modes?
03-10-2010, 08:00 PM
skycrane

1 Attachment(s)

thanks, this is the first time ive ever benched the rig, usualy it just sits and crunches away on my boinc projects. ive got a total of 3 quads :D.. that one normaly runs 16gb, the other 2 have 8gb each. i borrowed the mem from the others to use all 32 for this run. its back to 16 now, and ill just start using the swap modes for the other runs

this is the 10 billion run :)

Validation Version: 1.0

Processor(s): Quad-Core AMD Opteron(tm) Processor 8356
Logical Cores: 16
Physical Memory: 34,356,379,648 ( 32.0 GB )
CPU Frequency: 2,311,021,151

Program Version: 0.5.2 Build 9082 Alpha 3 (x64 SSE3 - Windows ~ Kasumi)
Constant: Pi
Algorithm: Chudnovsky Formula
Decimal Digits: 10,000,000,000
Hexdecimal Digits: 8,304,820,238
Threading Mode: 16 threads
Computation Mode: Basic Swap
Swap Disks: 1
Working Memory: 20.1 GB

Start Time: Wed Mar 10 18:14:47 2010
End Time: Wed Mar 10 22:20:16 2010

Computation Time: 14,081.835 seconds
Total Time: 14,726.781 seconds

CPU Utilization: 667.83 %
Multi-core Efficiency: 41.73 %

Last Digits:
9763261541 1423749758 2083180752 2573977719 9605119144 : 9,999,999,950
9403994581 8580686529 2375008092 3106244131 4758821220 : 10,000,000,000

Timer Sanity Check: Passed
Frequency Sanity Check: Passed
ECC Recovered Errors: 0

----

Checksum: 1117a7c3424cae6a13185f85c7b024f2381a2428ffed75e218 078c94ad268809

im running the 25 now, wish me luck :)
03-10-2010, 10:42 PM
poke349

3 of them... :slobber::slobber::slobber::slobber::slobber:

If you have any extra hard drives lying around, they'll really help out the larger runs using Advanced Swap.

Since you have sooo much computation power, it'll obviously be heavily bottlenecked by disk bandwidth. ;)

Someday... I need to optimize the program better for NUMA machines like your quad-sockets... Only then will it be able to bring out their true potential... :rolleyes:
03-11-2010, 02:37 PM
skycrane

id be more than wiling to run the optimized program anytime you get it ready and report the results on it. i have a perfect way to test the difference between the 2. all 3 rigs have the same tyan s4985 mobo, same memory, same opti 8356's with the same batch number and same hdd.

for the hdd, are you saying that the best bet would be to run 15gb of mem, and the hdd in a raid0 array??? to get the bandwith that i need?
would 2 75 gig raptors work better than 4 regular 7200 rpm hdds?
03-11-2010, 02:56 PM
poke349

Quote:

Originally Posted by skycrane

id be more than wiling to run the optimized program anytime you get it ready and report the results on it. i have a perfect way to test the difference between the 2. all 3 rigs have the same tyan s4985 mobo, same memory, same opti 8356's with the same batch number and same hdd.

for the hdd, are you saying that the best bet would be to run 15gb of mem, and the hdd in a raid0 array??? to get the bandwith that i need?
would 2 75 gig raptors work better than 4 regular 7200 rpm hdds?

It's not a simple fix. Virtually the entire multi-threading structure of the program needs to be redesigned from scratch and re-written to do that.
That's not something I have on my plans. So it may not be for years. :( (assuming I'll still have interest in the program by then)

The hardware that I need to do it would be well beyond my budget. :(:(:(
(I'd basically have to build myself a Beowulf cluster of high-end dual or quad-socket machines fully loaded with an absolutely obscene amount of ram.)

Perhaps a single 4P Beckton or a 4P Magny Cours machine will be enough... both of which are well beyond my budget. (And I'm speaking as a college student so i have no money or income... lol)
Basically I need a machine that is VERY Non-Uniform in memory to be able to write for it...

About the hard drives:
From the results that I have gotten so far, letting the program manage your HDs separately does seem to be more efficient than RAID 0.

So RAID doesn't seem to be useful until you run out of drive letters.

With your amount of ram (> 16 GB), disk seeks won't become significant until you push over 100 billion digits. So bandwidth will be the only that really matters. (In other words: 4 x 7200RPM will beat out 2 x raptors.)
03-12-2010, 09:26 AM
skycrane

sounds good :) im heading out now to grab a few more 1tb drives :) maybe ill be the first one to 1t :D
03-12-2010, 10:03 AM
Particle

I get about 50% loading on 12 CPUs using swap mode on my 15K disks. :) That's actually quite impressive, imo. Good work on efficient swapping!
03-12-2010, 03:12 PM
tj_jackson

6 Attachment(s)

my results

Config:
Supermicro X8DA6
2x Xeon X5550
6x2gb ddr3 1333mhz 99928
03-12-2010, 06:32 PM
LiquidNitrogen

I just approximate Pi as 3.14, and it's pretty fast

:)
03-13-2010, 12:31 AM
poke349

Woah... I go away for day, and there are 4 new posts... lol

Quote:

Originally Posted by skycrane

sounds good :) im heading out now to grab a few more 1tb drives :) maybe ill be the first one to 1t :D

Wow... you're probably the first person (besides me), who actually went out to buy hardware to run this program. :shocked:
Good luck with that. Gonna be very interesting. :D

p.s. Lemme warn you though... 1 trillion digits is gonna take a LONG time. As in: more than 20 days...

Quote:

Originally Posted by Particle

I get about 50% loading on 12 CPUs using swap mode on my 15K disks. :) That's actually quite impressive, imo. Good work on efficient swapping!

Thanks. :D Granted, you've got 8? hard drives... :)
Using just one would be an absolute pain... :p:

The swap mode wasn't an easy task. I've been working on it (on and off) since November 2008. (well before the first release of this program)

I actually had it completely designed and laid out on paper about a year ago, but I never found the time to actually finish it.
It wasn't until this winter quarter, that my class load was low enough to let me goof off a bit. :(

And when I actually had it partially working, testing it was a complete nightmare since swap computations take forever.
Due to the nature of the algorithms, things behave differently for small and large computations. So a "simple" test on the larger sizes would take hours. Of course, nobody gets it right on the first try...
So I spent much of January abusing my workstation with 40GB ram drives to actually test and debug this thing...

I remember sitting there in horror when a 30 hour test failed on my Core i7 rig...
Then I got the idea to ram drive it on my workstation... made it sooo much easier to fix the bug...
(This was prior to getting the 4 x 2TB. So everything was REALLY slow.)

Quote:

Originally Posted by tj_jackson

my results

Config:
Supermicro X8DA6
2x Xeon X5550
6x2gb ddr3 1333mhz 99928

Wow. Another i7 dualie. :up:
Is your system reserving 1GB for video or something? Since the program is only reading 11GB.

Been getting mostly multi-socket results these few days... :D

Quote:

Originally Posted by LiquidNitrogen

I just approximate Pi as 3.14, and it's pretty fast

:)

22/7 is better. ;)

Both 3.14 and 22/7 need 4 characters to write, but 22/7 is more accurate. :D:D:D

22/7 is faster too. Fewer strokes to write. :p:
03-14-2010, 07:22 AM
Particle

Guilty! I've got eight Fujitsu MBA3147RCs on a HighPoint RocketRAID 4320 in RAID-5. The program moves hundreds of megabytes per second while running. :D

It was running on the fastest part of the disk:

http://www.pcrpg.org/pics/computer/rr4320/hdtune.png
03-14-2010, 11:10 AM
poke349

Quote:

Originally Posted by Particle

Guilty! I've got eight Fujitsu MBA3147RCs on a HighPoint RocketRAID 4320 in RAID-5. The program moves hundreds of megabytes per second while running. :D

It was running on the fastest part of the disk:

http://www.pcrpg.org/pics/computer/rr4320/hdtune.png

haha. yep! :rofl::rofl::rofl:

Disk is the clear bottleneck... So I made it so that you can fix that by just throwing more drives at it. :D:D:D

@skycrane

You've got very some serious competition to 1 trillion digits. :rolleyes:
Want me to PM you the specifics?

Actually, if you're willing to run that awesome machine of yours for more than a week like that, you're actually in good shape of breaking some of the world records for the other constants.
(Namely the records that I set between March - May last year.)

e, Log(2), and Zeta(3) are your best bets.

e:
I recently set this to 500 billion digits on my Core i7 machine with 12 GB ram + 4 x 2 TB. It took 12.8 days to compute and verify.

Log(2):
This was done to 31 billion digits last year on my 64GB workstation. But it only took 40 hours to compute and verify. So 50 billion digits could possibly go sub-one week if you have enough drives running in parallel.
Same applies to Log(10), but no one gives a crap about Log(10). So lol.

Zeta(3) - Apery's Constant:
This was also done to 31 billion digits last year on my 64GB workstation. It took 4 days on my workstation to compute and verify. So it's slower. But there's a Wikipedia article on it with a list of records.

I didn't mention Square Root of 2 and Golden Ratio because there's someone already working on that.
(They're both already computed to much more than the current official records, but are both pending verification.)

Catalan's Constant and the Euler-Mascheroni Constant don't support Advanced Swap Mode yet, so you'll need more than 64GB of ram to beat those. (Not to mention they are both FREAKING slow...)
03-14-2010, 03:25 PM
onex

Quote:

Nope, not that slow!:rofl:

:rofl:
man, your teasing around too much :rofl:.

E:
that cpu was seem listed somewhere.. since mid February..:yepp:
03-14-2010, 06:02 PM
Particle

A competition! Hmm. I've got another array that does 300MB/s I could add to the mix...
03-15-2010, 07:30 AM
Particle

1 Attachment(s)

You guys will never guess what I'm doing...

*whistles*

PS: Samples are an average of about 2 minutes, so peak utilizations aren't shown.
03-15-2010, 07:34 AM
poke349

Quote:

Originally Posted by Particle

You guys will never guess what I'm doing...

*whistles*

PS: Samples are an average of about 2 minutes, so peak utilizations aren't shown.

CPU and disk usage graph of a large swap run. Am I right? :D

*I've been doing a lot of this myself using processor explorer over shorter intervals. So the patterns kinda look familiar...
03-15-2010, 07:53 AM
Particle

Yep! :) I started it last night primarily as a variable stress test, and the funny part is that I don't remember how big I set it to. It was 22% on summing this morning, so I guess the result will be a surprise. Can't speak for if it'll validate. Runs on my machine frequently don't even at stock.
03-15-2010, 08:04 AM
poke349

Quote:

Originally Posted by Particle

Yep! :) I started it last night primarily as a variable stress test, and the funny part is that I don't remember how big I set it to. It was 22% on summing this morning, so I guess the result will be a surprise. Can't speak for if it'll validate. Runs on my machine frequently don't even at stock.

It should be clearly visible what the size is? Unless you somehow hid the window or something?

The program will validate as long as it finishes - regardless of whether it fixes errors or detects cheating, etc... since those are just flags in the Validation.txt which is protected by the hash.

The only time it won't get to the validation is if it crashes or encounters some error that keeps it from finishing.
You'd have to be very unstable for it crash. And it usually only errors out during IO errors (which is usually insufficient disk space, or a failing disk).
It will also error out if it runs into a clear implementation bug... but hopefully that won't be happenning...

8 hours and only 22%... That's gotta be a gigantic run... :eek:
03-15-2010, 08:16 AM
Particle

I just didn't remember. I've remoted in and it looks like it's a 100B run. 50h CPU time so far, 3.7TB disk read, and 3.6TB disk written. Works out to an overall average of around 40% CPU and 100MB/s disk (both read and write).
03-15-2010, 10:09 AM
poke349

Quote:

Originally Posted by Particle

I just didn't remember. I've remoted in and it looks like it's a 100B run. 50h CPU time so far, 3.7TB disk read, and 3.6TB disk written. Works out to an overall average of around 40% CPU and 100MB/s disk (both read and write).

Ah, you're running it headless...

100b, that would be an awesome entry to the list. :D

Hopefully I'll be able to fix my workstation over Spring Break.
It went down shortly before I finished v0.5.2. But it did last long enough to do all the critical dev work for v0.5.2.
So it hasn't actually done a full pi computation using swap yet... (hence why the benchmarks that I posted were mostly from my Core i7 lanbox.)
03-16-2010, 05:54 AM
Particle

Looks like I'm squarely on track for the worst score on record. It seems to be taking progressively longer for each successive percent. The first ten hours got me to 22%. Another twenty-four have me at 43%. At this pace, I should finish just before we make it back to the moon. heh
03-16-2010, 07:24 AM
poke349

Not necessarily.

There are huge steps in the %s.

I don't remember exactly what they are for 100b: (they differ depending on the size)

22%
~30%
43%
~60%
~79%
Done

The gaps between the %s increase as it goes on.
You're probably nearing the point just before it reaches the next %.

It's true that it will slow down as it goes on - and this because it swaps more and more. But with that many drives, it shouldn't be that much.
03-16-2010, 02:07 PM
skycrane

5 Attachment(s)

Quote:

Originally Posted by poke349

haha. yep! :rofl::rofl::rofl:

Disk is the clear bottleneck... So I made it so that you can fix that by just throwing more drives at it. :D:D:D

@skycrane

You've got very some serious competition to 1 trillion digits. :rolleyes:
Want me to PM you the specifics?

Actually, if you're willing to run that awesome machine of yours for more than a week like that, you're actually in good shape of breaking some of the world records for the other constants.
(Namely the records that I set between March - May last year.)

e, Log(2), and Zeta(3) are your best bets.

e:
I recently set this to 500 billion digits on my Core i7 machine with 12 GB ram + 4 x 2 TB. It took 12.8 days to compute and verify.

Log(2):
This was done to 31 billion digits last year on my 64GB workstation. But it only took 40 hours to compute and verify. So 50 billion digits could possibly go sub-one week if you have enough drives running in parallel.
Same applies to Log(10), but no one gives a crap about Log(10). So lol.

Zeta(3) - Apery's Constant:
This was also done to 31 billion digits last year on my 64GB workstation. It took 4 days on my workstation to compute and verify. So it's slower. But there's a Wikipedia article on it with a list of records.

I didn't mention Square Root of 2 and Golden Ratio because there's someone already working on that.
(They're both already computed to much more than the current official records, but are both pending verification.)

Catalan's Constant and the Euler-Mascheroni Constant don't support Advanced Swap Mode yet, so you'll need more than 64GB of ram to beat those. (Not to mention they are both FREAKING slow...)

yea poke, can you send me the pms id like to see what this can do before MM destroyes us all with his 12core machine hes got... lol

well this sucks, looks like the disks i got wont work, they are to slow to move that amount of data Particle was talking about. maybe ill see if i can use my brothers NAS hes got. do you think on a gigabit line, if i have enough disks for it, that it would be fast enough to handle the bandwidth?
or would it be better to run 7 or 8 drives off the mobo?

poke, here are the updated of my runs i did. they are a bit faster. i had some programs running in the background , and i was doing this over my NetOp and it was slowing down all my runs
03-16-2010, 02:23 PM
poke349

Quote:

Originally Posted by skycrane

yea poke, can you send me the pms id like to see what this can do before MM destroyes us all with his 12core machine hes got... lol

well this sucks, looks like the disks i got wont work, they are to slow to move that amount of data Particle was talking about. maybe ill see if i can use my brothers NAS hes got. do you think on a gigabit line, if i have enough disks for it, that it would be fast enough to handle the bandwidth?
or would it be better to run 7 or 8 drives off the mobo?

poke, here are the updated of my runs i did. they are a bit faster. i had some programs running in the background , and i was doing this over my NetOp and it was slowing down all my runs

Do whatever you can to maximize your total combined bandwidth. (though it's still bottlenecked by the slowest drive)
So I don't think gigabit network is gonna work since that's only 128 MB/s. Keep everything on the mobo. SATA and SAS cards are fine - hardware RAID support isn't necessary since the program can take care of that.

Basically, whatever will preserve the combined total bandwidth will work. That seems to be the only thing that matters...

My 4 x 2TB Hitachi drives get about 450 - 480 MB/s. (as measured by process explorer while y-cruncher is running)
I keep them separate and let the program manage them. So no RAID 0.

I have one of these on my workstation:
http://www.newegg.com/Product/Produc...-009-_-Product

It's great and it preserves all the bandwidth. It's cheap because it doesn't have raid.
Next year, that card is gonna to be fully loaded because I'll be moving my 4 x 2TB from my Core i7 machine into my Xeon Workstation.
(Optical Drive + 64 GB SSD + 750GB + 4 x 1TB + 4 x 2TB + 3 external SATA = 14 total = 6 on mobo + 8 on card)

I know of others who are using some more expensive SAS cards... they all work.

I've added Dave's #'s to the list... Devastating - even with v0.4.4... :eek::eek::eek:
03-17-2010, 12:37 PM
Particle

I have been recording CPU and disk usage for the entire duration of this run at a rate of one sample per second. I'm faced with a dilemma about how to represent it graphically. Since it's obviously impractical to display one x on the graph for each sample, I have to determine how to combine multiple samples. It is the decision to pick between average and peak utilization that I'm not sure of. Average would certainly show the overall load better, but it does little justice to the spiky utilization of a program such as y-cruncher in swap mode and is misleading in terms of making the program appear not to be maximizing system resources. I'll show you what I mean:

Averaged Graph, Peaked Graph

For a graph that's even as wide as the ones above, each "macrosample" still represents just shy of two minutes of time.

Fun Stats: 2.39 quadrillion cycles of CPU time consumed so far, 24.1 trillion bytes read, 23.5 trillion bytes written
03-17-2010, 01:47 PM
skycrane

well, i lost my last run :( it was a small one, only 100 b....

the breaker on the ups poped, but im not sure why that happened. ive had the same rigs running on that line for the last 2 months. got home an hour ago, and the run was gone :(
03-17-2010, 02:13 PM
Movieman

Quote:

Originally Posted by skycrane

yea poke, can you send me the pms id like to see what this can do before MM destroyes us all with his 12core machine hes got... lol

Sorry, but I had to but your in luck, with 12 gig all I can do is up to one billion..2.5 bil uses like 11.5 gig of memory and 500mb isn't enough left to run the system..I know, I tried it..
I may try bumping up to 4600 to push the small numbers though!:sofa:
03-17-2010, 10:07 PM
poke349

Quote:

Originally Posted by Particle

I have been recording CPU and disk usage for the entire duration of this run at a rate of one sample per second. I'm faced with a dilemma about how to represent it graphically. Since it's obviously impractical to display one x on the graph for each sample, I have to determine how to combine multiple samples. It is the decision to pick between average and peak utilization that I'm not sure of. Average would certainly show the overall load better, but it does little justice to the spiky utilization of a program such as y-cruncher in swap mode and is misleading in terms of making the program appear not to be maximizing system resources. I'll show you what I mean:

Averaged Graph, Peaked Graph

For a graph that's even as wide as the ones above, each "macrosample" still represents just shy of two minutes of time.

Fun Stats: 2.39 quadrillion cycles of CPU time consumed so far, 24.1 trillion bytes read, 23.5 trillion bytes written

Wow... That's very interesting... I've never profiled a run for longer than 10-20 minutes... :clap:
It's almost detailed enough for me to recognize each pattern and match it with where it is in the algorithm. :D

I agree, neither graph is adequate. The peak utilization graph is too inflated. And the average hides the fact that the program is usually maxing out one resource: 100% cpu or 100% disk. (So it ends up showing both as less than 100% - but they add up to ~100%.)

Due to data dependencies, it's difficult to efficiently do both computation and I/O at the same time... (it's possible, but it's freaking hard to do it... lol :()

There are few places where it is able to pull off "some" computation and I/O at the same time - and you will notice in those places that cpu utilization + disk utilization > 100%.
But that's insignificant with respect to the entire computation. :(

Quote:

Originally Posted by skycrane

well, i lost my last run :( it was a small one, only 100 b....

the breaker on the ups poped, but im not sure why that happened. ive had the same rigs running on that line for the last 2 months. got home an hour ago, and the run was gone :(

Ouch... :( How far did it get?
I highly doubt that y-cruncher is more stressful than WCG on a quad-socket like yours due to the NUMA effects... Bad luck?

Quote:

Originally Posted by Movieman

Sorry, but I had to but your in luck, with 12 gig all I can do is up to one billion..2.5 bil uses like 11.5 gig of memory and 500mb isn't enough left to run the system..I know, I tried it..
I may try bumping up to 4600 to push the small numbers though!:sofa:

The way I get it to work is to first enable the pagefile.
Then I start a 2.5b run. Once it finishes allocating all the memory and begins to sustain 100% cpu. I kill it. Then I do it again. For a few times.

That will effectively force the system to page itself out of memory enough to do the whole run without thrashing.

Or you can wait... If all goes to plan, v0.5.3 is gonna get an algorithmic improvement that will not only make it faster, but may also lower the memory requirement a bit. ;)

No ETA yet, since I haven't actually started any coding. But the math is finalized... ;)
03-18-2010, 03:57 AM
skycrane

i was almost 2 days into it when it happened. my problem is the hdd, i just dont have enough for the bandwidth it needs.

the ups is a rack server compatable apc matrix 5kva :) ive got 3 quad socket tyans, i7 920@ 3.4 and Q6600 and my entertainment system all pulling off a 30A 240v line. ive got it split with 2 quads and the tv, on one of the 240 legs, and the other leg is everything in the computer room. and its been running like that with no problem for the last 2 months. Just bad luck it seems.
03-19-2010, 03:55 AM
poke349

Quote:

Originally Posted by skycrane

i was almost 2 days into it when it happened. my problem is the hdd, i just dont have enough for the bandwidth it needs.

the ups is a rack server compatable apc matrix 5kva :) ive got 3 quad socket tyans, i7 920@ 3.4 and Q6600 and my entertainment system all pulling off a 30A 240v line. ive got it split with 2 quads and the tv, on one of the 240 legs, and the other leg is everything in the computer room. and its been running like that with no problem for the last 2 months. Just bad luck it seems.

2 days in... ouch... :(

What drives are you using? And how much bandwidth are you getting?
It might be that the buffering settings aren't optimal. (both in Windows and in the program)
03-19-2010, 12:39 PM
Particle

My 100B run completed yesterday as far as calculation is concerned. It has spent all day today and half of yesterday converting the result to decimal, I imagine. 81.879 hours. I'll post a screen shot once it is 100% complete.

On another note, I have a question for you, Poke: I've noticed that the program has done 45 trillion bytes worth of reads so far. The unrecoverable read error rate for these enterprise-class drives is one per quadrillion bits read. This run has caused about 1/3 of that amount of reads so far and would have simply spun the dial around on regular desktop drives (1 per 100 trillion). Granted, in my case these reads were spread out among 16 individual hard drives. Statistically, I think that increases the expected rate, doesn't it? What happens if an unrecoverable read error is experienced? Does the system just try again and it works, does the calculation get a silent error in it, does it outright crash, etc? With these big runs, this is surely to be encountered from time to time.
03-19-2010, 01:23 PM
skycrane

Quote:

Originally Posted by Particle

My 100B run completed yesterday as far as calculation is concerned. It has spent all day today and half of yesterday converting the result to decimal, I imagine. 81.879 hours. I'll post a screen shot once it is 100% complete.

On another note, I have a question for you, Poke: I've noticed that the program has done 45 trillion bytes worth of reads so far. The unrecoverable read error rate for these enterprise-class drives is one per quadrillion bits read. This run has caused about 1/3 of that amount of reads so far and would have simply spun the dial around on regular desktop drives (1 per 100 trillion). Granted, in my case these reads were spread out among 16 individual hard drives. Statistically, I think that increases the expected rate, doesn't it? What happens if an unrecoverable read error is experienced? Does the system just try again and it works, does the calculation get a silent error in it, does it outright crash, etc? With these big runs, this is surely to be encountered from time to time.

Particle, that would be correct, the error rate for one drive is one in a 10^15 and for 2 drives you would cut that in half ie. 1 in 500 trillion. so for 16 drives. You would have an error for every 15,258,789,062 bytes of data written. You should have had 3000 errors written on that run. 45tb/15gb

umm particle, did you mean to say bits or bytes ?? depending on what it is, my calculations might be off by a factor of 8 :(
03-19-2010, 01:54 PM
Particle

One per 10^15 bits read. Desktop drives are generally one per 10^14. That works out to one per quadrillion and one per 100 trillion respectively.

I don't think it directly cuts in half. I read an article about it a year or two ago, and it wasn't as straight forward as one would expect. I'll see if I can find it.

Edit: Here it is.
03-19-2010, 03:43 PM
Particle

It's complete. Something went wrong with writing the digits to disk, but the calculation itself completed. This isn't the first time, however. I have to wonder if there's a compatibility problem at play in the same way that IBT would fail on my Phenom II X4 940 system even if it was underclocked.

http://www.pcrpg.org/pics/computer/y...lete_final.jpg
03-19-2010, 06:29 PM
poke349

Quote:

Originally Posted by Particle

It's complete. Something went wrong with writing the digits to disk, but the calculation itself completed. This isn't the first time, however. I have to wonder if there's a compatibility problem at play in the same way that IBT would fail on my Phenom II X4 940 system even if it was underclocked.

http://www.pcrpg.org/pics/computer/y...lete_final.jpg

Oh god... Base conversion checksum failure...
This means that it finished the conversion, but the converted digits failed a redundancy check.

In a more technical explanation:
Before the conversion, the program computes (binary digits mod p), where p is a prime number.
After the conversion, the program computes (decimal digits mod p).
The result MUST match because the modulus will always be the same regardless of the what base the digits are in.

The only thing I can think of that went wrong to get that kind of error would be something to with memory.
Most CPU errors get caught on the spot and will show up as "Multiplication Failure" (the program spends 99% of the time performing multiplications - which almost all have redundancy checks).
Those errors are corrected and the program will proceed normally.
This one wasn't the case...

How much virtual memory did you have remaining during the run?
I have encountered an issue once where Windows will fail to run a thread in the event of insufficient virtual memory. (and still return with a normal return code... so the program doesn't know the thread failed...)

When a hard drive encounters an unrecoverable error, it will be caught via hardware CRC (with extremely high probability) and will return to the OS as a read fault.
(This was thoroughly tested on a failing hard drive that I have.)
When that happens, the program will tell you that it read-faulted. (in bright red text)
Then it will print the error code, pause until you hit enter, and reattempt that I/O command. It will continue to reattempt the I/O (with the pauses) until it either succeeds, or the user decides to kill it. In other words, it will enter an infinite loop if the sector is unreadable. This is intentional by design to allow the user some ability to tweak things and reattempt. (so no, it won't crash)

The probability that a hard drive soft errors AND slips through CRC is extremely low... I believe less than 1 in 10^20 bytes. So if the hard drive returns without an error, the data is correct. If it encounters an error that it can't fix, then it will tell the OS that it the cluster is unreadable.

So yes, 45 TB is already pushing the limit of the specification of these HDs, but it won't fail silently.
(I would like to think that they are built to have a much lower unrecoverable error rate than the specs because I've put nearly a petabyte on all my drives, and I've never had a single error on a drive that wasn't loaded with SMART errors...)

Check the SMART on your drives... I dunno what to say. :(:(:( Sorry that it failed.
Since the program has successfully done 100b at least 3 times on 3 other machines, it makes a software bug unlikely. But I'll keep it an open possibility since the program is multi-threaded... (though I can't do anything unless someone can repro it in some consistent manner)
03-19-2010, 07:10 PM
poke349

Another thing:

Exactly what settings did you use? That could make a difference.

When I get back from spring break, I'll try running with the exact same settings you used. And if that gives the same error, I'll have one hell of a bug to fix...
(But at least you can sleep a little easier knowing that it isn't your hardware.)

Anyhow, I have plans to add more redundancy checks into the program over the next few versions. (these will also go along with a more aggressive error-correction)
They will come with some performance overhead, but will be offset by speed optimizations that I'll be doing.

EDIT:
And if you haven't deleted the hexadecimal digits already, you can use the digit viewer (option 4) to see if the hexadecimal digits are correct.

They should be:

Code:

adf23df916 c2d4167875 8e2bede8c6 e87a5d957b 00c7f252fd : 83,048,202,350 e55d87142f 94e93e4f54 d1a

Whether or not your hexadecimal digits are correct is irrelevant to the the base conversion failure you got. But at least it ensures that no other errors were made...
03-19-2010, 08:00 PM
Particle

2 Attachment(s)

The settings should all be in the image with the exception of where I set the swap disks to, F:\ and G:\.

I suppose it's technically possible for it to be a memory problem, but I would think it unlikely it since it is ECC/registered. Proactive ECC options are set to maximum, chipkill is enabled, etc. Any single bit error would be corrected and multi-bit errors should at least be detected and cause a BSOD. An IC failure would cause automatic failover to the extra 9th IC per side. *shrug* It's puzzling.

The good news is that Pi itself was calculated correctly, as per the attachment.

Would it be possible to redo just the base conversion part? The Pi information is already there, and if there's a reproducible problem that should make it a lot quicker to validate. If it doesn't occur again, it could be indicative of some hardware problem or maybe just a fluke due to planetary alignment or whatnot. Either way, it would be helpful for both of us.

PS: Can you provide an MD5, SHA, SFV, or similar (or better yet, all) checksum for the Pi output file?

Edit #2: I don't run with any virtual memory. I did run low on physical memory late last night, but I closed some big apps before it ran out entirely. I was playing Battlefield: Bad Company 2 at the time and killed it quickly.
03-19-2010, 09:08 PM
poke349

Quote:

Originally Posted by Particle

The settings should all be in the image with the exception of where I set the swap disks to, F:\ and G:\.

I suppose it's technically possible for it to be a memory problem, but I would think it unlikely it since it is ECC/registered. Proactive ECC options are set to maximum, chipkill is enabled, etc. Any single bit error would be corrected and multi-bit errors should at least be detected and cause a BSOD. An IC failure would cause automatic failover to the extra 9th IC per side. *shrug* It's puzzling.

The good news is that Pi itself was calculated correctly, as per the attachment.

Would it be possible to redo just the base conversion part? The Pi information is already there, and if there's a reproducible problem that should make it a lot quicker to validate. If it doesn't occur again, it could be indicative of some hardware problem or maybe just a fluke due to planetary alignment or whatnot. Either way, it would be helpful for both of us.

PS: Can you provide an MD5, SHA, SFV, or similar (or better yet, all) checksum for the Pi output file?

Edit #2: I don't run with any virtual memory. I did run low on physical memory late last night, but I closed some big apps before it ran out entirely. I was playing Battlefield: Bad Company 2 at the time and killed it quickly.

Ah... so you were low on memory.
I encountered almost the exact same problem in one of my tests. :(
But I wasn't able to reproduce it so I never did a fix or a work-around for it.

Since you spent a good 90 hours running this, I'm gonna give you the explanation that you deserve.
(Since you're a programmer, I'll can go into detail here.)

When I killed off the pagefile, and set the program to use almost all the ram, what would happen is that windows would spawn all the threads that it needs. But it wouldn't run all of them. Some of the threads would just sit idle (Task Manager shows 13% cpu usage on 8 logical cores, but with 16+ threads.)
I was only able to repro this in like 3 out of some 20 tries... And of the 3, I noticed that 2 of them didn't use all the cpu, and terminated them. The 3rd one was a 5 billion digit test. I didn't kill it because I was away for something else. When I got back, it finished, but the decimal digits fell ~3000 short of 5 billion.

Then when the wait function is called on each thread, each thread is terminated normally. But the threads that didn't run properly also terminated - without doing what they were supposed to.
The return value of those wait functions is 0 - no error. Which tricks the program into thinking that the threads finished what they were supposed to.

Then the computation would go on with the incomplete data. All redundancy checks are only done within the arithmetic. But since an entire thread (along with all the arithmetic that should be done with it) was omitted, the incorrect data is able to slip through all error-detection and make it all the way to end result.

Last question is why it died in the conversion and not earlier.
In my 5 billion digit test, the failed threads happened at the very beginning.
Because of the way the algorithm works, all the work done in the beginning only affects the latter digits. So when the error occurred on the very first set of threads, it only messed up the last 3000 digits.

The base conversion has a much larger load-balancing problem then the rest of the computation. So my trick to "fixing" that is to run double the # of threads that the computation normally uses.

more threads = more memory

If you were right at that threshold... :(:(:(

I know what I need to do now. I need to put in a hand-shaking protocol into all thread destruction to actually confirm if they finished properly...
And should a failure occur, I'll need to print an error at the least, and if possible, attempt to roll back to the last sync point and re-run. (which isn't always possible if the work is done in place)

Not an easy task, since it's incompatible with my threading API. But I feel this is necessary.

I get the feeling that this whole thread-stalling thing might be an OS problem. At the very least, the wait functions should return an error instead of 0. :(
I've triple-checked my threading API and even tested it by intentionally causing thread-creation failures. And it holds together and prints the appropriate error message when it's supposed to.
So I'm starting to think that it is yet another bug in Windows... (I've found more before, but I'll get my house burned down if I disclose them publicly... lol)

Thanks for your help. I definitely ignored a problem that I shouldn't have.

EDIT:
Yes, it's possible to just redo the conversion. I intentionally have it write the hex first for just a scenario, but it was only meant for the world record attempts, so I never actually added a feature for it in the program. (Should it happen on a world record run, I can write a parser for the hex digits and reload at the conversion. But it has never actually happened so I don't have the parser yet...)
The conversion behaves the same way regardless of what you're computing. So you can test it by computing Square Root of n or Golden Ratio to xx digits. (they compute very quickly, so the conversion dominates the total time)
03-19-2010, 09:46 PM
skycrane

Quote:

Originally Posted by Particle

One per 10^15 bits read. Desktop drives are generally one per 10^14. That works out to one per quadrillion and one per 100 trillion respectively.

I don't think it directly cuts in half. I read an article about it a year or two ago, and it wasn't as straight forward as one would expect. I'll see if I can find it.

Edit: Here it is.

i read what he wrote, and understand it. but its still hard to believe. it sounds like hes making an assumption that the dice have memory:) and anyone from Vegas will tell you they dont.. lol

i verry well could be wrong, or maybe he was trying to dumb it down so anyone could understand it. it just feels like to me hes missing something
03-19-2010, 09:58 PM
poke349

Quote:

Originally Posted by skycrane

i read what he wrote, and understand it. but its still hard to believe. it sounds like hes making an assumption that the dice have memory:) and anyone from Vegas will tell you they dont.. lol

i verry well could be wrong, or maybe he was trying to dumb it down so anyone could understand it. it just feels like to me hes missing something

I think those are specified error rates. Actual error rates would be lower.
I've put at least several petabytes of I/O on my normal desktop hard drives over the past year and a half... and I can tell you 99.99999% certainty that the actual unrecoverable read error rate is lower than that.

Particle, I just took a closer look at your screenie. You left the memory at the minimum 2.02 GB. That's just the minimum to run the computation.
You can actually increase it - which makes it faster. That's probably why it took so long.

EDIT: About the validation errors that you usually get. Which error is it? The frequency sanity check? Or the timer sanity check?
I'm curious, since the program might be being overly aggressive with the anti-cheat protection.

Show 100 post(s) from this thread on one page

All times are GMT -8. The time now is 05:06 PM.

XtremeSystems