
Thread: New Multi-Threaded Pi Program - Faster than SuperPi and PiFast

  1. #301
    Registered User
    Join Date
    Aug 2005
    Posts
    67
    Quote Originally Posted by bonis62 View Post
    A difference of six seconds, not bad. My 32-bit code is good, but
    your code is truly remarkable...



    ya, he has me beat by about 100 seconds too. guess i need to work on my code...

    can't wait to see your new versions!
    Try my multi-threaded prime benchmark!
    If you like it and want to see more - bitcoin me!!
    1MrPonziaM4QT2S7SdPEKQH88BGa4LRHJU
    1HaxXoRZhMLxMJwJ52VfAqanSuLuh8CCki
    1ZomGoxrBqyVdBvHwPLEERsGGQAtc3jHp
    1L33thAxKo1GqRWRYP5ZCK4EjTMUTHFsc8

  2. #302
    Registered User
    Join Date
    Sep 2007
    Location
    italy
    Posts
    85
    Quote Originally Posted by poke349 View Post
    Well... I compiled it for x64...
    To me.... this smells like a compliment

  3. #303
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by bonis62 View Post
    To me.... this smells like a compliment
    I recompiled it for x86 and got 17 seconds. So as expected, x64 is a bit faster.
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  4. #304
    Registered User
    Join Date
    Sep 2007
    Location
    italy
    Posts
    85
    Quote Originally Posted by poke349 View Post
    I recompiled it for x86 and got 17 seconds. So as expected, x64 is a bit faster.

    Indeed,
    the difference is small (a bit).
    I often wonder whether it is worth making two versions (64/32-bit)
    for this type of application,
    since the difference is minimal...

  5. #305
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by bonis62 View Post

    Indeed,
    the difference is small (a bit).
    I often wonder whether it is worth making two versions (64/32-bit)
    for this type of application,
    since the difference is minimal...
    With our current implementations, the difference is small.
    But a more optimized version that stays in cache and uses hard-coded loop-unrolling will benefit a lot from the extra x64 registers.

    x64 also lets a single process address more than 2 GB of RAM.
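    (A hedged sketch of what "hard-coded loop-unrolling" looks like in general - not poke349's actual code: four independent accumulators let the compiler keep the whole inner loop in registers, and x64's 16 general-purpose + 16 XMM registers give it far more room than the 8 + 8 of x86.)
    Code:
    #include <cstddef>

    // Sum an array with the loop unrolled by 4, using independent accumulators.
    double sum_unrolled(const double* a, std::size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {    // unrolled by 4
            s0 += a[i + 0];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; ++i) s0 += a[i];  // leftover elements
        return (s0 + s1) + (s2 + s3);
    }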
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  6. #306
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    Here's what the N270 in my EEE 1000HE does (win 7 - completely stock).

    --Matt
    My Rig :
    Core i5 4570S - ASUS Z87I-DELUXE - 16GB Mushkin Blackline DDR3-2400 - 256GB Plextor M5 Pro Xtreme

  7. #307
    Xtreme X.I.P. Particle's Avatar
    Join Date
    Apr 2008
    Location
    Kansas
    Posts
    3,219
    I may not set any world records, but at least I'm on the board. Was a bit weird having 16 threads spawned for 12 physical cores.

    2x AMD Opteron 2427 @ 2.2GHz
    8GB ECC/Reg DDR2-800
    Windows Vista SP2 x64

    Awaiting the new swap mode.
    Particle's First Rule of Online Technical Discussion:
    As a thread about any computer related subject has its length approach infinity, the likelihood and inevitability of a poorly constructed AMD vs. Intel fight also exponentially increases.

    Rule 1A:
    Likewise, the frequency of a car pseudoanalogy to explain a technical concept increases with thread length. This will make many people chuckle, as computer people are rarely knowledgeable about vehicular mechanics.

    Rule 2:
    When confronted with a post that is contrary to what a poster likes, believes, or most often wants to be correct, the poster will pick out only minor details that are largely irrelevant in an attempt to shut out the conflicting idea. The core of the post will be left alone since it isn't easy to contradict what the person is actually saying.

    Rule 2A:
    When a poster cannot properly refute a post they do not like (as described above), the poster will most likely invent fictitious counter-points and/or begin to attack the other's credibility in feeble ways that are dramatic but irrelevant. Do not underestimate this tactic, as in the online world this will sway many observers. Do not forget: Correctness is decided only by what is said last, the most loudly, or with greatest repetition.

    Rule 3:
    When it comes to computer news, 70% of Internet rumors are outright fabricated, 20% are inaccurate enough to simply be discarded, and about 10% are based in reality. Grains of salt--become familiar with them.

    Remember: When debating online, everyone else is ALWAYS wrong if they do not agree with you!

    Random Tip o' the Whatever
    You just can't win. If your product offers feature A instead of B, people will moan how A is stupid and it didn't offer B. If your product offers B instead of A, they'll likewise complain and rant about how anyone's retarded cousin could figure out A is what the market wants.

  8. #308
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by mattkosem View Post
    Here's what the N270 in my EEE 1000HE does (win 7 - completely stock).

    --Matt
    Nice, another Atom to add to the list.


    Quote Originally Posted by Particle View Post
    I may not set any world records, but at least I'm on the board. Was a bit weird having 16 threads spawned for 12 physical cores.

    2x AMD Opteron 2427 @ 2.2GHz
    8GB ECC/Reg DDR2-800
    Windows Vista SP2 x64

    Awaiting the new swap mode.
    Nice. That's the first 12-core that I can add to the list.

    If you haven't taken a look yet:
    I've added 3 entries of the new swap mode to the table.
    Each of them shows a computation that is MUCH too large to fit in RAM.
    And if it isn't obvious enough already, disk bandwidth is pretty much the only thing that matters.

    Even on my workstation with 4 x 1TB (~400 - 500 MB/s), it is still highly bottlenecked by disk.

    So if you have those velociraptors ready by then...
    Also, if bandwidth doesn't scale linearly in RAID 0, then you'll want to undo the RAID. The program is able to manage multiple HDs separately and get perfect linear scaling.
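    (A minimal sketch of that idea, with hypothetical paths and not the program's actual I/O code: give it one directory per physical drive and stripe chunks across them round-robin, so each disk streams sequentially and the aggregate bandwidth scales with the number of drives.)
    Code:
    #include <algorithm>
    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <vector>

    // Stripe a large buffer across several drives, one chunk at a time.
    void write_striped(const std::vector<std::string>& drive_dirs,
                       const char* data, std::size_t bytes, std::size_t chunk_size) {
        std::vector<std::ofstream> files;
        for (const auto& dir : drive_dirs)
            files.emplace_back(dir + "/swap.bin", std::ios::binary);

        for (std::size_t off = 0, i = 0; off < bytes; off += chunk_size, ++i) {
            std::size_t len = std::min(chunk_size, bytes - off);
            files[i % files.size()].write(data + off, (std::streamsize)len);  // round-robin
        }
    }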


    If it goes well enough upon release, I might start a new thread for it:

    "Pi-based Hard Drive Benchmark for the EXTREMELY Patient"




    EDIT:
    About the 16 threads.
    There are a number of algorithms that simply don't work with non-powers of two, so to get around it, I simply round up and let the scheduler take care of it.
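    (My reading of that, as a rough sketch rather than the program's actual code: take the core count, round it up to the next power of two, and spawn that many threads.)
    Code:
    #include <cstdio>

    // Round a core count up to the next power of two.
    unsigned next_power_of_two(unsigned x) {
        unsigned p = 1;
        while (p < x) p <<= 1;
        return p;
    }

    int main() {
        printf("%u cores -> %u threads\n", 12u, next_power_of_two(12));  // 12 cores -> 16 threads
    }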
    Last edited by poke349; 02-09-2010 at 02:12 PM.
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  9. #309
    Xtreme X.I.P. Particle's Avatar
    Join Date
    Apr 2008
    Location
    Kansas
    Posts
    3,219
    Is it more latency sensitive or bandwidth sensitive? The hard drives I've got aren't velociraptors--they're Fujitsu MBA3147RC 147GB 15K SAS drives. They do about 115MB/s each but offer an impressive 5.2ms (in real life) average random access time. I'll have eight of them at the end of the week. Would it be possible for me to help you test a beta?
    Particle's First Rule of Online Technical Discussion:
    As a thread about any computer related subject has its length approach infinity, the likelihood and inevitability of a poorly constructed AMD vs. Intel fight also exponentially increases.

    Rule 1A:
    Likewise, the frequency of a car pseudoanalogy to explain a technical concept increases with thread length. This will make many people chuckle, as computer people are rarely knowledgeable about vehicular mechanics.

    Rule 2:
    When confronted with a post that is contrary to what a poster likes, believes, or most often wants to be correct, the poster will pick out only minor details that are largely irrelevant in an attempt to shut out the conflicting idea. The core of the post will be left alone since it isn't easy to contradict what the person is actually saying.

    Rule 2A:
    When a poster cannot properly refute a post they do not like (as described above), the poster will most likely invent fictitious counter-points and/or begin to attack the other's credibility in feeble ways that are dramatic but irrelevant. Do not underestimate this tactic, as in the online world this will sway many observers. Do not forget: Correctness is decided only by what is said last, the most loudly, or with greatest repetition.

    Rule 3:
    When it comes to computer news, 70% of Internet rumors are outright fabricated, 20% are inaccurate enough to simply be discarded, and about 10% are based in reality. Grains of salt--become familiar with them.

    Remember: When debating online, everyone else is ALWAYS wrong if they do not agree with you!

    Random Tip o' the Whatever
    You just can't win. If your product offers feature A instead of B, people will moan how A is stupid and it didn't offer B. If your product offers B instead of A, they'll likewise complain and rant about how anyone's retarded cousin could figure out A is what the market wants.

  10. #310
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by Particle View Post
    Is it more latency sensitive or bandwidth sensitive? The hard drives I've got aren't velociraptors--they're Fujitsu MBA3147RC 147GB 15K SAS drives. They do about 115MB/s each but offer an impressive 5.2ms (in real life) average random access time. I'll have eight of them at the end of the week. Would it be possible to help you test a beta?
    Bandwidth sensitive. I've pretty much optimized out all the extremely non-sequential stuff...

    With 8GB of ram, latency isn't gonna matter until you push above 150 - 300 billion digits.
    (With 12GB of ram, I start getting latency issues at about 300 - 600 billion digits.
    On my workstation with 64GB, the 4TB of disk space isn't enough for me to reach sizes that are large enough to hit latency slowdowns...)

    EDIT:
    It scales quadratically. Doubling your memory will quadruple this limit...
    Should someone be crazy enough to push higher, the program may automatically do an algorithm switch that trades latency for bandwidth. (So less sensitive to latency, but uses more bandwidth.)
    So it's likely that you'll never feel the latency unless you override the buffering settings to something that's completely messed up... (which the program lets you do)
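    (A quick back-of-the-envelope check of that quadratic scaling, with the constant fitted to the ranges quoted above rather than taken from the program: 12 GB is 1.5x the memory of 8 GB, so the limit should be about 2.25x higher - which lines up with the 150-300 billion vs. 300-600 billion figures.)
    Code:
    #include <cstdio>

    int main() {
        // Fit to the lower end of the quoted 8 GB range (~150 billion digits),
        // then scale quadratically with memory.
        double base_mem = 8.0, base_limit = 150e9;
        for (double mem : {8.0, 12.0, 16.0, 64.0}) {
            double limit = base_limit * (mem / base_mem) * (mem / base_mem);
            printf("%5.1f GB -> ~%5.0f billion digits before latency matters\n", mem, limit / 1e9);
        }
    }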


    Which at those sizes... I doubt anyone is gonna want to tie down their machines for too long.
    (I would estimate a 100 billion digit computation to take at least 100 hours with a single Core i7 and 250 MB/s disk bandwidth.)


    So for everyone else, this isn't gonna be a test for SSDs. With the sheer amount of writes it will do, it'll probably kill an SSD with a couple of runs...


    If everything goes well... I should have a working beta ready in less than 10 days.
    Last edited by poke349; 02-09-2010 at 03:12 PM.
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  11. #311
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    I've added some new details on the new version:
    http://www.numberworld.org/y-crunche...n_history.html

    And with that,
    Here are the first two fully validated Advanced Swap computations with v0.5.2:

    10 billion digits of Pi on my File Server - Pentium D @ 2.8 GHz + 3 GB DDR2 + 160 GB WD
    x86 SSE3 - memory limit set to 1.84 GB.



    10 billion digits of Pi on my Laptop - Core i7 720QM @ 1.6 GHz (stock) + 6 GB DDR3 + 500 GB Seagate
    x64 SSE4.1 ~ Ushio - memory limit set to 3.00 GB.



    Only the old benchmark mode is able to verify whether the digits are correct. (I obviously can't cache the last few digits of every computation size... lol)
    For all other computations (including these swap computations), the digits it prints out will have to match the accepted values in order to complete the validation.
    (The accepted values can easily be found online.)


    These two 10 billion digit computations took quite a long time (21 and 38 hours).
    But that's because neither of the machines was "Xtreme" in any way.
    In particular, my laptop's 500 GB HD is almost full, so it only sustained 50 MB/s of bandwidth.

    I would expect that any "real" system (say a desktop C2Q/Ci7/PII X4) with a decent SATA II 7200 RPM drive will make 10 billion digits an "overnight job", or a "start in the morning, done before getting back from work" job...
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  12. #312
    Registered User
    Join Date
    Sep 2007
    Location
    italy
    Posts
    85
    Quote Originally Posted by bonis62 View Post

    Indeed,
    the difference is small (a bit).
    I often wonder whether it is worth making two versions (64/32-bit)
    for this type of application,
    since the difference is minimal...
    I want to compile my code in 64-bit;
    I want to know what compiler you use for 64-bit.

    ty



    http://www.xstreme.it/primes64.zip

    my first attempt...
    Last edited by bonis62; 02-10-2010 at 07:17 AM.

  13. #313
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by bonis62 View Post
    I want to compile my code in 64-bit;
    I want to know what compiler you use for 64-bit.

    ty



    http://www.xstreme.it/primes64.zip

    my first attempt...
    I use Visual Studio for general coding and compiling because it compiles very quickly.
    But all final versions that use SSE are compiled using the Intel Compiler. It optimizes better than Visual Studio.

    Both support x64.


    Now you can try scaling up the size to more than 4GB of ram.
    Though you won't know that you're bug-free until you test sizes that are large enough to overflow all 32-bit indexing in your program.
    Which is where a lot of ram becomes useful... An array of 32-bit integers won't overflow a 32-bit index until it's larger than 16GB. And for 64-bit double, you need 32GB of ram...
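    (A hedged sketch of that indexing trap - a made-up example, not anything from the actual code - showing why the bug can't even fire on a machine without that much RAM.)
    Code:
    #include <cstdint>
    #include <cstdio>

    int main() {
        // The first array size a 32-bit index can no longer address: 2^32 elements.
        uint64_t elements = 1ull << 32;
        printf("int32 array:  %llu GB\n",
               (unsigned long long)(elements * sizeof(int32_t) / (1ull << 30)));  // 16 GB
        printf("double array: %llu GB\n",
               (unsigned long long)(elements * sizeof(double) / (1ull << 30)));   // 32 GB

        uint32_t bad_index = (uint32_t)elements;  // silently wraps to 0 -- the overflow bug
        printf("a 32-bit index at that size wraps to %u\n", bad_index);
    }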

    But that's just supersizing... way more than what's needed for most applications.
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  14. #314
    Registered User
    Join Date
    Sep 2007
    Location
    italy
    Posts
    85
    Quote Originally Posted by poke349 View Post
    I use Visual Studio for general coding and compiling because it compiles very quickly.
    But all final versions that use SSE are compiled using the Intel Compiler. It optimizes better than Visual Studio.

    Both support x64.


    Now you can try scaling up the size to more than 4GB of ram.
    Though you won't know that you're bug-free until you test sizes that are large enough to overflow all 32-bit indexing in your program.
    Which is where a lot of ram becomes useful... An array of 32-bit integers won't overflow a 32-bit index until it's larger than 16GB. And for 64-bit double, you need 32GB of ram...

    But that's just supersizing... way more than what's needed for most applications.

    T Y V M


    Using FLOAT instead of DOUBLE saves bits and gains efficiency,
    but I do not know whether YOUR APP can work with FLOAT...
    Just a MS Windows developer... My Web Page

  15. #315
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by bonis62 View Post
    T Y V M


    Using FLOAT instead of DOUBLE saves bits and gains efficiency,
    but I do not know whether YOUR APP can work with FLOAT...
    I'm already pushing "double" to its limit of precision.
    "float" has less than half the precision of "double" so that would require more than 4x the work.

    Actually, because of the way the algorithm works, using type "float" would require MUCH more than 4x the work. It would actually fail above a certain (small) size.

    The run-time complexity is roughly:

    O( n / (w - log(n)) * log( n / (w - log(n)) ) )

    where "n" is the # of digits,
    and "w" is the # of bits of precision in the floating-point.

    When the denominator goes to zero, the run-time (and memory) blows up to infinity - in other words, the algorithm fails.

    This is the reason why I can't use GPU.

    If there was a 128-bit floating-point type that was supported by hardware, the program would actually be MUCH faster.


    EDIT: That complexity is just a reasonable approximation to the true complexity.
    The true complexity (ignoring round-off error), has special functions in it... so it's unreadable to normal people. (even myself)
    Last edited by poke349; 02-10-2010 at 01:08 PM.
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  16. #316
    Registered User
    Join Date
    Sep 2007
    Location
    italy
    Posts
    85
    Quote Originally Posted by poke349 View Post
    I'm already pushing "double" to its limit of precision.
    "float" has less than half the precision of "double" so that would require more than 4x the work.

    Actually, because of the way the algorithm works, using type "float" would require MUCH more than 4x the work. It would actually fail above a certain (small) size.

    The run-time complexity is roughly:

    O( n / (w - log(n)) * log( n / (w - log(n)) ) )

    where "n" is the # of digits,
    and "w" is the # of bits of precision in the floating-point.

    When the denominator goes to zero, the run-time (and memory) blows up to infinity - in other words, the algorithm fails.

    This is the reason why I can't use GPU.

    If there was a 128-bit floating-point type that was supported by hardware, the program would actually be MUCH faster.


    EDIT: That complexity is just a reasonable approximation to the true complexity.
    The true complexity (ignoring round-off error), has special functions in it... so it's unreadable to normal people. (even myself)
    yeah

    if
    n=100.5
    w=64.0

    o= n/(w-Log(n))Log(n/(w-Log(n))) = 1.6922085689143893

    Is this your formula?

    But if you do this:

    o= n/(w-Log(n))Log(n/(w-Log(n))) = 1.6922085689143893
    o2= sqrt(o) = 1.3008491722388069

    you have a large control predictor...

    This is only my theory for bypassing the error.


    Read this:

    http://en.wikipedia.org/wiki/Floating_point
    Last edited by bonis62; 02-10-2010 at 03:05 PM.
    Just a MS Windows developer... My Web Page

  17. #317
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by bonis62 View Post
    yeah

    if
    n=100.5
    w=64.0

    o= n/(w-Log(n))Log(n/(w-Log(n))) = 1.6922085689143893

    Is this your formula?

    But if you do this:

    o= n/(w-Log(n))Log(n/(w-Log(n))) = 1.6922085689143893
    o2= sqrt(o) = 1.3008491722388069

    you have a large control predictor...

    This is only my theory for bypassing the error.


    Read this:

    http://en.wikipedia.org/wiki/Floating_point
    It's just a Big-O... I didn't put any of the constants in there.

    Basically:
    With "double", w = 53. The algorithm becomes impractical when n goes above a billion. The asymptote is reached when n reaches ~25 trillion digits. (give or take a factor of 10 or so...)

    With "float", w = 24. The algorithm becomes becomes impractical at just a few thousand digits. The asymptote is reached when n goes to a mere 60,000 digits... (again, give or take a factor of 10 or so...)


    This is just one of several major algorithms in the program.
    The program doesn't rely on it solely. So 25 trillion digits isn't the limit of the program.


    EDIT:
    Basically, when "w" is much larger than "log(n)", the complexity is roughly O(n log(n)). That's where the algorithm rules since it's quasi-linear.
    So as long as you stay in that range, the algorithm remains efficient.
    But as you push higher and higher, the "log(n)" starts creeping up. Eventually, it becomes inefficient. Then impractical...
    And when n is large enough such that w = log(n), the algorithm fails completely.

    This algorithm is called FFT multiplication.
    Virtually all fast pi-programs use it: SuperPi, PiFast, QuickPi, etc...
    All implementations of it stay in the "efficient" sizes where "log(n)" is much smaller than "w".
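    (To make that concrete with a rough, constants-omitted sketch that is not taken from the program: in a floating-point FFT multiply you pack the number into limbs of b bits each; every output coefficient can get about as large as points * (2^b)^2, so roughly 2b + log2(points) has to stay under w. The usable limb size therefore shrinks as the transform grows, which is where the (w - log n) denominator comes from.)
    Code:
    #include <cmath>
    #include <cstdio>

    // Roughly how many bits per limb a w-bit float leaves you for an FFT of
    // 'points' limbs (guard bits for roundoff ignored). Negative means "fails".
    int max_limb_bits(double points, int w) {
        return (int)std::floor((w - std::log2(points)) / 2.0);
    }

    int main() {
        for (double points : {1e6, 1e9, 1e12, 1e15}) {
            printf("%.0e points: double (w=53) -> %3d bits/limb, float (w=24) -> %3d bits/limb\n",
                   points, max_limb_bits(points, 53), max_limb_bits(points, 24));
        }
    }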
    Last edited by poke349; 02-10-2010 at 03:46 PM.
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  18. #318
    Registered User
    Join Date
    Sep 2007
    Location
    italy
    Posts
    85
    Quote Originally Posted by poke349 View Post
    It's just a Big-O... I didn't put any of the constants in there.

    Basically:
    With "double", w = 53. The algorithm becomes impractical when n goes above a billion. The asymptote is reached when n reaches ~25 trillion digits. (give or take a factor of 10 or so...)

    With "float", w = 24. The algorithm becomes becomes impractical at just a few thousand digits. The asymptote is reached when n goes to a mere 60,000 digits... (again, give or take a factor of 10 or so...)


    This is just one of several major algorithms in the program.
    The program doesn't rely on it solely. So 25 trillion digits isn't the limit of the program.


    EDIT:
    Basically, when "w" is much larger than "log(n)", the complexity is roughly O(n log(n)). That's where the algorithm rules since it's quasi-linear.
    So as long as you stay in that range, the algorithm remains efficient.
    But as you push higher and higher, the "log(n)" starts creeping up. Eventually, it becomes inefficient. Then impractical...
    And when n is large enough such that w = log(n), the algorithm fails completely.

    This algorithm is called FFT multiplication.
    Virtually all fast pi-programs use it: SuperPi, PiFast, QuickPi, etc...
    All implementations of it stay in the "efficient" sizes where "log(n)" is much smaller than "w".
    Last question:

    with the binary splitting method,

    does "the impractical" part get resolved?

    http://numbers.computation.free.fr/C.../programs.html
    Just a MS Windows developer... My Web Page

  19. #319
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by bonis62 View Post
    Last question:

    with the binary splitting method,

    does "the impractical" part get resolved?

    http://numbers.computation.free.fr/C.../programs.html
    I don't think I understand what you mean...
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  20. #320
    Registered User
    Join Date
    Sep 2007
    Location
    italy
    Posts
    85
    Quote Originally Posted by poke349 View Post
    I don't think I understand what you mean...
    See this:

    http://numbers.computation.free.fr/C...splitting.html

    and this:

    http://algolist.ru/download.php?path...rc.zip&pspdf=1

    Are they useful?
    Last edited by bonis62; 02-11-2010 at 05:50 AM.
    Just a MS Windows developer... My Web Page

  21. #321
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    And your point?

    EDIT:
    If you're wondering if any of that is gonna help: I've already been using most of those algorithms since the very beginning.
    Last edited by poke349; 02-11-2010 at 06:31 AM.
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  22. #322
    Registered User
    Join Date
    Sep 2007
    Location
    italy
    Posts
    85
    Quote Originally Posted by poke349 View Post
    And your point?

    EDIT:
    If you're wondering if any of that is gonna help: I've already been using most of those algorithms since the very beginning.
    I really wanted to know your point of view....
    I am new to this type of algorithm;
    I have been reading up on the subject.
    I thought that theory could solve the problem of the zero value
    with a NaN routine:
    if the data is zero, replace it with NaN.
    But if you say that is impossible,
    then it is just impossible.
    Just a MS Windows developer... My Web Page

  23. #323
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by bonis62 View Post
    I really wanted to know your point of view....
    I am new to this type of algorithm;
    I have been reading up on the subject.
    I thought that theory could solve the problem of the zero value
    with a NaN routine:
    if the data is zero, replace it with NaN.
    But if you say that is impossible,
    then it is just impossible.
    k, sorry.

    Now I'm REALLY confused.

    I never mentioned anything about zero or NaN.
    I only mentioned that it was impossible to use "float" to any degree of efficiency. (impossible is kinda strong of a word, I'll just say "extremely difficult")
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  24. #324
    Registered User
    Join Date
    Sep 2007
    Location
    italy
    Posts
    85
    Quote Originally Posted by poke349 View Post
    k, sorry.

    Now I'm REALLY confused.

    I never mentioned anything about zero or NaN.
    I only mentioned that it was impossible to use "float" to any degree of efficiency. (impossible is kinda strong of a word, I'll just say "extremely difficult")
    You said:
    "When the denominator goes to zero, the run-time (and memory) blows up to infinity - in other words, the algorithm fails."

    I replied:
    use NaN and float


    Sorry, I am not a native speaker; I might have, indeed surely have, expressed myself badly.
    Just a MS Windows developer... My Web Page

  25. #325
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by bonis62 View Post
    You said:
    "When the denominator goes to zero, the run-time (and memory) blows up to infinity - in other words, the algorithm fails."

    I replied:
    use NaN and float


    Sorry, I am not a native speaker; I might have, indeed surely have, expressed myself badly.
    I understand you're not a native speaker.

    Either you're being sarcastic, or you're completely missing the point...
    That complexity is just part of the analysis of the algorithm.
    It has nothing to do with 0 or NaN or anything hardware...

    Basically I'm saying that (now, I'm just making these numbers up):

    at 10,000,000 digits, it needs 100MB and 1 second
    at 100,000,000 digits, it needs 2GB and 20 seconds
    at 1 billion digits, it needs 50GB and 10 minutes
    at 10 billion digits, it needs 5TB and 10 hours
    at 100 billion digits, it needs 1000TB and 10 years
    at 1 trillion digits, it needs infinite memory, and infinite time.

    That's what I mean by blows up.

    Basically, you get to the point where you have so many operations that no matter what you do, 53 bits of precision isn't enough, because roundoff error alone takes up 53 bits of precision.
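    (A tiny, self-contained illustration of that 53-bit ceiling - not anything from the program itself: once a value needs more than 53 significand bits, adding 1 to a double is lost entirely to rounding.)
    Code:
    #include <cstdio>

    int main() {
        double big = 9007199254740992.0;   // 2^53
        printf("%.0f\n", big + 1.0);        // prints 9007199254740992 -- the +1 vanished
        printf("%.0f\n", big + 2.0);        // prints 9007199254740994 -- representable again
    }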
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

