Been a while since I've posted... So I thought I'd bump this thread.

There hasn't been much news regarding the program lately.
As I've mentioned before, it's in the middle of a large rewrite that will touch > 80% of all the code. So there isn't gonna be much in the way of updates until the new code is working.
I've also been experimenting with new things (related and unrelated to y-cruncher).

Here are some things that will likely make it into the next major release of y-cruncher.


Hybrid RAID 0/3:
Two level RAID 0 + 0/3. RAID 0 on top of either a RAID 0 or RAID 3 array.
Hard drive failures are a huge problem that's been plaguing Shigeru Kondo's 10 trillion digit attempt.
With RAID 3, the program will be able to handle a hard drive failure in each RAID group.

Assuming hard drive plug-and-play works out, it will be possible to hot-swap dead drives with new ones without closing the program (or needing to revert to a checkpoint).

For pure RAID 0 setups, it will be more efficient than the current y-cruncher code. (The new code is better optimized.)
For hybrid RAID 0/3 setups, some overhead will be incurred for the error-correction math. But it's not significant.
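For anyone curious, the error-correction math behind RAID 3 is just XOR parity. Here's a minimal sketch of the idea in plain Python - the block layout is purely illustrative and has nothing to do with y-cruncher's actual internals:

```python
def make_parity(blocks):
    """XOR all data blocks together to form the parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors + parity."""
    missing = bytearray(parity)
    for block in surviving_blocks:
        for i, byte in enumerate(block):
            missing[i] ^= byte
    return bytes(missing)

# Example: 3 data blocks + 1 parity block; "lose" block 1 and rebuild it.
data = [b"\x01\x02", b"\x10\x20", b"\xAA\xBB"]
parity = make_parity(data)
rebuilt = reconstruct([data[0], data[2]], parity)
assert rebuilt == data[1]
```

Since XOR is its own inverse, one parity drive per group lets you recover from exactly one dead drive in that group - which is why the hybrid 0/3 setup tolerates one failure per RAID group.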

*btw, this is gonna be a pain in the @$$ to test. I'm gonna be physically unplugging drives (from the motherboard) while the program is running.


A new Multiplication Algorithm:
I mentioned this a few times before, but it's finally complete enough to be benchmarked.
Unfortunately, it's slower than what y-cruncher uses right now. However, it is SIMD-scalable and NUMA-friendly.

The "baseline" performance sucks (and I knew it before I started implementing it), but... look at these numbers:

4 x Opteron 8356 @ 2.31 GHz (the one that skycrane sent me):

Code:
Integer Square: 1.6 x 1.6 billion digits
Memory Needed : 4 GB
Build: x64 SSE3

             Current      New
1 thread     72.0697    197.023
2 threads    37.0519    92.7221
4 threads    19.6629    43.2403
8 threads    11.9583    20.9102
16 threads   9.95651    11.184

Times are in seconds.
The current algorithm doesn't scale well beyond 2 sockets.
But the new algorithm has super-linear scaling? I thought it was just normal variation, but nope - it's consistent.
I don't know what the cause is, but it probably has to do with NUMA, since I don't see this awesome behavior on my other machines.
It's almost as fast as the current algorithm at 16 threads. Will 8 sockets (32 cores) be the crossover point?
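For what it's worth, the speedup ratios implied by the timings above make the scaling difference obvious (quick Python sanity check, numbers copied straight from the table):

```python
# Timings in seconds, from the benchmark table above.
current = {1: 72.0697, 2: 37.0519, 4: 19.6629, 8: 11.9583, 16: 9.95651}
new     = {1: 197.023, 2: 92.7221, 4: 43.2403, 8: 20.9102, 16: 11.184}

for t in sorted(current):
    print(f"{t:2d} threads: current {current[1] / current[t]:5.2f}x, "
          f"new {new[1] / new[t]:5.2f}x")
```

The new algorithm comes out to roughly 9.4x at 8 threads (genuinely super-linear), while the current one tops out around 7.2x even at 16 threads.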

In any case, the new algorithm really needs SSE4.1 and AVX to be efficient.

On my Sandy Bridge rig:
SSE4.1 makes it 30% faster than SSE3.
AVX makes it 66% faster than SSE4.1, and 116% faster than SSE3. (2.16x faster)
*No SSE at all sucks so bad that it isn't worth mentioning.
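Those percentages compound multiplicatively, which is where the 2.16x figure comes from:

```python
sse41_over_sse3 = 1.30  # SSE4.1 is 30% faster than SSE3
avx_over_sse41  = 1.66  # AVX is 66% faster than SSE4.1
avx_over_sse3   = sse41_over_sse3 * avx_over_sse41
print(f"AVX vs. SSE3: {avx_over_sse3:.2f}x")  # -> 2.16x, i.e. ~116% faster
```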

With SSE4.1, the new algorithm already beats GMP. (1 thread only, since GMP isn't multi-threaded)
With AVX, it is almost as fast as the current algorithm at a billion digits.

Perhaps Bulldozer will make things more interesting?
The shared 256-bit execution unit will be a huge drawback, but the new algorithm will benefit greatly from FMA and XOP. (The old algorithm will only benefit a tiny bit from FMA.)


Although this new algorithm sucks for small products, it destroys the current y-cruncher algorithm for sizes above 100 billion digits - with or without AVX. (That's why I decided to implement it in the first place.)
So I never intended it to be usable at a "mere" 1.6 billion digits - at least not until we get 512/1024-bit SIMD...