Been a while since I've posted... So I thought I'd bump this thread.
There hasn't been much news regarding the program lately.
As I've mentioned before, it's in the middle of a large rewrite that will touch > 80% of all the code. So there isn't gonna be much in the way of updates until the new code is working.
I've also been experimenting with new things (related and unrelated to y-cruncher).
Here are some things that will likely make it into the next major release of y-cruncher.
Hybrid RAID 0/3:
Two level RAID 0 + 0/3. RAID 0 on top of either a RAID 0 or RAID 3 array.
Hard drive failures are a huge problem plaguing Shigeru Kondo's 10 trillion digit attempt.
With RAID 3, the program will be able to handle a hard drive failure in each RAID group.
Assuming hard drive plug-and-play works out, it will be possible to hot-swap dead drives with new ones without closing the program (or needing to revert to a checkpoint).
For pure RAID 0 setups, it will be more efficient than the current y-cruncher code. (The new code is better optimized.)
For hybrid RAID 0/3 setups, some overhead will be incurred for the error-correction math. But it's not significant.
*btw, this is gonna be a pain in the @$$ to test. I'm gonna be physically unplugging drives (from the motherboard) while the program is running.
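For anyone curious how one parity drive per RAID group can cover a failure: the parity is just the XOR of the data stripes, so the missing stripe is the XOR of everything that survives. This is not y-cruncher's actual code, just a minimal sketch of the idea (drive contents here are made-up byte strings):

```python
# Illustrative sketch of RAID 3-style single-parity recovery.
# Parity = XOR of all data stripes; any one lost stripe can be
# rebuilt by XORing the surviving stripes with the parity.
from functools import reduce

def xor_blocks(blocks):
    """XOR together byte strings of equal length."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data_drives = [b"\x01\x02", b"\x10\x20", b"\xAA\x55"]  # hypothetical data stripes
parity = xor_blocks(data_drives)                       # dedicated parity drive

# Drive 1 dies; rebuild its contents from the survivors plus parity.
survivors = [data_drives[0], data_drives[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data_drives[1]  # lost stripe recovered
```

The per-stripe cost is one XOR pass over the data, which is why the error-correction overhead stays small.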
A new Multiplication Algorithm:
I mentioned this a few times before, but it's finally complete enough to be benchmarked.
Unfortunately, it's slower than what y-cruncher uses right now. However, it is SIMD-scalable and NUMA-friendly.
The "baseline" performance sucks (and I knew it before I started implementing it), but... look at these numbers:
4 x Opteron 8356 @ 2.31 GHz (the one that skycrane sent me):
The current algorithm doesn't scale well beyond 2 sockets.

Code:
Integer Square: 1.6 x 1.6 billion digits
Memory Needed : 4 GB
Build: x64 SSE3

             Current      New
 1 thread    72.0697      197.023
 2 threads   37.0519      92.7221
 4 threads   19.6629      43.2403
 8 threads   11.9583      20.9102
16 threads   9.95651      11.184

Times are in seconds.
But the new algorithm has super-linear scaling? I thought it was just normal variation, but nope. It is consistent.
I don't know what the cause is, but it probably has to do with NUMA since I don't see this awesome behavior on my other machines.
It's almost as fast as the current algorithm at 16 threads. Will 8 sockets (32 cores) be the crossover point?
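Computing the speedups straight from the table above makes the super-linear scaling obvious (speedup = 1-thread time divided by N-thread time):

```python
# Parallel speedup computed from the benchmark table above.
# Speedup at N threads = time(1 thread) / time(N threads).
new_times = {1: 197.023, 2: 92.7221, 4: 43.2403, 8: 20.9102, 16: 11.184}
cur_times = {1: 72.0697, 2: 37.0519, 4: 19.6629, 8: 11.9583, 16: 9.95651}

for n in sorted(new_times):
    print(f"{n:2d} threads: new {new_times[1] / new_times[n]:5.2f}x, "
          f"current {cur_times[1] / cur_times[n]:5.2f}x")
```

The new algorithm gets roughly 9.4x on 8 threads and 17.6x on 16 (above the linear ideal), while the current one tops out around 7.2x on 16 threads.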
In any case, the new algorithm really needs SSE4.1 and AVX to be efficient.
On my Sandy Bridge rig:
SSE4.1 makes it 30% faster than SSE3.
AVX makes it 66% faster than SSE4.1, and 116% faster than SSE3. (2.16x faster)
*No SSE at all sucks so bad that it isn't worth mentioning.
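The 116% figure is just the two speedups compounding multiplicatively:

```python
# SSE3 -> SSE4.1 is a 1.30x speedup; SSE4.1 -> AVX is a 1.66x speedup.
# Speedups compound multiplicatively, so SSE3 -> AVX is about
# 1.30 * 1.66 = 2.16x, i.e. ~116% faster.
sse41_over_sse3 = 1.30
avx_over_sse41 = 1.66
avx_over_sse3 = sse41_over_sse3 * avx_over_sse41
print(round(avx_over_sse3, 2))  # → 2.16
```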
With SSE4.1, the new algorithm already beats GMP. (1 thread only, since GMP isn't multi-threaded)
With AVX, it is almost as fast as the current algorithm at a billion digits.
Perhaps Bulldozer will make things more interesting?
The sharing of the 256-bit execution unit will be a huge drawback, but the new algorithm will benefit greatly from FMA and XOP. (The old algorithm will only benefit a tiny bit from FMA.)
Although this new algorithm sucks for small products, it destroys the current y-cruncher algorithm for sizes above 100 billion digits - with or without AVX. (That's why I decided to implement it in the first place.)
So I never intended it to be usable at a "mere" 1.6 billion digits - at least not until we get 512/1024-bit SIMD...