While using "the SuperPi" algorithm [Gauss-Legendre
http://en.wikipedia.org/wiki/Gauss-Legendre_algorithm] it is absolutely multi-threadable. Other algorithms probably too.
The reason is simple: there are only about 10 matematical operations each loop - the rest is computing it with custom-written methematical operations using very high precision [eg. adding 30MB to compute a result of simple addition ;)].
I've done it a little while ago, looks like this: [FORTRAN77, oldschool :p:, but original SuperPi was also written in FORTRAN ;)]
Code:
program pi_calc
:banana::banana::banana::banana::banana::banana:implicit none
integer i
real*8 a(1:21),b(1:21),t(1:21),p(1:21)
real*8 pi
a(1)=1
b(1)=1/sqrt(2.0)
t(1)=0.25
p(1)=1
do i=1,20,1
:banana::banana:a(i+1)=(a(i)+b(i))/2
b(i+1)=sqrt(a(i)*(b(i)))
t(i+1)=t(i)-p(i)*(a(i)-a(i+1))**2
p(i+1)=2*p(i)
pi=(a(i)+b(i))**2/(4*t(i))
write(*,*)i,pi
enddo
stop
end
"do ____ enddo" - loop of course, the main algorithm
"**" - power
output:
1 2.91421352106
2 3.14057922577
3 3.14159262165
4 3.14159262902
5 3.14159262902
6 3.14159262902
.............................................
That program without custom-written mathematical formulas only calculates ~15 digits and in fact not all digits are correct, the accuracy is not so great :p:. But with that formulas it is a "classic SuperPi" - 1M with 19 iterations etc :).
As we can see the idea is really simple, only ~10 mathematical operations per loop.
The thing is to implement mathematical operations which can handle veeeery accurate precision [30M of digits or as we like].
So there is:
addition [add bits with carry, simple and multithreadable with some error margin [if carry will happen too many times ;)], division by 2 [simple binary "shift left"] - very fast in fact, not worth optimizing;
square root - I don't know the algorithm, but this is probably the slowest in that program - if it really is slowest hmm... I dunno if this is easy multithreadable
multiplication - probably faster than sqrt but non-comparable slower than addidtion for sure - the simple algorithm is as we all know to multiplicate in row and add - it will be very time-eating [30MB * 30MB means ~10^12 operations, I dunno if it is coded that way, probably there is faster algorithm, because 10^12 cycles in 1 second means a terahertz] - this is absolutely multithreadable [perfect scaling I can say]. IIRC 32-bit CPU can multiply 32-bit * 32-bit in 6 cycles which will be more real, but still somehow slow. No, wait - there must be faster algorithm [or it is not the most time-eating], because with doubling the accuracy we are slowing the calculations a little more than twice. Or maybe there is some trick to increase number of digits with each loop, but I don't think so [every loop takes +- the same amount of time].
division - also not that simple to write with threads ["not that simple" means "I don't know how to do it" :p:]
So the main problems: write a multi-threaded sqrt and division [power is not that demanding as it is only ^2 which in fact means multiplication].
Sorry if that was too long and I bored You :p:, but I personally find it interesting to know what it is all about :) [especially when we are benching it hours and hours, and all we can see are those 19-24 loops, nothing more ;).
Well, we all can now at least understand what means "not convergent in sqr" - we know what sqrt it is and why it should be convergent :D.