SuperPi on GPU, were going CUDA

**JMKS** · 05-26-2008, 11:58 AM

Originally Posted by Neuuubeh

is calculating pi a process thats easily multi-thread'able?

While using "the SuperPi" algorithm [Gauss-Legendre http://en.wikipedia.org/wiki/Gauss-Legendre_algorithm] it is absolutely multi-threadable. Other algorithms probably too.
The reason is simple: there are only about 10 matematical operations each loop - the rest is computing it with custom-written methematical operations using very high precision [eg. adding 30MB to compute a result of simple addition

].

I've done it a little while ago, looks like this: [FORTRAN77, oldschool

, but original SuperPi was also written in FORTRAN

]

Code:

program pi_calc
&#160;&#160;&#160;&#160;&#160;&#160;implicit none
      integer i
      real*8 a(1:21),b(1:21),t(1:21),p(1:21)
      real*8 pi      
      a(1)=1
      b(1)=1/sqrt(2.0)
      t(1)=0.25
      p(1)=1

      do i=1,20,1
      &#160;&#160;a(i+1)=(a(i)+b(i))/2
        b(i+1)=sqrt(a(i)*(b(i)))
        t(i+1)=t(i)-p(i)*(a(i)-a(i+1))**2
        p(i+1)=2*p(i)
        pi=(a(i)+b(i))**2/(4*t(i))
        write(*,*)i,pi
      enddo

      stop
      end

"do ____ enddo" - loop of course, the main algorithm
"**" - power
output:
1 2.91421352106
2 3.14057922577
3 3.14159262165
4 3.14159262902
5 3.14159262902
6 3.14159262902
.............................................

That program without custom-written mathematical formulas only calculates ~15 digits and in fact not all digits are correct, the accuracy is not so great

. But with that formulas it is a "classic SuperPi" - 1M with 19 iterations etc

.
As we can see the idea is really simple, only ~10 mathematical operations per loop.
The thing is to implement mathematical operations which can handle veeeery accurate precision [30M of digits or as we like].

So there is:
addition [add bits with carry, simple and multithreadable with some error margin [if carry will happen too many times

], division by 2 [simple binary "shift left"] - very fast in fact, not worth optimizing;
square root - I don't know the algorithm, but this is probably the slowest in that program - if it really is slowest hmm... I dunno if this is easy multithreadable
multiplication - probably faster than sqrt but non-comparable slower than addidtion for sure - the simple algorithm is as we all know to multiplicate in row and add - it will be very time-eating [30MB * 30MB means ~10^12 operations, I dunno if it is coded that way, probably there is faster algorithm, because 10^12 cycles in 1 second means a terahertz] - this is absolutely multithreadable [perfect scaling I can say]. IIRC 32-bit CPU can multiply 32-bit * 32-bit in 6 cycles which will be more real, but still somehow slow. No, wait - there must be faster algorithm [or it is not the most time-eating], because with doubling the accuracy we are slowing the calculations a little more than twice. Or maybe there is some trick to increase number of digits with each loop, but I don't think so [every loop takes +- the same amount of time].
division - also not that simple to write with threads ["not that simple" means "I don't know how to do it"

]

So the main problems: write a multi-threaded sqrt and division [power is not that demanding as it is only ^2 which in fact means multiplication].

Sorry if that was too long and I bored You

, but I personally find it interesting to know what it is all about

[especially when we are benching it hours and hours, and all we can see are those 19-24 loops, nothing more

.
Well, we all can now at least understand what means "not convergent in sqr" - we know what sqrt it is and why it should be convergent

.

**audiofreak** · 05-26-2008, 02:08 PM

Originally Posted by Boogerlad

can't you just get a cheapo pci card for your primary display?

Last time I checked it had to be an NVIDIA card or the drivers will not load.

I didn't say you can't run CUDA with just one card -- I said you can't run a CUDA task that lasts longer than 5 seconds on a primary display or it will get aborted by the OS (Windows XP, don't know about Vista but probably the same applies).

As for the formula, best known PI algorithm is Chudnovsky at the moment as used in PiFast43.

**thephenom** · 05-26-2008, 02:30 PM

Interesting way to benchmark a GPU, but what's the point? Does it stand as a benchmark for the GPGPU crowd?

**JMKS** · 05-26-2008, 02:30 PM

Originally Posted by audiofreak

As for the formula, best known PI algorithm is Chudnovsky at the moment as used in PiFast43.

http://home.istar.ca/~lyster/chart.html - comparision, with names of algorithms too

Yes, I know that Gauss-Legendre isn't fastest but this is "the SuperPi" algorithm, if we will change it, it will not be SuperPi in any meaning, just another program calculating Pi

.

**bobbobson** · 05-26-2008, 03:21 PM

Originally Posted by audiofreak

One very important thing to note:

You will have to have (at least) two NVIDIA GPUs in your system to run SuperPI or any other CUDA applicaton on a GPU for more than 5 seconds.

Primary display adapter in Windows cannot work under full load for more than 5 seconds without the task being aborted by the operating system because the OS assumes that the driver has got stuck.

Such a limitation does not apply to the secondary adapter. Therefore, you will be able to use only secondary adapter for running any CUDA applications which take more than 5 seconds to complete their workload.

That means you will have to invest in a dual PCI-Express x16 mainboard and into another NVIDIA card, even if it is only 9600GT (or cheaper) for the primary display adapter.

And when you are already investing, why not go SLI? That way NVIDIA sells two cards and a mainboard chipset. Nice way of to boost the sales.

Some of you already have SLI, and some may want to get it because of this "exciting" announcement so a word of warning to you:

1. You won't be able to run CUDA applications with SLI enabled EVER. Each CUDA application must manage multiple GPUs on its own.

2. Multi-GPU CUDA applications require that each GPU thread be associated with a distinct CPU thread. It means that for maximum performance on a Quad GPU setup you would need Quad-Core CPU as well.

Not a problem. Seeing as we will be processing 2 bajillion didgits of pi a second. The bench will only last for 4 seconds :P

**Lestat** · 05-26-2008, 04:25 PM

Originally Posted by JMKS

While using "the SuperPi" algorithm [Gauss-Legendre http://en.wikipedia.org/wiki/Gauss-Legendre_algorithm] it is absolutely multi-threadable. Other algorithms probably too.
The reason is simple: there are only about 10 matematical operations each loop - the rest is computing it with custom-written methematical operations using very high precision [eg. adding 30MB to compute a result of simple addition

].

I've done it a little while ago, looks like this: [FORTRAN77, oldschool

, but original SuperPi was also written in FORTRAN

]

Code:

program pi_calc
:banana::banana::banana::banana::banana::banana:implicit none
      integer i
      real*8 a(1:21),b(1:21),t(1:21),p(1:21)
      real*8 pi      
      a(1)=1
      b(1)=1/sqrt(2.0)
      t(1)=0.25
      p(1)=1

      do i=1,20,1
      :banana::banana:a(i+1)=(a(i)+b(i))/2
        b(i+1)=sqrt(a(i)*(b(i)))
        t(i+1)=t(i)-p(i)*(a(i)-a(i+1))**2
        p(i+1)=2*p(i)
        pi=(a(i)+b(i))**2/(4*t(i))
        write(*,*)i,pi
      enddo

      stop
      end