SuperPi on GPU, we're going CUDA

05-25-2008, 04:57 PM
Charles Wirth

SuperPi on GPU, we're going CUDA

The first steps were taken to port Superpi to CUDA so we should see working software in less than 30 days.

What is CUDA?
"The CUDA™ Toolkit is a C language development environment for CUDA-enabled GPUs."
http://www.nvidia.com/object/cuda_learn.html

CUDA will enable SuperPi to run on the GPU.

We will soon be able to bench on the hundreds of cores inside our GPUs for an exact performance number. It will be very interesting to see massively parallel versus massive cache.

With SuperPi exceeding 100 million downloads, it will have an existing installed base of 70 million CUDA-capable GPUs to run on, and that base will grow by 100 million per year or so.

Current list of CUDA enabled GPU's
http://www.nvidia.com/object/cuda_learn_products.html

My thanks to everyone around the world for making it one of the most popular benching applications.

It will remain free of adverts and free to use, the program name will include CUDA GPU.
05-25-2008, 05:00 PM
Ashraf

Nice! Will there be a validation, like this: http://www.xtremesystems.com/pi/?
05-25-2008, 05:02 PM
Warboy

Nice! Can't wait.
05-25-2008, 05:04 PM
ownage

It is possible to run CUDA on ATI right? Or will this be a nVidia bench only?
05-25-2008, 05:05 PM
Charles Wirth

I will fix validation, not sure what broke it still. It should work but it doesn't.

I do know when it was served on IIS it does work so I may go that route with its own server.
05-25-2008, 05:05 PM
madcho

cuda is nvidia's API
05-25-2008, 05:06 PM
Charles Wirth

Not sure of the current situation with ATI on GPGPU, they feel it's best to exclude me.
05-25-2008, 05:13 PM
eva2000

sweet holy goodness :D
05-25-2008, 05:14 PM
dinos22

i look forward to doing a 20 second 32M SuperPi run :D eek :D
05-25-2008, 05:15 PM
[XC] gomeler

This should be very interesting. I suspect though that GPGPUs won't have the single-threaded speed that a 5+GHz Core 2 would have. Look forward to this though :up:
05-25-2008, 05:19 PM
ExodusC

Aww, my 7800GTX isn't CUDA enabled... I bet it could beat a lot of the top of the line CPUs in SuperPi, though.

I can't wait to get a new video card... It's gonna be cool to bench GPUs like this!
05-25-2008, 06:10 PM
metro.cl

while you are at it add 64m, 128m and more :)

Great news, i always thought that NVIDIA would do this just to step on Intel's toes.
05-25-2008, 06:13 PM
lowfat

Hopefully this is just the beginning of GPGPU :)
05-25-2008, 06:15 PM
BenchZowner

Sort of comedian look & accent:
"And there goes the world's "cheapest" 2D benchmark" :p:
SuperPi is now a..."3D" benchmark :D

Boy SuperPi records are about to get owned
05-25-2008, 06:17 PM
nfm

Very nice! Make it open source please, I want to see the source code pr0n.
05-25-2008, 06:19 PM
DRCRAWFISH

Woah, this is awesome.

We will be owning CPU records I hope!
05-25-2008, 06:22 PM
dinos22

Quote:

Originally Posted by BenchZowner

Sort of comedian look & accent:
"And there goes the world's "cheapest" 2D benchmark" :p:
SuperPi is now a..."3D" benchmark :D

Boy SuperPi records are about to get owned

see this is the problem.....ppl conceive all sorts of ideas over a few beers :D
you can blame Vince for planting this idea in my head during dinner :ROTF:

Hey Charles how can you name this SuperPi benchmark CUDA GPU and not have any reference to SuperPi

Maybe reconsider calling it CUDA GPU SuperPi or something unless that's what you meant in first place :)
05-25-2008, 06:29 PM
Boogerlad

maybe call it gpuPi?
05-25-2008, 06:32 PM
mrcape

Wow CUDA sounds really cool. Great news!
05-25-2008, 06:41 PM
SKYMTL

Quote:

Originally Posted by FUGGER

It will remain free of adverts and free to use, the program name will include CUDA GPU.

Naming it with the CUDA moniker is the same as advertising IMO.

Great news though. :up:
05-25-2008, 06:53 PM
MuffinFlavored

This inspired me to download the CUDA SDK and toolkit, so I did.
But, then it got blurry.

I hope this rocks.

Instead of multi-threading SuperPi for CPU, make the jump and make hundreds of threads on the GPU. :)
05-25-2008, 07:05 PM
dinos22

what about using those parallel processing units to compute single threaded applications

i thought that should be doable and produce good results as well :shrug:

anyways it will be interesting to see how it all turns out and whether superpi speed will scale with the amount of cores the CPU has

then we would have to start to tweak RAM latencies EEK :D
actually that would not be very smart in case you can't go back to defaults lol
05-25-2008, 07:08 PM
cadaveca

Quote:

Originally Posted by FUGGER

Not sure of the current situation with ATI on GPGPU, they feel it's best to exclude me.

:rolleyes: Really?

Quote:

Originally Posted by Jen-Hsun Huang, 7 weeks ago @ Nvidia Financial Analyst Meeting

Of course it runs on ATI cards! Don't tell THEM that though!
05-25-2008, 07:09 PM
Luka_Aveiro

Quote:

Originally Posted by dinos22

then we would have to start to tweak RAM latencies EEK :D
actually that would not be very smart in case you can't go back to defaults lol

nvflash + vga pci card FTW?

headaches FTL?

:p:

^Jen-Hsun Huang is a very moody guy :D
05-25-2008, 07:10 PM
Cooper

Quote:

Originally Posted by lowfat

Hopefully this is just the beginning of GPGPU :)

It's not the beginning of GPGPU, not the first implementation of SPi on GPU and definitely CUDA was not made for benchmarks...it's just a byproduct.
05-25-2008, 07:32 PM
REBEL900

I used to own a Cuda, great american muscle car :up:
05-25-2008, 08:07 PM
cookerjc

call it GPi :)
05-25-2008, 09:04 PM
Lestat

has there been any HONEST EDUCATED speculation as to the true power of CUDA vs todays dual and quad core cpu's ?
i read the quick article about photoshop running with the GPU plugin and although nothing technical was given the 1 line statement they gave makes it seem like a GPU is light years beyond todays current cpu's.

if this is the case then why hasnt AMD or intel made the CgPU ? i mean,, if AMD/ATI want to win the war,, make the dam next thing an ATI CPU.....

if the GPU is so powerful,,, then,,, ... wtf are they waiting for???
05-25-2008, 09:11 PM
Lestat

and what about SLI-Pi ?? or X-Pi...

given the piss poor multi core support Pi currently has,, IF the gpu's are multi multi multi cores then do you really expect it to work all that well ? and if it does then fix the dam pc version to be multicore capable,, TRULY capable,, not just 80% one core and 20% the other..
05-25-2008, 09:22 PM
[XC] Lead Head

Quote:

Originally Posted by Lestat

has there been any HONEST EDUCATED speculation as to the true power of CUDA vs todays dual and quad core cpu's ?
i read the quick article about photoshop running with the GPU plugin and although nothing technical was given the 1 line statement they gave makes it seem like a GPU is light years beyond todays current cpu's.

if this is the case then why hasnt AMD or intel made the CgPU ? i mean,, if AMD/ATI want to win the war,, make the dam next thing an ATI CPU.....

if the GPU is so powerful,,, then,,, ... wtf are they waiting for???

Ever seen Folding @ Home on ATI GPUs? There ya go. They have released toolkits to do so. The thing with GPUs is that they are very powerful, very parallel "CPUs". They are only very good at doing a few specific tasks; most of the time it's with images and image-related things. The F@H GPU client can't crunch all the same Work Units the CPU version gets, because of limitations of what the GPU can do.
05-25-2008, 10:27 PM
oohms

It would be cool to see if wwww could implement CUDA support in wprime :D
(Especially since its already multithreaded)
05-25-2008, 10:27 PM
xlink

I can see intel shaking in their boots.
05-25-2008, 10:36 PM
saaya

lol, what exactly is the point? :lol:

pi became a common bench cause it turned out to indicate gaming and memory performance with only a quick benchmark run. im curious about this, but just running pi on a gpu might be pretty pointless and not have any relation to real world performance. it might scale when real apps dont scale anymore, or stop scaling when real apps still scale, or scale different etc

But a new bench is always exciting, and im looking forward to more cuda stuff!
im actually surprised there is barely any app that uses a gpu to crunch on data even tho ati and nvidia have both supported this for years... i know its hard to code, but still, the perf you can tap into with such an app is quite big, so it should be worth the effort?
05-25-2008, 11:33 PM
BulldogPO

I´m still waiting for QMC Boinc CUDA client, please somebody make that happen :D
05-26-2008, 01:38 AM
LowRun

Quote:

Originally Posted by SKYMTL

Naming it with the CUDA moniker is the same as advertising IMO.

QFT.
05-26-2008, 03:21 AM
tam2

First f@h, and now this. Nvidia's marketing team works very2 well indeed.
Let's just hope they'll deliver.

-tam2-
05-26-2008, 03:30 AM
GoThr3k

Before everyone thinks GPGPU is the future, i would like to point some things out

1) GPU's are good at calculating parallel stuff
2) GPU's truly truly suck when it comes to branching, because when you branch you stall every other shader...
3) CUDA is slower than when you implement a similar thing in directx, you can program gpu's with directX also instead of using CUDA

And ATI has something like CUDA, called CTM (close to metal), too bad you have to program in assembler with CTM, in CUDA you can program in C & C++
05-26-2008, 04:16 AM
[XC] Lead Head

Quote:

Originally Posted by GoThr3k

Before everyone thinks GPGPU is the future, i would like to point some things out

1) GPU's are good at calculating parallel stuff
2) GPU's truly truly suck when it comes to branching, because when you branch you stall every other shader...
3) CUDA is slower than when you implement a similar thing in directx, you can program gpu's with directX also instead of using CUDA

And ATI has something like CUDA, called CTM (close to metal), too bad you have to program in assembler with CTM, in CUDA you can program in C & C++

So ATI F@H is programmed in Assembly? Anyways, that keeps the nubs out of programming :p:
05-26-2008, 05:04 AM
Lestat

Quote:

Originally Posted by [XC] Lead Head

Ever seen Folding @ Home on ATI GPUs? There ya go. They have released toolkits to do so. The thing with GPUs is that they are very powerful, very parallel "CPUs". They are only very good at doing a few specific tasks; most of the time it's with images and image-related things. The F@H GPU client can't crunch all the same Work Units the CPU version gets, because of limitations of what the GPU can do.

i have seen a few posts on it (the folding client), but to me that doesnt prove alot. in one specific area yes.

and yes i understand the GPU being image/graphic limited my comment was one more of hope than misunderstanding or disbelief.

BUT taking that same knowledge and mentality and making a cpu can't be that life altering that intel/amd can't do it.

i fully understand and appreciate the technology and how different chips are designed to do different things but we do not live, and never have, on a flat plane,,, the world is round,,, to believe that we can't do something will force us to stay in the dark ages of technology.

and yes i have read a few tidbits here and there about tapping into the gpu for extra computing power.
let's just hope they can expand on that in our life time. expand on it in such a way that everyday computing graduates to the next level.
05-26-2008, 05:12 AM
Trouffman

Interesting! Hope this will work fine and can represent something interesting.

There is more and more software using GPGPU to help the CPU under heavy load, like the Adobe Premiere Pro plugin for HD videos, Photoshop CS4 (S Next), Folding @ home etc...

Maybe the next version of SiSoft Sandra will have a measurement for such "GPGPU" calculations?
05-26-2008, 06:47 AM
Cooper

Quote:

And ATI has something like CUDA, called CTM (close to metal), too bad you have to program in assembler with CTM, in CUDA you can program in C & C++

CTM allows both low- and high-level programming, as opposed to high-level only for CUDA.
05-26-2008, 08:20 AM
audiofreak

One very important thing to note:

You will have to have (at least) two NVIDIA GPUs in your system to run SuperPI or any other CUDA application on a GPU for more than 5 seconds.

Primary display adapter in Windows cannot work under full load for more than 5 seconds without the task being aborted by the operating system because the OS assumes that the driver has got stuck.

Such a limitation does not apply to the secondary adapter. Therefore, you will be able to use only secondary adapter for running any CUDA applications which take more than 5 seconds to complete their workload.

That means you will have to invest in a dual PCI-Express x16 mainboard and into another NVIDIA card, even if it is only 9600GT (or cheaper) for the primary display adapter.

And when you are already investing, why not go SLI? That way NVIDIA sells two cards and a mainboard chipset. Nice way to boost the sales.

Some of you already have SLI, and some may want to get it because of this "exciting" announcement so a word of warning to you:

1. You won't be able to run CUDA applications with SLI enabled EVER. Each CUDA application must manage multiple GPUs on its own.

2. Multi-GPU CUDA applications require that each GPU thread be associated with a distinct CPU thread. It means that for maximum performance on a Quad GPU setup you would need Quad-Core CPU as well.
05-26-2008, 08:25 AM
Lestat

Quote:

Originally Posted by audiofreak

One very important thing to note:

You will have to have (at least) two NVIDIA GPUs in your system to run SuperPI or any other CUDA application on a GPU for more than 5 seconds.

Primary display adapter in Windows cannot work under full load for more than 5 seconds without the task being aborted by the operating system because the OS assumes that the driver has got stuck.

Such a limitation does not apply to the secondary adapter. Therefore, you will be able to use only secondary adapter for running any CUDA applications which take more than 5 seconds to complete their workload.

That means you will have to invest in a dual PCI-Express x16 mainboard and into another NVIDIA card, even if it is only 9600GT (or cheaper) for the primary display adapter.

And when you are already investing, why not go SLI? That way NVIDIA sells two cards and a mainboard chipset. Nice way to boost the sales.

Some of you already have SLI, and some may want to get it because of this "exciting" announcement so a word of warning to you:

1. You won't be able to run CUDA applications with SLI enabled EVER. Each CUDA application must manage multiple GPUs on its own.

2. Multi-GPU CUDA applications require that each GPU thread be associated with a distinct CPU thread. It means that for maximum performance on a Quad GPU setup you would need Quad-Core CPU as well.

ewwwwww....

thats not good,,,,
pi doesnt run full core % though maybe we'll be ok.. (atleast it doesnt for CPU)
still an interesting twist

that would be a neat tool since you touched on it, a GPU based task manager... something to show GPU load, processes and memory usage.

someone get on that lol...
05-26-2008, 08:33 AM
jimmyz

Quote:

Originally Posted by Lestat

ewwwwww....

thats not good,,,,
pi doesnt run full core % though maybe we'll be ok.. (atleast it doesnt for CPU)
still an interesting twist

that would be a neat tool since you touched on it, a GPU based task manager... something to show GPU load, processes and memory usage.

someone get on that lol...

Somebody did. RivaTuner charting shows it along with more, as does GPU-Z on its sensor page (ver .22 is out now).
05-26-2008, 10:12 AM
NapalmV5

superpi on cpu: battle of the seconds

superpi on gpu: battle of the milliseconds

so who'll break 9ms ?
05-26-2008, 10:34 AM
Neuuubeh

is calculating pi a process thats easily multi-thread'able?

that thing about needing a second GPU kinda killed my interest in the last second :(. Hope they can make a work-around and we see good software :)
05-26-2008, 10:40 AM
GoThr3k

kinda strange you need a second GPU, some of my comrades using CUDA don't need that
05-26-2008, 10:40 AM
pxhx

nice.. my poor 8500gt is on list.
05-26-2008, 11:27 AM
bowman

Can't see why you'd need a second GPU - just don't stress it 100%, it looks like.

If you need a second GPU to run any GPGPU application I wonder how the ATI f@h client does it, or how folding on nvidia will work.
05-26-2008, 11:54 AM
Boogerlad

can't you just get a cheapo pci card for your primary display?
05-26-2008, 11:58 AM
JMKS

Quote:

Originally Posted by Neuuubeh

is calculating pi a process thats easily multi-thread'able?

While using "the SuperPi" algorithm [Gauss-Legendre http://en.wikipedia.org/wiki/Gauss-Legendre_algorithm] it is absolutely multi-threadable. Other algorithms probably too.
The reason is simple: there are only about 10 mathematical operations each loop - the rest is computing them with custom-written mathematical operations using very high precision [eg. adding 30MB to compute a result of simple addition ;)].

I've done it a little while ago, looks like this: [FORTRAN77, oldschool :p:, but original SuperPi was also written in FORTRAN ;)]

Code:

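      program pi_calc
      implicit none
      integer i
      real*8 a(1:21),b(1:21),t(1:21),p(1:21)
      real*8 pi
c     Gauss-Legendre starting values
      a(1)=1
c     note: sqrt(2.0) is evaluated in single precision here,
c     which limits the accuracy of the printed result
      b(1)=1/sqrt(2.0)
      t(1)=0.25
      p(1)=1
      do i=1,20,1
        a(i+1)=(a(i)+b(i))/2
        b(i+1)=sqrt(a(i)*(b(i)))
        t(i+1)=t(i)-p(i)*(a(i)-a(i+1))**2
        p(i+1)=2*p(i)
c       estimate built from the values at the start of this pass
        pi=(a(i)+b(i))**2/(4*t(i))
        write(*,*)i,pi
      enddo
      stop
      end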

"do ____ enddo" - loop of course, the main algorithm
"**" - power
output:
1 2.91421352106
2 3.14057922577
3 3.14159262165
4 3.14159262902
5 3.14159262902
6 3.14159262902
.............................................

That program without custom-written mathematical routines only calculates ~15 digits and in fact not all digits are correct, the accuracy is not so great :p:. But with those routines it is a "classic SuperPi" - 1M with 19 iterations etc :).
As we can see the idea is really simple, only ~10 mathematical operations per loop.
The thing is to implement mathematical operations which can handle veeeery high precision [30M digits or as many as we like].
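To make the "same loop, more digits" idea concrete, here is a minimal sketch of the same Gauss-Legendre iteration in Python, with the standard `decimal` module standing in for the custom high-precision routines. The function name `gauss_legendre_pi`, the 10 guard digits and the stopping test are assumptions of this sketch, not anything taken from SuperPi itself.

```python
from decimal import Decimal, getcontext

def gauss_legendre_pi(digits):
    """Approximate pi to `digits` decimal places with the Gauss-Legendre iteration."""
    # Guard digits (an assumption of this sketch) keep rounding noise out of the result.
    getcontext().prec = digits + 10
    a = Decimal(1)
    b = Decimal(1) / Decimal(2).sqrt()
    t = Decimal(1) / Decimal(4)
    p = Decimal(1)
    iterations = 0
    # Each pass roughly doubles the number of correct digits.
    while abs(a - b) > Decimal(10) ** -(digits + 5):
        a_next = (a + b) / 2
        b = (a * b).sqrt()            # uses the old a, as in the FORTRAN loop
        t -= p * (a - a_next) ** 2
        p *= 2
        a = a_next
        iterations += 1
    pi = (a + b) ** 2 / (4 * t)
    return pi, iterations

if __name__ == "__main__":
    pi, n = gauss_legendre_pi(50)
    print(n, str(pi)[:52])
```

All the real work is hidden inside `+`, `*`, `/` and `sqrt()` on long numbers, which is exactly the part a GPU port would have to implement itself.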

So there is:
addition [add bits with carry, simple and multithreadable with some error margin [if carry will happen too many times ;)]], division by 2 [simple binary "shift right"] - very fast in fact, not worth optimizing;
square root - I don't know the algorithm, but this is probably the slowest in that program - if it really is the slowest hmm... I dunno if this is easily multithreadable
multiplication - probably faster than sqrt but incomparably slower than addition for sure - the simple algorithm is, as we all know, to multiply row by row and add - it will be very time-eating [30MB * 30MB means ~10^15 operations, I dunno if it is coded that way, probably there is a faster algorithm, because 10^15 operations in 1 second would mean a petahertz] - this is absolutely multithreadable [perfect scaling I can say]. IIRC a 32-bit CPU can multiply 32-bit * 32-bit in 6 cycles which will be more realistic, but still somewhat slow. No, wait - there must be a faster algorithm [or it is not the most time-eating], because with doubling the accuracy we are slowing the calculations a little more than twice. Or maybe there is some trick to increase the number of digits with each loop, but I don't think so [every loop takes +- the same amount of time].
division - also not that simple to write with threads ["not that simple" means "I don't know how to do it" :p:]
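The addition-with-carry and "multiply row by row and add" ideas from the list above can be sketched as follows. This is only an illustration: the base of 10^9 per "limb" and the helper names (`to_limbs`, `from_limbs`, `add_limbs`, `mul_limbs`) are assumptions of the sketch, and nothing here is taken from the SuperPi source.

```python
BASE = 10**9  # each "limb" holds 9 decimal digits (an assumption for this sketch)

def to_limbs(n):
    """Split a non-negative int into little-endian base-10^9 limbs."""
    limbs = []
    while True:
        n, r = divmod(n, BASE)
        limbs.append(r)
        if n == 0:
            return limbs

def from_limbs(limbs):
    """Recombine little-endian limbs into a Python int."""
    n = 0
    for limb in reversed(limbs):
        n = n * BASE + limb
    return n

def add_limbs(x, y):
    """Add two limb arrays; the carry ripples from the low limbs to the high limbs."""
    out, carry = [], 0
    for i in range(max(len(x), len(y))):
        s = carry
        if i < len(x):
            s += x[i]
        if i < len(y):
            s += y[i]
        carry, digit = divmod(s, BASE)
        out.append(digit)
    if carry:
        out.append(carry)
    return out

def mul_limbs(x, y):
    """Schoolbook multiplication: len(x) * len(y) limb products in total."""
    out = [0] * (len(x) + len(y))
    for i, xi in enumerate(x):
        carry = 0
        for j, yj in enumerate(y):
            cur = out[i + j] + xi * yj + carry
            carry, out[i + j] = divmod(cur, BASE)
        out[i + len(y)] += carry
    # Drop leading zero limbs, but keep at least one limb.
    while len(out) > 1 and out[-1] == 0:
        out.pop()
    return out
```

The nested loop in `mul_limbs` is where the quadratic cost comes from. Faster methods such as Karatsuba or FFT-based multiplication exist for this reason, though which one any given Pi program uses is not something this sketch claims to know.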

So the main problems: write a multi-threaded sqrt and division [power is not that demanding as it is only ^2 which in fact means multiplication].
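One common way around a hand-written long division and square root is Newton's iteration, which needs only multiplications and subtractions once a rough starting guess is available. The sketch below uses Python's `decimal` module as the high-precision back end; the function names, the float-based starting guess and the iteration count are assumptions of this example, and it is not a claim about how SuperPi does it.

```python
import math
from decimal import Decimal, getcontext

def newton_reciprocal(d, digits):
    """Approximate 1/d using only multiplication and subtraction."""
    getcontext().prec = digits + 10          # guard digits, an assumption of this sketch
    d = Decimal(d)
    x = Decimal(1.0 / float(d))              # about 16 correct digits from a hardware float
    for _ in range(digits.bit_length()):     # each step roughly doubles the correct digits
        x = x * (2 - d * x)
    return x

def newton_sqrt(d, digits):
    """Approximate sqrt(d) through the reciprocal square root, again without division."""
    getcontext().prec = digits + 10
    d = Decimal(d)
    half = Decimal("0.5")
    y = Decimal(1.0 / math.sqrt(float(d)))   # rough guess for 1/sqrt(d)
    for _ in range(digits.bit_length()):
        y = y * (3 - d * y * y) * half
    return d * y                              # sqrt(d) = d * (1/sqrt(d))
```

With this approach a division `n / d` becomes `n * newton_reciprocal(d, digits)`, so a fast parallel multiplication would carry most of the load for division and square root as well.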

Sorry if that was too long and I bored You :p:, but I personally find it interesting to know what it is all about :) [especially when we are benching it hours and hours, and all we can see are those 19-24 loops, nothing more ;)].
Well, we all can now at least understand what "not convergent in sqr" means - we know what sqrt it is and why it should be convergent :D.
05-26-2008, 02:08 PM
audiofreak

Quote:

Originally Posted by Boogerlad

can't you just get a cheapo pci card for your primary display?

Last time I checked it had to be an NVIDIA card or the drivers will not load.

I didn't say you can't run CUDA with just one card -- I said you can't run a CUDA task that lasts longer than 5 seconds on a primary display or it will get aborted by the OS (Windows XP, don't know about Vista but probably the same applies).

As for the formula, best known PI algorithm is Chudnovsky at the moment as used in PiFast43.
05-26-2008, 02:30 PM
thephenom

Interesting way to benchmark a GPU, but what's the point? Does it stand as a benchmark for the GPGPU crowd?
05-26-2008, 02:30 PM
JMKS

Quote:

Originally Posted by audiofreak

As for the formula, best known PI algorithm is Chudnovsky at the moment as used in PiFast43.

http://home.istar.ca/~lyster/chart.html - comparison, with names of algorithms too :)
Yes, I know that Gauss-Legendre isn't the fastest but this is "the SuperPi" algorithm, if we change it, it will not be SuperPi in any sense, just another program calculating Pi ;).
05-26-2008, 03:21 PM
bobbobson

Quote:

Originally Posted by audiofreak

One very important thing to note:

You will have to have (at least) two NVIDIA GPUs in your system to run SuperPI or any other CUDA application on a GPU for more than 5 seconds.

Primary display adapter in Windows cannot work under full load for more than 5 seconds without the task being aborted by the operating system because the OS assumes that the driver has got stuck.

Such a limitation does not apply to the secondary adapter. Therefore, you will be able to use only secondary adapter for running any CUDA applications which take more than 5 seconds to complete their workload.

That means you will have to invest in a dual PCI-Express x16 mainboard and into another NVIDIA card, even if it is only 9600GT (or cheaper) for the primary display adapter.

And when you are already investing, why not go SLI? That way NVIDIA sells two cards and a mainboard chipset. Nice way to boost the sales.

Some of you already have SLI, and some may want to get it because of this "exciting" announcement so a word of warning to you:

1. You won't be able to run CUDA applications with SLI enabled EVER. Each CUDA application must manage multiple GPUs on its own.

2. Multi-GPU CUDA applications require that each GPU thread be associated with a distinct CPU thread. It means that for maximum performance on a Quad GPU setup you would need Quad-Core CPU as well.

Not a problem. Seeing as we will be processing 2 bajillion digits of pi a second. The bench will only last for 4 seconds :P :up:
05-26-2008, 04:25 PM
Lestat

Quote:

Originally Posted by JMKS

While using "the SuperPi" algorithm [Gauss-Legendre http://en.wikipedia.org/wiki/Gauss-Legendre_algorithm] it is absolutely multi-threadable. Other algorithms probably too.
The reason is simple: there are only about 10 mathematical operations each loop - the rest is computing them with custom-written mathematical operations using very high precision [eg. adding 30MB to compute a result of simple addition ;)].

I've done it a little while ago, looks like this: [FORTRAN77, oldschool :p:, but original SuperPi was also written in FORTRAN ;)]

Code:

      program pi_calc
      implicit none
      integer i
      real*8 a(1:21),b(1:21),t(1:21),p(1:21)
      real*8 pi
c     Gauss-Legendre starting values
      a(1)=1
c     note: sqrt(2.0) is evaluated in single precision here,
c     which limits the accuracy of the printed result
      b(1)=1/sqrt(2.0)
      t(1)=0.25
      p(1)=1
      do i=1,20,1
        a(i+1)=(a(i)+b(i))/2
        b(i+1)=sqrt(a(i)*(b(i)))
        t(i+1)=t(i)-p(i)*(a(i)-a(i+1))**2
        p(i+1)=2*p(i)
c       estimate built from the values at the start of this pass
        pi=(a(i)+b(i))**2/(4*t(i))
        write(*,*)i,pi
      enddo
      stop
      end

"do ____ enddo" - loop of course, the main algorithm
"**" - power
output:
1 2.91421352106
2 3.14057922577
3 3.14159262165
4 3.14159262902
5 3.14159262902
6 3.14159262902
.............................................

That program without custom-written mathematical routines only calculates ~15 digits and in fact not all digits are correct, the accuracy is not so great :p:. But with those routines it is a "classic SuperPi" - 1M with 19 iterations etc :).
As we can see the idea is really simple, only ~10 mathematical operations per loop.
The thing is to implement mathematical operations which can handle veeeery high precision [30M digits or as many as we like].

So there is:
addition [add bits with carry, simple and multithreadable with some error margin [if carry will happen too many times ;)]], division by 2 [simple binary "shift right"] - very fast in fact, not worth optimizing;
square root - I don't know the algorithm, but this is probably the slowest in that program - if it really is the slowest hmm... I dunno if this is easily multithreadable
multiplication - probably faster than sqrt but incomparably slower than addition for sure - the simple algorithm is, as we all know, to multiply row by row and add - it will be very time-eating [30MB * 30MB means ~10^15 operations, I dunno if it is coded that way, probably there is a faster algorithm, because 10^15 operations in 1 second would mean a petahertz] - this is absolutely multithreadable [perfect scaling I can say]. IIRC a 32-bit CPU can multiply 32-bit * 32-bit in 6 cycles which will be more realistic, but still somewhat slow. No, wait - there must be a faster algorithm [or it is not the most time-eating], because with doubling the accuracy we are slowing the calculations a little more than twice. Or maybe there is some trick to increase the number of digits with each loop, but I don't think so [every loop takes +- the same amount of time].
division - also not that simple to write with threads ["not that simple" means "I don't know how to do it" :p:]

So the main problems: write a multi-threaded sqrt and division [power is not that demanding as it is only ^2 which in fact means multiplication].

Sorry if that was too long and I bored You :p:, but I personally find it interesting to know what it is all about :) [especially when we are benching it hours and hours, and all we can see are those 19-24 loops, nothing more ;)].
Well, we all can now at least understand what "not convergent in sqr" means - we know what sqrt it is and why it should be convergent :D.

i have no idea what you just said dude but i just got an ice cream headache from trying to understand that

Quote:

Originally Posted by audiofreak

Last time I checked it had to be an NVIDIA card or the drivers will not load.

I didn't say you can't run CUDA with just one card -- I said you can't run a CUDA task that lasts longer than 5 seconds on a primary display or it will get aborted by the OS (Windows XP, don't know about Vista but probably the same applies).

As for the formula, best known PI algorithm is Chudnovsky at the moment as used in PiFast43.

hell vista does the "the display driver has stopped and has been restarted" all on its own,, you dont need CUDA to make Vista screw up your display adapter!!
05-26-2008, 06:47 PM
moloko

Looking forward to the CUDA port, Charles.
05-26-2008, 07:22 PM
[XC] leviathan18

fugger nice to see the port of super pi to cuda

but is there any possibility that maybe you can talk with some guys at nvidia or the guys working with cuda so we can have this same kind of port to wcg i think the team will benefit a lot from this to reach 2nd place or even shot for first place, we have gpu power unused in the team but also buying gpu is cheaper than psu mobo ram and cpu
05-26-2008, 08:19 PM
Yeknom

Looks like this'll be nice!
05-26-2008, 08:25 PM
dinos22

Quote:

Originally Posted by [XC] leviathan18

fugger nice to see the port of super pi to cuda

but is there any possibility that maybe you can talk with some guys at nvidia or the guys working with cuda so we can have this same kind of port to wcg i think the team will benefit a lot from this to reach 2nd place or even shot for first place, we have gpu power unused in the team but also buying gpu is cheaper than psu mobo ram and cpu

pretty sure we did tell them about wcg but they are saying that community should be encouraged to port it to CUDA

maybe some brilliant mind here or other forums can figure it out so that these figures can skyrocket

otherwise F@H is also something you guys can get into more hehehe
05-26-2008, 08:51 PM
[XC] leviathan18

Quote:

Originally Posted by dinos22

pretty sure we did tell them about wcg but they are saying that community should be encouraged to port it to CUDA

maybe some brilliant mind here or other forums can figure it out so that these figures can skyrocket

otherwise F@H is also something you guys can get into more hehehe

no doubt we can help the f@h team in the meanwhile but my team is wcg so i want this app running in cuda maybe the old fart sorry ehm movieman can move his old hands and get some people trying to see if we can have some ports from wcg to cuda

i really want to see the wcg team in the very top in no time we are doing or max and its been a long year crunching already in the top 3 but we need more steam to really get up there
05-26-2008, 09:10 PM
[XC] riptide

^^^ Hey Lev. It would help if you actually got crunching again. ;)
05-26-2008, 09:28 PM
[XC] leviathan18

Quote:

Originally Posted by [XC] riptide

^^^ Hey Lev. It would help if you actually got crunching again. ;)

i wont show you the burn the laptop left in my leg for crunching and using it in my lap at the same time my quad is down no ram probably as my birthday is this friday i will get some cash and will buy the ram mouse and lcd to get it running again.

really sorry to be down like this but there is no other option for me right now :(
05-26-2008, 11:24 PM
W1zzard

who's coding that? how far are you into development? have you figured out how to perform calculations of higher accuracy on gpus? which algorithm will you choose? have you looked into how to parallelize it and how to avoid branching?
05-26-2008, 11:49 PM
audiofreak

Quote:

Originally Posted by JMKS

http://home.istar.ca/~lyster/chart.html - comparison, with names of algorithms too :)
Yes, I know that Gauss-Legendre isn't the fastest but this is "the SuperPi" algorithm, if we change it, it will not be SuperPi in any sense, just another program calculating Pi ;).

If you make it multi-threaded you are already changing it so it is not SuperPI anymore.

Not to mention executing it on a GPU defeats the purpose of SuperPI being a CPU benchmark. :p:

For those who are writing it, there are a few points about IEEE-754 non-compliance to consider, taken from the latest CUDA 2.0 manual:

- Division is implemented via the reciprocal in a non-standard-compliant way

- Square root is implemented via the reciprocal square root in a non-standard-compliant way

- For addition and multiplication, only round-to-nearest-even and round-towards-zero are supported via static rounding modes; directed rounding towards +/- infinity is not supported

- The conversion of a floating-point value to an integer value in the case where the floating-point value falls outside the range of the integer format is left undefined by IEEE-754. For compute devices, the behavior is to clamp to the end of the supported range. This is unlike the x86 architecture behavior.

Those are the limitations of the GPU hardware, not the CUDA language itself, because computer graphics doesn't need full IEEE-754 compliance anyway.
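To illustrate the last point in the list above, here is a small Python model of the two out-of-range behaviors for a float-to-int32 conversion. It is a sketch only: the function names are made up for this example, the NaN result of 0 is an assumption the quoted text does not cover, and the x86-style variant assumes the SSE convention of returning the "integer indefinite" value 0x80000000 with exceptions masked.

```python
import math

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def float_to_int32_clamped(x):
    """Model of the clamping behavior described for compute devices."""
    if math.isnan(x):
        return 0              # assumption of this sketch, not stated in the quoted text
    if x >= INT32_MAX:
        return INT32_MAX      # saturate at the top of the range
    if x <= INT32_MIN:
        return INT32_MIN      # saturate at the bottom of the range
    return math.trunc(x)      # in range: round towards zero

def float_to_int32_x86_style(x):
    """Model of an x86-style conversion (assumed: out of range gives 0x80000000)."""
    if math.isnan(x) or x >= 2**31 or x < INT32_MIN:
        return INT32_MIN      # the "integer indefinite" bit pattern as a signed value
    return math.trunc(x)
```

For example, 3e9 does not fit in 32 bits, so the first function gives 2147483647 while the second gives -2147483648. A port that compares CPU and GPU results would have to keep conversions in range or handle this difference explicitly.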

Quote:

Originally Posted by bobbobson

Not a problem. Seeing as we will be processing 2 bajillion digits of pi a second. The bench will only last for 4 seconds :P :up:

Unless you use it for stress-testing :p:

Quote:

Originally Posted by [XC] leviathan18

buying gpu is cheaper than psu mobo ram and cpu

Yeah, especially if you have to buy PSU, mobo, RAM, CPU and a case to match the GPU :D
05-27-2008, 12:15 AM
Shintai

Quote:

Originally Posted by Lestat

hell vista does the "the display driver has stopped and has been restarted" all on its own,, you dont need CUDA to make Vista screw up your display adapter!!

Blaming Vista for the GFX makers' ultra poor quality drivers? In XP the same event would have given you a BSOD.
05-27-2008, 12:43 AM
Harshal

This is nice.. NVIDIA is kicking it nice.
Looking forward to the port :)
05-27-2008, 01:27 AM
Jimmer411

Outside of the novelty of this, it is kinda pointless. We already have plenty of ways to gauge the performance of video cards, and that's in both camps.

I'm pretty sure an ATI only benchmark would be flamed to hell at this point.

On a side note I am curious to see how this turns out, not so much for superpi but for the door of other possibilities that this may open up in the future. We have all seen how a CPU handles GPU calculations and now it will be interesting to see how a GPU handles CPU stuff. I would love to have the ability to offload various tasks to the videocard while not gaming. encoding x264 for instance...
05-27-2008, 02:57 AM
BenchZowner

Guys, we're just grabbing an opportunity to run a pure number crunching benchmark on the GPU.
It's not like CPUs running SuperPi Mod v1.5 XS will be competing against GPUs in the same bench.
And it's for benchmarking purposes, so saying that you already have numerous ways to bench graphics cards ain't a valid point.
05-27-2008, 05:09 AM
Luka_Aveiro

Quote:

Originally Posted by Jimmer411

On a side note I am curious to see how this turns out, not so much for superpi but for the door of other possibilities that this may open up in the future. We have all seen how a CPU handles GPU calculations and now it will be interesting to see how a GPU handles CPU stuff. I would love to have the ability to offload various tasks to the videocard while not gaming. encoding x264 for instance...

I quite believe thats the reason why this thread hasn't begun a flame-war ;)
05-27-2008, 05:17 AM
perkam

CudaPi ? :eek:

Perkam
05-27-2008, 05:32 AM
BenchZowner

Quote:

Originally Posted by perkam

CudaPi ? :eek:

Perkam

SuperCudaPi :p:
05-27-2008, 05:39 AM
saaya

Quote:

Originally Posted by GoThr3k

And ATI has something like CUDA, called CTM (close to metal), too bad you have to program in assembler with CTM, in CUDA you can program in C & C++

i thought that too, but its not like that... you cant just run c and c++ code on cuda. its similar to c and c++ afaik, but you still need to basically program pretty much in machine language for cuda.

cuda and ctm are very similar afaik, both are way too complex and way NOT user friendly, which i think is the main reason why we dont see any gpgpu based apps so far.

the directx gpgpu approach seems to be very promising, a united api that drivers are already optimized to work with and that is already somewhat familiar to developers makes more sense than cuda or ctm imo. Plus its going to be ONE api, maintained by ONE party, which means coders dont have to decide for either ati or nvidia, and actually ANY gpu can be used for this, even sis IGP or whatnot :D
05-27-2008, 05:40 AM
Neuuubeh

Quote:

Originally Posted by JMKS

...

thanks for taking the time to type all that stuff :clap:. Was interesting to read :up:
05-27-2008, 05:42 AM
W1zzard

Quote:

Originally Posted by saaya

the directx gpgpu approach seems to be very promising, a united api that drivers are already optimized to work with and that is already somewhat familiar to developers makes more sense than cuda or ctm imo.

yep, even opengl gpgpu is quite easy to do. but ctm/cuda, especially ctm give you much more options to improve performance and flexibility
05-27-2008, 08:39 AM
audiofreak

Quote:

Originally Posted by BenchZowner

Guys, we're just grabbing an opportunity to run a pure number crunching benchmark on the GPU.

Fine, just remember that such a benchmark won't tell you how good your video card will be for running Crysis.

In my opinion, benching just the number crunching part of a GPU is equally insane as benching just the FPU in the CPU.

Simply put, a CPU has other units which may improve its performance for other uses (for example SSE as used in the DivX encoder), and a GPU has other units which may limit its performance for other uses (number of ROPs, TMUs, etc., which determine actual game performance).

What I am trying to say is that I am not sure we really need yet another benchmark with relative instead of absolute performance numbers.

What we also don't need is the CUDA mp3 encoder (and Linux only mind you) but NVIDIA still organized a contest.

mp3 encoding is ridiculously fast on a CPU already, disk I/O is the bottleneck -- it would be much better if they organized a contest for x264 encoder on a GPU or even better wrote one themselves and open-sourced it.
05-27-2008, 08:49 AM
Luka_Aveiro

Quote:

Originally Posted by audiofreak

it would be much better if they organized a contest for x264 encoder on a GPU or even better wrote one themselves and open-sourced it.

Don't know about the open source part, but an x264 encoder on CUDA seems to be in the works.

http://www.youtube.com/watch?v=8C_Pj1Ep4nw
05-27-2008, 08:51 AM
BenchZowner

Quote:

Originally Posted by audiofreak

Fine, just remember that such a benchmark won't tell you how good your video card will be for running Crysis.

That's not my purpose.
If I wanted to test a graphics card's gaming performance, I know how to run and what to run.

Notice the word BENCHING.

Quote:

Originally Posted by audiofreak

In my opinion, benching just the number crunching part of a GPU is equally insane as benching just the FPU in the CPU.

Once again, BENCHING.
We're talking about programs that we ( overclockers ) use to measure the performance of our overclocked systems in specific apps/things.

Quote:

Originally Posted by audiofreak

What I am trying to say is that I am not sure we really need yet another benchmark with relative instead of absolute performance numbers.

I repeat :p:
Benching.
We ( overclockers ) want more and we like it.

-- We need applications to take advantage of our GPUs for normal usage, but this is NOT the thread to talk about CUDA & "real-life" usage.
05-27-2008, 01:35 PM
MrHydes

how much can CUDA impact on general performance?

what is the biggest advantage? how will this influence the market and tech evolution in the near future?

cheers and thanks
05-27-2008, 02:08 PM
conzymaher

Well you can see in the tech demo posted above ^

GPUs are animals at video encoding etc
05-27-2008, 02:54 PM
W1zzard

gpus are only good at workloads that can be parallelized (hundreds of parallel threads). if there is a sequential execution flow, gpus can't show their performance and will be slower than cpus

quick example .. imagine a large excel sheet with a number of rows (the money you spent on drinking, partying and getting laid) that you want to sum up.

one way is to go through the rows one by one and add each row to the previous result -> sequential, like it would run on any CPU today. this will take you N steps for N rows (actually N-1 but let's keep it simple).

on a GPU you could parallelize this and launch a large number of threads that add up groups of two rows each first, like (1+2=a, 3+4=b, 5+6=c..), all of those additions are done at the same time in parallel on the gpu in a single step. once that is done you sum up a+b=a1, c+d=a2 etc... repeat until you have only two numbers left to add together and you get the final result. in total this will take you log2(N) steps. (e.g. for 256 rows -> log2(256) = 8 steps only)

for a small number of rows there won't be much difference and the higher clock speed of the CPU will still outweigh small gains. but once you increase the number of rows you can clearly see what a huge difference this makes. (yes this is simplified, you do not have an infinite amount of execution units)

however, note how much more complex the second example is. most programmers today have coded for their whole life like example 1. now they are supposed to switch to example 2...
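
The two examples above can be sketched as follows. This is plain Python, not CUDA, so nothing actually runs in parallel here; the point is only to count dependent steps. The function names `sequential_sum` and `pairwise_sum` are made up for this illustration, and both assume a non-empty list of rows.

```python
def sequential_sum(rows):
    """Example 1: add rows one by one. Returns (total, dependent steps)."""
    total = rows[0]
    steps = 0
    for value in rows[1:]:
        total += value  # each addition needs the previous result
        steps += 1
    return total, steps


def pairwise_sum(rows):
    """Example 2: add neighbouring pairs, pass after pass.

    All additions inside one pass are independent of each other, so a GPU
    with enough execution units could do a whole pass as a single step.
    Returns (total, number of passes).
    """
    values = list(rows)
    steps = 0
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # odd element carries over unchanged
        values = paired
        steps += 1
    return values[0], steps


rows = list(range(1, 257))   # 256 rows
print(sequential_sum(rows))  # (32896, 255)
print(pairwise_sum(rows))    # (32896, 8)
```

Both versions perform 255 additions in total; the second one only arranges them so that the longest chain of dependent additions is 8 instead of 255.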
05-27-2008, 03:38 PM
Luka_Aveiro

Quote:

Originally Posted by W1zzard

however, note how much more complex the second example is. most programmers today have coded for their whole life like example 1. now they are seduced to switch to example 2...

Fixed :D

Loved the post, W1zzard, really enlightening. :)
05-27-2008, 03:54 PM
Seraphiel

The benchmark side of this is not something that really impresses me (just another benchmark). However, the potential of the calculation speed difference, is something that does impress me. Can CUDA be used to factor numbers, with the promise of faster performance than a CPU?
05-27-2008, 04:04 PM
Charles Wirth

Michael, though I am not up to speed on the tweaks needed to compile correctly, there are two people working on getting me the assistance to get it done.

As to the name, SuperPi 1.6 CUDA GPU

Is there a way to make a universal binary for both manufacturers? Larrabee should be a GPGPU as well.
05-28-2008, 01:01 AM
W1zzard

Quote:

Originally Posted by FUGGER

there are two people working on getting me the assistance to get it done.

if those are somehow affiliated with nvidia maybe call it "techdemo"?
05-28-2008, 01:32 AM
[XC] riptide

If anyone is more interested on the general use of GPU's you could follow up on the feeds here. http://www.gpgpu.org/
05-28-2008, 08:18 AM
saaya

Quote:

Originally Posted by W1zzard

yep, even opengl gpgpu is quite easy to do. but ctm/cuda, especially ctm give you much more options to improve performance and flexibility

hmmm how much of a boost at what expense though? coding for ctm and cuda is a lot more complex than coding for gpgpu directx or ogl, right?

Quote:

Originally Posted by W1zzard

imagine a large excel sheet with a number of rows (the money you spent on drinking, partying and getting laid) that you want to sum up.

you keep a record of all that? :D heheheh

thanks for the example, very interesting :toast:
so basically anything that is coded to run on a server cluster should work well on a gpu, right? so every application that has to do with audio/video processing, filtering, compressing etc, should work well on gpus then, right? im curious when we will see a gpu divx codec
05-28-2008, 08:36 AM
EnJoY

Quote:

Originally Posted by saaya

you keep a record of all that? :D heheheh

Don't we all? :shrug: :rolleyes:
05-28-2008, 08:45 AM
Planet

Quote:

Originally Posted by EnJoY

Don't we all? :shrug: :rolleyes:

It's called previous orders from Newegg lol. Too bad that's only a quarter of all the hardware I've spent money on.
05-28-2008, 09:00 AM
Charles Wirth

My guys are not with Nvidia (doing the work), but I do have developer assistance from Nvidia.

Sascha, one could assume so, but the gains vary: for the examples given they usually range from 8x to beyond 100x.
05-28-2008, 10:16 AM
JMKS

Quote:

Originally Posted by saaya

so basically anything that is coded to run on a server cluster should work well on a gpu, right?

First requirement is - of course - multithreading; and yes - MULTI, not 2 or 4 as we are happy when playing with CPUs.
Second requirement - recode program to be doable on GPU and don't lose that ~30x "possible" performance boost while recoding :p:.
I'm far from being accurate or anything ;), that's just a simple explanation as I see it.
05-28-2008, 11:28 AM
Yakyb

Quote:

Originally Posted by audiofreak

What we also don't need is the CUDA mp3 encoder (and Linux only mind you) but NVIDIA still organized a contest.

mp3 encoding is ridiculously fast on a CPU already, disk I/O is the bottleneck -- it would be much better if they organized a contest for x264 encoder on a GPU or even better wrote one themselves and open-sourced it.

i cant understand that statement how can you say we dont need this?

i would love a program i could throw all my mp3s at and have it recode and normalise them, especially if i didn't have to dedicate a pc for a day to do it.

as much as they can do to possibly improve PC usage is great (there is already a cuda based h.264 encoder)

this is the most exciting thing to come on to xtreme news in a long time have fun Fugger (and yes open source would be great!!)
05-29-2008, 08:28 AM
saaya

well hes right, we dont need this, but its still interesting :)
if it turns out to be a pointless benchmark that doesnt really scale realistically, then it will most likely be forgotten pretty soon...
but not necessarily... it might be fun :D things dont need to make sense to be fun :D
05-29-2008, 09:28 AM
MuffinFlavored

The problem I have found is that the algorithm SuperPi uses (Gauss-Legendre) cannot be multithreaded very well.
The way it works is, every time an "iteration" is performed, more and more numbers are returned. So, each result is dependent on the previous result.
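
For reference, a minimal sketch of the textbook Gauss-Legendre iteration, using Python's `decimal` module. This is not SuperPi's actual code; the function name `gauss_legendre_pi` and the 10 guard digits are assumptions for illustration. Note how every new `a`, `b` and `t` is computed from the values of the previous pass, which is the dependency described above.

```python
from decimal import Decimal, localcontext


def gauss_legendre_pi(digits):
    """Approximate pi to roughly `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits + 10  # guard digits (an arbitrary choice)
        a = Decimal(1)
        b = Decimal(1) / Decimal(2).sqrt()
        t = Decimal(1) / Decimal(4)
        p = Decimal(1)
        # The number of correct digits roughly doubles per pass, so about
        # log2(digits) passes are enough.
        for _ in range(max(1, digits.bit_length())):
            a_next = (a + b) / 2
            b = (a * b).sqrt()            # needs a and b from the previous pass
            t -= p * (a - a_next) ** 2    # needs the old a and the new one
            a = a_next
            p *= 2
        return (a + b) ** 2 / (4 * t)


print(gauss_legendre_pi(30))
```

The passes themselves have to run one after another; any parallelism would have to come from inside the big-number multiplications and square roots of each pass.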

There is probably an algorithm out there that is very good for multithreading.

I think something like wPrime being ported to GPGPU code would be good, because the workload can be distributed to many threads.
If you want to calculate 100 prime numbers, have each thread calculate 1 prime number. (assuming there are 100 threads)
If you want to do 1000, have each thread calculate 10 prime numbers. (assuming there are 100 threads)
The work load is able to be distributed.
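
A rough sketch of that kind of work distribution, in plain Python. It is not wPrime's code: the helper names (`is_prime`, `split_range`, `count_primes`, `parallel_prime_count`) are hypothetical, and instead of "one prime per thread" each worker here checks its own slice of candidate numbers. Python threads will not make this faster; the sketch only shows that the slices are independent and can be handed out freely.

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(n):
    """Trial division; slow but simple."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def split_range(start, stop, workers):
    """Split [start, stop) into `workers` contiguous chunks of near-equal size."""
    size, extra = divmod(stop - start, workers)
    chunks = []
    lo = start
    for w in range(workers):
        hi = lo + size + (1 if w < extra else 0)
        chunks.append((lo, hi))
        lo = hi
    return chunks


def count_primes(chunk):
    """Count primes in one chunk; needs nothing from any other chunk."""
    lo, hi = chunk
    return sum(1 for n in range(lo, hi) if is_prime(n))


def parallel_prime_count(stop, workers=4):
    """Count primes below `stop` by giving each worker its own chunk."""
    chunks = split_range(0, stop, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_primes, chunks))


print(parallel_prime_count(1000, workers=10))  # 168
```

Because no chunk depends on another, the same split would work for 100 workers or for thousands of GPU threads, which is exactly what the Gauss-Legendre loop does not allow.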

This might just be a tech demo for nVIDIA.
05-29-2008, 09:41 AM
initialised

Nice one, shame I haven't got my G92 any more. Any chance that this time round it can have a continuous mode for stress testing?
05-29-2008, 10:53 AM
RaZz!

Quote:

Originally Posted by GoThr3k

[...]

And ATI has something like CUDA, called CTM (close to metal), too bad you have to program in assembler with CTM, in CUDA you can program in C & C++

but then, ati did something wrong as i never heard anything of CTM. i know that there's a folding@home client for ati gpus, but ati never caused sensation with this.
and now nvidia teases customers with marketing regarding their CUDA environment.

seems like ati somehow missed the train to advertise their feature properly?
05-31-2008, 12:12 AM
rozzyroz

will there be a way to test an individual core on a gpu?

on a side note, it would be nice if there was a common compiler for all gpu's (ati's, nvidia's and intel's up and coming one), but that would take these guys working together... :rofl:
05-31-2008, 03:59 PM
initialised

If there is a point to this, it is to let the GPU do the maths that the GPU does best (massively parallel, DSP, Video, Audio, CAD etc). Offload whatever can be offloaded to the GPU while letting the CPU do what it does best.
05-31-2008, 04:45 PM
MuffinFlavored

Quote:

Originally Posted by RaZz!

but then, ati did something wrong as i never heard anything of CTM. i know that there's a folding@home client for ati gpus, but ati never caused sensation with this.
and now nvidia teases customers with marketing regarding their CUDA environment.

seems like ati somehow missed the train to advertise their feature properly?

This is true.
I just got an e-mail already from eVGA saying that you should join their Folding@Home team. :)
I base this on absolutely nothing, but didn't ATI "overly" advertise R600?
05-31-2008, 05:14 PM
Lestat

Quote:

I just got an e-mail already from eVGA, Folding@Home on nVIDIA.

and what did that email say?
