
Thread: SuperPi on GPU, were going CUDA

  1. #51
    Xtreme Member
    Join Date
    Aug 2006
    Location
    Warsaw, Poland
    Posts
    148
    Quote Originally Posted by Neuuubeh View Post
    is calculating pi a process thats easily multi-thread'able?
    With "the SuperPi" algorithm [Gauss-Legendre, http://en.wikipedia.org/wiki/Gauss-Legendre_algorithm] it is absolutely multi-threadable. Other algorithms probably are too.
    The reason is simple: there are only about 10 mathematical operations in each loop. The rest of the time goes into carrying them out with custom-written routines that work at very high precision [e.g. a single "simple" addition has to add up 30MB of data].

    I wrote it a little while ago and it looks like this [FORTRAN77, oldschool, but the original SuperPi was also written in FORTRAN]:

    Code:
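    c     Gauss-Legendre iteration in plain double precision (real*8)
          program pi_calc
          implicit none
          integer i
          real*8 a(1:21),b(1:21),t(1:21),p(1:21)
          real*8 pi
    c     starting values: a0 = 1, b0 = 1/sqrt(2), t0 = 1/4, p0 = 1
    c     note: 2.0 is a single-precision literal, so b(1) is only
    c     accurate to about 7 digits; 2.0d0 would give full real*8
          a(1)=1
          b(1)=1/sqrt(2.0)
          t(1)=0.25
          p(1)=1
    
          do i=1,20,1
    c       arithmetic mean, geometric mean, correction term, weight
            a(i+1)=(a(i)+b(i))/2
            b(i+1)=sqrt(a(i)*(b(i)))
            t(i+1)=t(i)-p(i)*(a(i)-a(i+1))**2
            p(i+1)=2*p(i)
    c       pi estimate from the values at the start of this step
            pi=(a(i)+b(i))**2/(4*t(i))
            write(*,*)i,pi
          enddo
    
          stop
          end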
    "do ____ enddo" - loop of course, the main algorithm
    "**" - power
    output:
    1 2.91421352106
    2 3.14057922577
    3 3.14159262165
    4 3.14159262902
    5 3.14159262902
    6 3.14159262902
    .............................................

    Without custom-written high-precision routines, that program can only calculate ~15 digits, and in fact not all of them are correct: the output above is right to only about 8 digits, because sqrt(2.0) is evaluated in single precision. With such routines it becomes the "classic SuperPi": 1M digits in 19 iterations, etc.
    As we can see, the idea is really simple: only ~10 mathematical operations per loop.
    The hard part is implementing mathematical operations that can handle very high precision [30M digits, or as many as we like].
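
To make the jump from the real*8 toy version to "as many digits as we like" a bit more concrete, here is a minimal sketch of the same Gauss-Legendre loop in Python, with the standard decimal module standing in for the custom high-precision routines. It is only an illustration under that assumption, not how SuperPi itself is coded; the function name gauss_legendre_pi and the 10 guard digits are my own choices.

```python
from decimal import Decimal, localcontext


def gauss_legendre_pi(digits, iterations):
    """Return pi as a string truncated to `digits` decimal places."""
    with localcontext() as ctx:
        # Work with a few guard digits so rounding does not reach the output.
        ctx.prec = digits + 10
        a = Decimal(1)
        b = Decimal(1) / Decimal(2).sqrt()
        t = Decimal(1) / Decimal(4)
        p = Decimal(1)
        for _ in range(iterations):
            a_next = (a + b) / 2          # arithmetic mean
            b = (a * b).sqrt()            # geometric mean
            t -= p * (a - a_next) ** 2    # correction term
            p *= 2
            a = a_next
        pi = (a + b) ** 2 / (4 * t)
    # "3." plus the requested number of decimals
    return str(pi)[: digits + 2]


if __name__ == "__main__":
    print(gauss_legendre_pi(30, 5))
```

Each pass roughly doubles the number of correct digits, which is why so few loops are needed even for very long results. All the real cost sits inside the sqrt, multiply and divide calls that decimal hides here.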

    So there is:
    addition [add bits with carry; simple, and multithreadable as long as a carry does not have to ripple through too many chunks] and division by 2 [a simple binary "shift right"] - very fast in fact, not worth optimizing;
    square root - I don't know the algorithm, but this is probably the slowest part of the program. If it really is the slowest, hmm... I don't know whether it is easy to multithread;
    multiplication - probably faster than sqrt, but incomparably slower than addition for sure. The simple algorithm, as we all know, is to multiply row by row and add, which is very time-consuming [30MB * 30MB means ~10^15 byte-by-byte operations; I don't know if it is coded that way, but there is probably a faster algorithm, because even 10^12 operations in 1 second would need a terahertz]. This is absolutely multithreadable [perfect scaling, I'd say]. IIRC a 32-bit CPU can multiply 32-bit * 32-bit in 6 cycles, which makes it more realistic, but still rather slow. No, wait - there must be a faster algorithm [or multiplication is not the most time-consuming part], because doubling the accuracy slows the calculation down only a little more than twice. Or maybe there is some trick that increases the number of digits with each loop, but I don't think so [every loop takes roughly the same amount of time];
    division - also not that simple to write with threads ["not that simple" means "I don't know how to do it"].
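
The cheap and the expensive operations from this list can be sketched roughly like this. It is a toy version in Python, assuming numbers are stored as little-endian lists of 32-bit "limbs"; the helper names (to_limbs, add_limbs, halve_limbs, mul_limbs) are made up for illustration and say nothing about how SuperPi actually stores its numbers.

```python
BASE = 1 << 32  # assumption: one limb holds 32 bits


def to_limbs(n):
    """Split a non-negative int into little-endian 32-bit limbs."""
    limbs = []
    while True:
        limbs.append(n % BASE)
        n //= BASE
        if n == 0:
            return limbs


def from_limbs(limbs):
    """Rebuild the int from its limbs (used here only for checking)."""
    return sum(limb << (32 * i) for i, limb in enumerate(limbs))


def add_limbs(x, y):
    """Ripple-carry addition: one pass, O(n) limb operations."""
    result, carry = [], 0
    for i in range(max(len(x), len(y))):
        s = (x[i] if i < len(x) else 0) + (y[i] if i < len(y) else 0) + carry
        result.append(s % BASE)
        carry = s // BASE
    if carry:
        result.append(carry)
    return result


def halve_limbs(x):
    """Division by 2 as a one-bit right shift across all limbs."""
    result, carry = [0] * len(x), 0
    for i in reversed(range(len(x))):
        result[i] = (x[i] >> 1) | (carry << 31)
        carry = x[i] & 1
    return result


def mul_limbs(x, y):
    """Schoolbook multiplication: len(x) * len(y) limb products."""
    result = [0] * (len(x) + len(y))
    for i, xi in enumerate(x):
        carry = 0
        for j, yj in enumerate(y):
            cur = result[i + j] + xi * yj + carry
            result[i + j] = cur % BASE
            carry = cur // BASE
        result[i + len(y)] += carry
    return result
```

With 32-bit limbs a 30MB operand is about 7.5 million limbs, so the double loop in mul_limbs would need roughly 5.6 * 10^13 limb products per multiplication, while add_limbs and halve_limbs only need one pass each. The only thing stopping add_limbs from being split into independent chunks is the carry passed from one limb to the next.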

    So the main problems are writing a multi-threaded sqrt and a multi-threaded division [power is not that demanding, as it is only ^2, which in fact means one multiplication].
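
One common way around a separate sqrt and division routine is Newton's iteration, which turns both into a handful of multiplications. The sketch below shows the idea in Python, again with the decimal module standing in for the big-number code. Whether SuperPi does it this way is an assumption on my part, and the names newton_reciprocal and newton_sqrt are hypothetical.

```python
from decimal import Decimal, localcontext


def newton_reciprocal(d, digits):
    """Approximate 1/d using only multiplication and subtraction."""
    with localcontext() as ctx:
        ctx.prec = digits + 10
        d = Decimal(d)
        x = Decimal(1 / float(d))      # hardware float as the starting guess
        correct = 14                   # conservative count of good digits
        while correct < digits + 10:
            x = x * (2 - d * x)        # the error roughly squares each step
            correct *= 2
        return +x                      # round to the working precision


def newton_sqrt(n, digits):
    """Approximate sqrt(n) via the reciprocal square root y ~ 1/sqrt(n)."""
    with localcontext() as ctx:
        ctx.prec = digits + 10
        n = Decimal(n)
        y = Decimal(float(n) ** -0.5)  # starting guess for 1/sqrt(n)
        correct = 14
        while correct < digits + 10:
            y = y * (3 - n * y * y) / 2  # only multiplies and a halving
            correct *= 2
        return +(n * y)                # sqrt(n) = n * (1/sqrt(n))


if __name__ == "__main__":
    print(newton_reciprocal(7, 30))
    print(newton_sqrt(2, 30))
```

Because the number of correct digits doubles on every pass, only a few passes are needed, and each pass is just big multiplications. So if multiplication can be spread over many threads, sqrt and division come along almost for free. The float starting guess limits this sketch to inputs that fit in a hardware double.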

    Sorry if that was too long and I bored you, but I personally find it interesting to know what it is all about [especially when we are benching it for hours and hours, and all we can see are those 19-24 loops, nothing more].
    Well, now we can all at least understand what "not convergent in sqr" means - we know which sqrt it is and why it should converge.
    Last edited by JMKS; 05-26-2008 at 12:03 PM. Reason: indentations, [code]

  2. #52
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Belgrade, Serbia
    Posts
    187
    Quote Originally Posted by Boogerlad View Post
    can't you just get a cheapo pci card for your primary display?
    Last time I checked it had to be an NVIDIA card or the drivers will not load.

    I didn't say you can't run CUDA with just one card -- I said you can't run a CUDA task that lasts longer than 5 seconds on a primary display or it will get aborted by the OS (Windows XP, don't know about Vista but probably the same applies).

    As for the formula, best known PI algorithm is Chudnovsky at the moment as used in PiFast43.
    Last edited by audiofreak; 05-26-2008 at 02:21 PM.

  3. #53
    Xtreme Enthusiast
    Join Date
    Apr 2004
    Posts
    703
    Interesting way to benchmark a GPU, but what's the point? Does it stand as a benchmark for the GPGPU crowd?
    A wiseman once said, "If Bible proves the existence of God, then comic books prove the existence of Superheros."

  4. #54
    Xtreme Member
    Join Date
    Aug 2006
    Location
    Warsaw, Poland
    Posts
    148
    Quote Originally Posted by audiofreak View Post
    As for the formula, best known PI algorithm is Chudnovsky at the moment as used in PiFast43.
    http://home.istar.ca/~lyster/chart.html - a comparison, with the names of the algorithms too
    Yes, I know that Gauss-Legendre isn't the fastest, but this is "the SuperPi" algorithm. If we change it, it will not be SuperPi in any sense, just another program calculating pi.

  5. #55
    Xtreme Member
    Join Date
    Jul 2005
    Location
    Melbourne - Australia
    Posts
    115
    Quote Originally Posted by audiofreak View Post
    One very important thing to note:

    You will have to have (at least) two NVIDIA GPUs in your system to run SuperPI or any other CUDA application on a GPU for more than 5 seconds.

    Primary display adapter in Windows cannot work under full load for more than 5 seconds without the task being aborted by the operating system because the OS assumes that the driver has got stuck.

    Such a limitation does not apply to the secondary adapter. Therefore, you will be able to use only secondary adapter for running any CUDA applications which take more than 5 seconds to complete their workload.

    That means you will have to invest in a dual PCI-Express x16 mainboard and into another NVIDIA card, even if it is only 9600GT (or cheaper) for the primary display adapter.

    And when you are already investing, why not go SLI? That way NVIDIA sells two cards and a mainboard chipset. Nice way to boost sales.

    Some of you already have SLI, and some may want to get it because of this "exciting" announcement so a word of warning to you:

    1. You won't be able to run CUDA applications with SLI enabled EVER. Each CUDA application must manage multiple GPUs on its own.

    2. Multi-GPU CUDA applications require that each GPU thread be associated with a distinct CPU thread. It means that for maximum performance on a Quad GPU setup you would need Quad-Core CPU as well.
    Not a problem. Seeing as we will be processing 2 bajillion digits of pi a second, the bench will only last for 4 seconds :P
    e6420 @ 500x7 - 1.48v
    Asus P5B DLX
    2x1024mb GSkill HZ's @ 500 4-4-4-12 @ 2.25v
    Asus 320mb 8800gts @ 675/985
    ThermalTake Armor Black
    OCZ 850w

  6. #56
    I am Xtreme
    Join Date
    Feb 2005
    Location
    SiliCORN Valley
    Posts
    5,543
    Quote Originally Posted by JMKS View Post
    ...

    i have no idea what you just said dude but i just got an ice cream headache from trying to understand that



    Quote Originally Posted by audiofreak View Post
    Last time I checked it had to be an NVIDIA card or the drivers will not load.

    I didn't say you can't run CUDA with just one card -- I said you can't run a CUDA task that lasts longer than 5 seconds on a primary display or it will get aborted by the OS (Windows XP, don't know about Vista but probably the same applies).

    As for the formula, best known PI algorithm is Chudnovsky at the moment as used in PiFast43.
    hell vista does the "the display driver has stopped and has been restarted" all on its own,, you dont need CUDA to make Vista screw up your display adapter!!
    Last edited by Lestat; 05-26-2008 at 04:27 PM.
    "These are the rules. Everybody fights, nobody quits. If you don't do your job I'll kill you myself.
    Welcome to the Roughnecks"

    "Anytime you think I'm being too rough, anytime you think I'm being too tough, anytime you miss-your-mommy, QUIT!
    You sign your 1248, you get your gear, and you take a stroll down washout lane. Do you get me?"


  7. #57
    Registered User
    Join Date
    Aug 2006
    Location
    Australia
    Posts
    77
    Looking forward to the CUDA port, Charles.
    Team.AU
    Drunken benching well below zero!

  8. #58
    Xtreme Cruncher
    Join Date
    Mar 2005
    Location
    venezuela caracas
    Posts
    6,460
    fugger nice to see the port of super pi to cuda

    but is there any possibility that you could talk with some guys at NVIDIA, or the guys working with CUDA, so we can have the same kind of port for WCG? I think the team would benefit a lot from this to reach 2nd place or even shoot for first place. We have unused GPU power in the team, and buying a GPU is also cheaper than buying a PSU, mobo, RAM and CPU
    Incoming new computer after 5 long years

    YOU want to FIGHT CANCER OR AIDS join us at WCG and help to have a better FUTURE

  9. #59
    Xtreme Member
    Join Date
    Jan 2008
    Posts
    284
    Looks like this'll be nice!

  10. #60
    Xtreme Legend
    Join Date
    Mar 2005
    Location
    Australia
    Posts
    17,242
    Quote Originally Posted by [XC] leviathan18 View Post
    fugger nice to see the port of super pi to cuda

    but is there any possibility that you could talk with some guys at NVIDIA, or the guys working with CUDA, so we can have the same kind of port for WCG? I think the team would benefit a lot from this to reach 2nd place or even shoot for first place. We have unused GPU power in the team, and buying a GPU is also cheaper than buying a PSU, mobo, RAM and CPU
    pretty sure we did tell them about wcg but they are saying that community should be encouraged to port it to CUDA

    maybe some brilliant mind here or other forums can figure it out so that these figures can skyrocket

    otherwise F@H is also something you guys can get into more hehehe
    Team.AU
    Got tube?
    GIGABYTE Australia
    Need a GIGABYTE bios or support?



  11. #61
    Xtreme Cruncher
    Join Date
    Mar 2005
    Location
    venezuela caracas
    Posts
    6,460
    Quote Originally Posted by dinos22 View Post
    pretty sure we did tell them about wcg but they are saying that community should be encouraged to port it to CUDA

    maybe some brilliant mind here or other forums can figure it out so that these figures can skyrocket

    otherwise F@H is also something you guys can get into more hehehe
    no doubt we can help the F@H team in the meanwhile, but my team is WCG, so I want this app running in CUDA. Maybe the old fart, sorry, ehm, movieman can move his old hands and get some people to see if we can have some WCG ports to CUDA

    I really want to see the WCG team at the very top in no time. We are doing our max, and it's been a long year of crunching, already in the top 3, but we need more steam to really get up there

  12. #62
    Moderator
    Join Date
    Mar 2006
    Posts
    8,556
    ^^^ Hey Lev. It would help if you actually got crunching again.

  13. #63
    Xtreme Cruncher
    Join Date
    Mar 2005
    Location
    venezuela caracas
    Posts
    6,460
    Quote Originally Posted by [XC] riptide View Post
    ^^^ Hey Lev. It would help if you actually got crunching again.
    I won't show you the burn the laptop left on my leg from crunching while I was using it on my lap. My quad is down (no RAM). As my birthday is this Friday, I will probably get some cash and buy the RAM, mouse and LCD to get it running again.

    really sorry to be down like this but there is no other option for me right now

  14. #64
    Xtreme Legend
    Join Date
    Jan 2003
    Location
    Stuttgart, Germany
    Posts
    929
    who's coding that? how far are you into development? have you figured out how to perform calculations of higher accuracy on gpus? which algorithm will you choose? have you looked into how to parallelize it and how to avoid branching?

  15. #65
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Belgrade, Serbia
    Posts
    187


    Quote Originally Posted by JMKS View Post
    http://home.istar.ca/~lyster/chart.html - a comparison, with the names of the algorithms too
    Yes, I know that Gauss-Legendre isn't the fastest, but this is "the SuperPi" algorithm. If we change it, it will not be SuperPi in any sense, just another program calculating pi.
    If you make it multi-threaded you are already changing it so it is not SuperPI anymore.

    Not to mention executing it on a GPU defeats the purpose of SuperPI being a CPU benchmark.

    For those who are writing it, there are a few points about IEEE-754 non-compliance to consider, taken from the latest CUDA 2.0 manual:

    - Division is implemented via the reciprocal in a non-standard-compliant way

    - Square root is implemented via the reciprocal square root in a non-standard-compliant way

    - For addition and multiplication, only round-to-nearest-even and round-towards-zero are supported via static rounding modes; directed rounding towards +/- infinity is not supported

    - The conversion of a floating-point value to an integer value in the case where the floating-point value falls outside the range of the integer format is left undefined by IEEE-754. For compute devices, the behavior is to clamp to the end of the supported range. This is unlike how the x86 architecture behaves.

    Those are limitations of the GPU hardware, not of the CUDA language itself, because computer graphics doesn't need full IEEE-754 compliance anyway.
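
To give a feel for why these points matter, here is a small CPU-side sketch in Python. It does not reproduce what the GPU actually computes; it only shows the kind of effect being described, using ordinary doubles. The helper names divide_via_reciprocal and float_to_int32_clamped are made up for this example.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def divide_via_reciprocal(x, y):
    """x / y computed as x * (1 / y): two roundings instead of one."""
    return x * (1.0 / y)


def float_to_int32_clamped(value):
    """Truncate toward zero, clamping out-of-range values to the int32 limits.

    NaN is not handled in this sketch (int() raises ValueError for it).
    """
    if value >= INT32_MAX:
        return INT32_MAX
    if value <= INT32_MIN:
        return INT32_MIN
    return int(value)


if __name__ == "__main__":
    # The reciprocal route can differ from a true division in the last bit.
    print(divide_via_reciprocal(3.0, 10.0), 3.0 / 10.0)
    # Out-of-range values saturate instead of giving an undefined result.
    print(float_to_int32_clamped(1e20), float_to_int32_clamped(float("-inf")))
```

A last-bit difference like the one in the first line is harmless for graphics, but in a big-number routine that builds long results out of many small floating-point steps it has to be accounted for explicitly.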

    Quote Originally Posted by bobbobson View Post
    Not a problem. Seeing as we will be processing 2 bajillion digits of pi a second, the bench will only last for 4 seconds :P
    Unless you use it for stress-testing

    Quote Originally Posted by [XC] leviathan18 View Post
    buying gpu is cheaper than psu mobo ram and cpu
    Yeah, especially if you have to buy PSU, mobo, RAM, CPU and a case to match the GPU

  16. #66
    Xtreme Cruncher
    Join Date
    Aug 2006
    Location
    Denmark
    Posts
    7,747
    Quote Originally Posted by Lestat View Post
    hell vista does the "the display driver has stopped and has been restarted" all on its own,, you dont need CUDA to make Vista screw up your display adapter!!
    Blaming Vista for the GFX makers' ultra-poor-quality drivers? In XP the same event would have given you a BSOD.
    Crunching for Comrades and the Common good of the People.

  17. #67
    Registered User
    Join Date
    Mar 2005
    Location
    Mumbai, India
    Posts
    1,090
    This is nice.. NVIDIA is kicking it nice.
    Looking forward to the port

  18. #68
    Xtreme Addict
    Join Date
    Jul 2006
    Location
    Washington State
    Posts
    1,315
    Outside of the novelty of this, it is kinda pointless. We already have plenty of ways to gauge the performance of video cards, and that's in both camps.

    I'm pretty sure an ATI-only benchmark would be flamed to hell at this point.

    On a side note I am curious to see how this turns out, not so much for superpi but for the door of other possibilities that this may open up in the future. We have all seen how a CPU handles GPU calculations and now it will be interesting to see how a GPU handles CPU stuff. I would love to have the ability to offload various tasks to the videocard while not gaming. encoding x264 for instance...
    Phenom 9950BE @ 3.24Ghz| ASUS M3A78-T | ASUS 4870 | 4gb G.SKILL DDR2-1000 |Silverstone Strider 600w ST60F| XFI Xtremegamer | Seagate 7200.10 320gb | Maxtor 200gb 7200rpm 16mb | Samsung 206BW | MCP655 | MCR320 | Apogee | MCW60 | MM U2-UFO |

    A64 3800+ X2 AM2 @3.2Ghz| Biostar TF560 A2+ | 2gb Crucial Ballistix DDR2-800 | Sapphire 3870 512mb | Aircooled inside a White MM-UFO Horizon |


  19. #69
    One-Eyed Killing Machine
    Join Date
    Sep 2006
    Location
    Inside a pot
    Posts
    6,340
    Guys, we're just grabbing an opportunity to run a pure number-crunching benchmark on the GPU.
    It's not like you will pit CPUs in SuperPi Mod v1.5 XS against GPUs in the same bench.
    And it's for benchmarking purposes, so saying that you already have numerous ways to bench graphics cards isn't a valid point.
    Coding 24/7... Limited forums/PMs time.

    -Justice isn't blind, Justice is ashamed.

    Many thanks to: Sue Wu, Yiwen Lin, Steven Kuo, Crystal Chen, Vivian Lien, Joe Chan, Sascha Krohn, Joe James, Dan Snyder, Amy Deng, Jack Peterson, Hank Peng, Mafalda Cogliani, Olivia Lee, Marta Piccoli, Mike Clements, Alex Ruedinger, Oliver Baltuch, Korinna Dieck, Steffen Eisentein, Francois Piednoel, Tanja Markovic, Cyril Pelupessy (R.I.P. ), Juan J. Guerrero

  20. #70
    Xtreme Addict
    Join Date
    Dec 2007
    Posts
    1,030
    Quote Originally Posted by Jimmer411 View Post
    On a side note I am curious to see how this turns out, not so much for superpi but for the door of other possibilities that this may open up in the future. We have all seen how a CPU handles GPU calculations and now it will be interesting to see how a GPU handles CPU stuff. I would love to have the ability to offload various tasks to the videocard while not gaming. encoding x264 for instance...
    I quite believe that's the reason why this thread hasn't turned into a flame war
    Are we there yet?

  21. #71
    Live Long And Overclock
    Join Date
    Sep 2004
    Posts
    14,058
    CudaPi ?

    Perkam

  22. #72
    One-Eyed Killing Machine
    Join Date
    Sep 2006
    Location
    Inside a pot
    Posts
    6,340
    Quote Originally Posted by perkam View Post
    CudaPi ?

    Perkam
    SuperCudaPi

  23. #73
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by GoThr3k View Post
    And ATI has something like CUDA, called CTM (close to metal), to bad you have to program in assembler with CTM, in CUDA you can program in C & C++
    i thought that too, but it's not like that... you can't just run C and C++ code on CUDA. it's similar to C and C++ afaik, but you still basically need to program pretty much in machine language for CUDA.

    CUDA and CTM are very similar afaik; both are way too complex and NOT user friendly, which i think is the main reason why we don't see any GPGPU-based apps so far.

    the DirectX GPGPU approach seems very promising: a unified API that drivers are already optimized to work with and that is already somewhat familiar to developers makes more sense than CUDA or CTM imo. Plus it's going to be ONE API, maintained by ONE party, which means coders don't have to choose between ATI and NVIDIA, and actually ANY GPU can be used for this, even an SiS IGP or whatnot
    Last edited by saaya; 05-27-2008 at 05:46 AM.

  24. #74
    Xtreme Member
    Join Date
    Mar 2008
    Location
    Germany
    Posts
    351
    Quote Originally Posted by JMKS View Post
    ...
    thanks for taking the time to type all that stuff . Was interesting to read
    X3350 | DFI LP X38 T2R | d9gkx
    9800gtx | Raptor1500AHFD/5000AACS/WD3201ABYS
    Corsair 620HX | Coolermaster CM690

  25. #75
    Xtreme Legend
    Join Date
    Jan 2003
    Location
    Stuttgart, Germany
    Posts
    929
    Quote Originally Posted by saaya View Post
    the DirectX GPGPU approach seems very promising: a unified API that drivers are already optimized to work with and that is already somewhat familiar to developers makes more sense than CUDA or CTM imo.
    yep, even OpenGL GPGPU is quite easy to do. but CTM/CUDA, especially CTM, give you many more options to improve performance and flexibility
