Results 76 to 100 of 175

Thread: AMD does reverse GPGPU, announces OpenCL SDK for x86

  1. #76
    Xtreme Addict
    Join Date
    Apr 2007
    Posts
    1,870
    Quote Originally Posted by Chumbucket843 View Post
    hey im in high school and i take that as an offense but i do see how easy that would be to parallelize and then paste into a forum. it looks like you can do this with a lot of sequences.
    Haha, sorry. Wasn't meant as an insult, just pointing out how smart people in high school are

    Quote Originally Posted by Drwho? View Post
    Knowing that you have to deal with numbers that are way beyond what the CPU or GPU registers can deal with
    Yes, we've already covered that earlier. You would need to implement z=f(x,y) where f is a multiplication function of arbitrarily large numbers. It doesn't change the algorithm, you're still doing a prefix sum on an associative operation. You should already know how to implement f(x,y) since you coded a factorial function on the CPU right?
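    To illustrate the "prefix sum on an associative operation" point, here is a minimal sketch of a Hillis-Steele style inclusive scan using multiplication as the operator. The function name `inclusive_scan` is made up for this example, and the list comprehension only simulates what a GPU would do in one parallel pass; it is not taken from the linked scan paper.

```python
import operator


def inclusive_scan(values, op):
    """Inclusive scan for any associative operator `op` (illustrative only)."""
    result = list(values)
    stride = 1
    while stride < len(result):
        # Each pass reads only the previous pass, so on parallel hardware
        # every position could be updated at the same time.
        previous = result
        result = [
            previous[i] if i < stride else op(previous[i - stride], previous[i])
            for i in range(len(previous))
        ]
        stride *= 2
    return result


# Scanning 1..10 with multiplication yields 1!, 2!, ..., 10!.
factorials = inclusive_scan(range(1, 11), operator.mul)
print(factorials[-1])
```

    Python integers are already unbounded, so `operator.mul` stands in for the f(x,y) mentioned above; swapping in a custom big-number multiply would leave the scan itself unchanged.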

    Quote Originally Posted by xVeinx View Post
    That being said, the factorials calculation can be computed in parallel, but you'd have to break up the problem and be able to keep track of what has/hasn't been calculated, which would be more effectively done on the CPU.
    For non-deterministic problem sets maybe but this is a simplification of an already simple prefix scan. I'm sure Drwho didn't take the time to look it up when I suggested he do so, so I'll help him out.

    http://developer.download.nvidia.com...n/doc/scan.pdf

    This problem is even simpler because you don't need to store an array of operands: for n! you generate a sequence where i=1..n and, for every element a[i], a[i+1]=a[i]+1. So it's just simple math to partition that sequence into M thread blocks, each taking about n/M elements from the sequence. Once all blocks are done with their sub-sequences you do a final reduction of all block results and voila.
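    The partition-then-reduce idea can be sketched on the CPU as follows. This is only an illustration under assumptions: `blocked_factorial` and `block_product` are hypothetical names, worker threads stand in for GPU thread blocks, and in CPython the threads will not actually give a speedup; the point is the structure of the split and the final reduction.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def block_product(bounds):
    """Multiply the integers lo..hi inclusive; no operand array is stored."""
    lo, hi = bounds
    return math.prod(range(lo, hi + 1))


def blocked_factorial(n, num_blocks=4):
    if n < 2:
        return 1
    size = -(-n // num_blocks)  # ceiling division: elements per block
    blocks = [(lo, min(lo + size - 1, n)) for lo in range(1, n + 1, size)]
    with ThreadPoolExecutor(max_workers=num_blocks) as pool:
        partials = list(pool.map(block_product, blocks))
    # Final reduction over the per-block results.
    return math.prod(partials)


print(blocked_factorial(20, num_blocks=4))
```

    Because each block only needs its start and end index, the sequence never has to be materialised in memory, which is the simplification described above.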

  3. #78
    Xtreme Legend
    Join Date
    Jan 2003
    Location
    Stuttgart, Germany
    Posts
    929
    Quote Originally Posted by trinibwoy View Post
    So it's just simple math to partition that sequence into M thread blocks each taking M*2 elements from the sequence. Once all blocks are done with their sub-sequences you do a final reduction of all block results and voila.
    do it please. i genuinely wonder how long it takes to implement and how fast it will run

  4. #79
    Xtreme Addict
    Join Date
    Apr 2007
    Posts
    1,870
    Quote Originally Posted by W1zzard View Post
    do it please. i genuinely wonder how long it takes to implement and how fast it will run
    I'm no CUDA programmer but it's trivial to do in Java on multiple CPU threads using BigDecimal. To get this to run in CUDA you would need to build a similar data structure to represent unbounded integer values on the GPU. That isn't trivial!
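    For a rough idea of what such a data structure involves, here is a sketch of schoolbook multiplication on numbers stored as little-endian lists of 32-bit "limbs", the kind of layout a GPU implementation might plausibly use. All names here (`to_limbs`, `from_limbs`, `multiply_limbs`, `BASE`) are invented for the example, and it is written in plain Python rather than CUDA, so it shows the bookkeeping only, not a GPU-ready kernel.

```python
BASE = 1 << 32  # assumption: one limb per 32-bit word


def to_limbs(value):
    """Split a non-negative integer into little-endian base-2^32 limbs."""
    limbs = []
    while value:
        limbs.append(value % BASE)
        value //= BASE
    return limbs or [0]


def from_limbs(limbs):
    """Rebuild the integer from its limbs (used here only for checking)."""
    value = 0
    for limb in reversed(limbs):
        value = value * BASE + limb
    return value


def multiply_limbs(a, b):
    """Schoolbook O(len(a) * len(b)) multiplication with carry propagation."""
    result = [0] * (len(a) + len(b))
    for i, x in enumerate(a):
        carry = 0
        for j, y in enumerate(b):
            total = result[i + j] + x * y + carry
            result[i + j] = total % BASE
            carry = total // BASE
        result[i + len(b)] += carry
    # Drop leading zero limbs but always keep at least one limb.
    while len(result) > 1 and result[-1] == 0:
        result.pop()
    return result


x = 123456789012345678901234567890
y = 987654321098765432109876543210
print(from_limbs(multiply_limbs(to_limbs(x), to_limbs(y))) == x * y)
```

    The carry chain in the inner loop is sequential, which is one reason a parallel version takes more thought than the CPU version.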

  5. #80
    Xtreme Legend
    Join Date
    Jan 2003
    Location
    Stuttgart, Germany
    Posts
    929
    Quote Originally Posted by trinibwoy View Post
    I'm no CUDA programmer but it's trivial to do in Java on multiple CPU threads using BigDecimal.
    to me that sounds like an argument _for_ x86 "normal programming" and against gpgpu?

  6. #81
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    Quote Originally Posted by W1zzard View Post
    i dont think it is possible to make a general statement like "x86 will outperform opencl".

    from what i understand, the unique feature of opencl is that it can run threads on the cpu and the gpu. so run iterative/branching/recursive stuff on the cpu and use the gpu for stuff like matrix multiplications.

    of course in a commercial environment it is also important how quickly you can get things working. your factorial is a great example for that. 1 hour to write cpu code, weeks to write gpu code. coders cost money, time you often don't have, hardware is cheap, just get a few dozen boxes running the cpu implementation
    It is entirely possible to make that statement in cases in which the code is massively Serial.
    Fast computers breed slow, lazy programmers
    The price of reliability is the pursuit of the utmost simplicity. It is a price which the very rich find most hard to pay.
    http://www.lighterra.com/papers/modernmicroprocessors/
    Modern Ram, makes an old overclocker miss BH-5 and the fun it was

  7. #82
    Xtreme Addict
    Join Date
    Apr 2007
    Posts
    1,870
    Quote Originally Posted by W1zzard View Post
    to me that sounds like an argument _for_ x86 "normal programming" and against gpgpu?
    Nope it just means that people have written billions of lines of code for CPUs over time that other people can reuse. If you had to go reinvent the wheel every time you wrote a program on a CPU you would have the same problem. Just because there's a lot of code already out there for CPUs doesn't make it "normal programming". Eventually there will be similar reusable data structures and libraries for GPUs, but right now we're in the inventing the wheel phase

  8. #83
    Xtreme Addict
    Join Date
    Jul 2007
    Posts
    1,488
    Quote Originally Posted by trinibwoy View Post
    Nope it just means that people have written billions of lines of code for CPUs over time that other people can reuse. If you had to go reinvent the wheel every time you wrote a program on a CPU you would have the same problem. Just because there's a lot of code already out there for CPUs doesn't make it "normal programming". Eventually there will be similar reusable data structures and libraries for GPUs, but right now we're in the inventing the wheel phase
    Not only are the programming tools for GPGPU still evolving, the hardware itself is still evolving too. Will future GPUs have more capabilities and fewer limitations than current GPUs? Definitely.

    It's the same story with x86. It definitely wasn't pretty in its early days.

  9. #84
    Xtreme Legend
    Join Date
    Jan 2003
    Location
    Stuttgart, Germany
    Posts
    929
    Quote Originally Posted by Solus Corvus View Post
    Not only are the programming tools for GPGPU still evolving, the hardware itself is still evolving too. Will future GPUs have more capabilities and fewer limitations than current GPUs? Definitely.

    It's the same story with x86. It definitely wasn't pretty in its early days.
    true, but back then there wasnt a competing technology available that was more mature in every regard except that it came with 1 potential order of magnitude execution time reduction promise.

    does gpgpu offer anything else other than the promise of being faster?

  10. #85
    Xtreme Addict
    Join Date
    Jul 2007
    Posts
    1,488
    Being faster is the only thing that comes to mind at the moment.

    But x86 at its start wasn't faster, easier to develop for, or more mature than some of the other ISAs of the time. IMO it was low cost and ubiquity that led to its early success.

  11. #86
    Xtreme Enthusiast
    Join Date
    Dec 2007
    Posts
    816
    Here are the first few digits of 1000000!


    .....
    .....
    .....
    For those who are crazy enough and want to see it with your own eyes ...
    The Word document with 1000000! inside ... 2.55MB

    The screen shot:

    Without ANY optimization, it took 43735260 milliseconds, about 729 minutes ... around 12 hours ...
    With some threading, I think I can get it down dramatically, I'll just do it for the fun of it ...
    The 8MB of cache on my Nehalem will be really useful when combining the split multiplications ...
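    One common way to organise "split multiplications" is a balanced product tree, sketched below. This is not the code used for the timing above; `range_product` and `tree_factorial` are hypothetical names. The idea is that the two halves are independent (so they could go to separate threads) and that multiplying operands of similar size tends to suit big-number multiplication routines better than growing one huge accumulator one factor at a time.

```python
def range_product(lo, hi):
    """Product of the integers lo..hi inclusive, by recursive halving."""
    if lo > hi:
        return 1
    if lo == hi:
        return lo
    mid = (lo + hi) // 2
    # The two sub-products do not depend on each other.
    return range_product(lo, mid) * range_product(mid + 1, hi)


def tree_factorial(n):
    return range_product(1, n)


print(tree_factorial(25))
```

    The recursion depth grows only with log2(n), so even n = 1000000 stays far below Python's default recursion limit.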

    Basically: Done!
    Last edited by Drwho?; 08-16-2009 at 10:11 AM.
    DrWho, The last of the time lords, setting up the Clock.

  12. #87
    Xtremely Kool
    Join Date
    Jul 2006
    Location
    UK
    Posts
    1,875
    I can do it faster in my head, it's just that copy & paste to the forum from my head is problematic.

  13. #88
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    Quote Originally Posted by Final8ty View Post
    I can do it faster in my head, it's just that copy & paste to the forum from my head is problematic.
    Then drop to scientific notation and give us the first 13 digits [which is more than enough for any engineering project]
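    Getting the leading digits in scientific notation does not require computing the full number. Below is a sketch that uses Stirling's series for ln(n!) with Python's `decimal` module. The function name `factorial_leading_digits`, the 60-digit working precision, and the choice of four correction terms are assumptions made for this example; it is intended for reasonably large n (say n >= 100), where the truncated series is accurate well beyond 13 digits.

```python
from decimal import Decimal, localcontext

# Pi to 50 decimal places, since decimal has no built-in constant.
PI = Decimal("3.14159265358979323846264338327950288419716939937510")


def factorial_leading_digits(n, digits=13):
    """Return (leading digits as a string, base-10 exponent) of n!."""
    with localcontext() as ctx:
        ctx.prec = 60
        x = Decimal(n)
        # Stirling's series: ln n! = n ln n - n + ln(2*pi*n)/2 + 1/(12n) - ...
        ln_fact = (
            x * x.ln() - x + (2 * PI * x).ln() / 2
            + 1 / (12 * x) - 1 / (360 * x**3)
            + 1 / (1260 * x**5) - 1 / (1680 * x**7)
        )
        log10_fact = ln_fact / Decimal(10).ln()
        exponent = int(log10_fact)  # floor, since the value is positive
        mantissa = Decimal(10) ** (log10_fact - exponent)
        return str(mantissa).replace(".", "")[:digits], exponent


lead, exp10 = factorial_leading_digits(1000000)
print(f"1000000! is about {lead[0]}.{lead[1:]}e{exp10}")
```

    Plain double-precision logarithms would not be enough here: the logarithm of 1000000! is in the millions, which leaves too few digits for a 13-digit mantissa, hence the higher-precision arithmetic.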

  14. #89
    Xtreme Enthusiast
    Join Date
    Dec 2007
    Posts
    816
    Quote Originally Posted by W1zzard View Post
    i dont think it is possible to make a general statement like "x86 will outperform opencl".

    from what i understand, the unique feature of opencl is that it can run threads on the cpu and the gpu. so run iterative/branching/recursive stuff on the cpu and use the gpu for stuff like matrix multiplications.

    of course in a commercial environment it is also important how quickly you can get things working. your factorial is a great example for that. 1 hour to write cpu code, weeks to write gpu code. coders cost money, time you often don't have, hardware is cheap, just get a few dozen boxes running the cpu implementation
    OpenCL is to Java what the Larrabee new instructions are to assembly code ... See the point?

    Francois

  15. #90
    Xtreme Legend
    Join Date
    Jan 2003
    Location
    Stuttgart, Germany
    Posts
    929
    Quote Originally Posted by Drwho? View Post
    OpenCL is to Java what the Larrabee new instructions are to assembly code ... See the point?

    Francois
    no

  16. #91
    Xtreme Enthusiast
    Join Date
    Dec 2007
    Posts
    816
    Quote Originally Posted by W1zzard View Post
    no
    OpenCL is a software layer ... with drivers ... it has consequences ... and overhead ... Assembly code is known to be the fastest code on any architecture.
    A simple layer between the LrB new instructions and the programmer will do. We all know that as soon as a desktop application needs performance, it goes down to ASM.

    The massive number of Java desktops is gone, no more need for those ... almost every computer runs x86 from AMD, Intel or VIA.

    Last word ... Drivers seem to be a problem, there is a lot of empirical evidence about this ... a healthy PC uses fewer drivers in the long run.
    Before people make fun of the Intel drivers, some should clean up in front of their own doors ... ;-)

    Last edited by Drwho?; 08-16-2009 at 11:41 AM.

  17. #92
    Xtreme Legend
    Join Date
    Jan 2003
    Location
    Stuttgart, Germany
    Posts
    929
    Quote Originally Posted by Drwho? View Post
    OpenCL is a software layer ... with drivers ... it has consequences ... and overhead ... Assembly code is known to be the fastest code on any architecture.
    A simple layer between the LrB new instructions and the programmer will do. We all know that as soon as a desktop application needs performance, it goes down to ASM.

    Last word ... Drivers seem to be a problem, there is a lot of empirical evidence about this ... a healthy PC uses fewer drivers in the long run.
    Before people make fun of the Intel drivers, some should clean up in front of their own doors ... ;-)

    i haven't seen anyone write assembler code on x86 in many years. even drivers are coded in a high level language. compilers are fairly smart, they produce good code, better than 90% of asm coders. asm code is a nightmare to maintain.
    what do you code in asm at intel?

  18. #93
    Xtreme Mentor
    Join Date
    Sep 2007
    Location
    Ohio
    Posts
    2,977
    Quote Originally Posted by Drwho? View Post
    OpenCL is a software layer ... with drivers ... it has consequences ... and overhead ... Assembly code is known to be the fastest code on any architecture.
    I believe you...

    I posted once that I thought OpenCL would probably be slower than CUDA and got my head bit off!

    The thread was locked in the end.
    Asus Maximus SE X38 / Lapped Q6600 G0 @ 3.8GHz (L726B397 stock VID=1.224) / 7 Ultimate x64 /EVGA GTX 295 C=650 S=1512 M=1188 (Graphics)/ EVGA GTX 280 C=756 S=1512 M=1296 (PhysX)/ G.SKILL 8GB (4 x 2GB) SDRAM DDR2 1000 (PC2 8000) / Gateway FPD2485W (1920 x 1200 res) / Toughpower 1,000-Watt modular PSU / SilverStone TJ-09 BW / (2) 150 GB Raptor's RAID-0 / (1) Western Digital Caviar 750 GB / LG GGC-H20L (CD, DVD, HD-DVD, and BlueRay Drive) / WaterKegIII Xtreme / D-TEK FuZion CPU, EVGA Hydro Copper 16 GPU, and EK NB S-MAX Acetal Waterblocks / Enzotech Forged Copper CNB-S1L (South Bridge heat sink)

  19. #94
    Xtreme Legend
    Join Date
    Jan 2003
    Location
    Stuttgart, Germany
    Posts
    929
    Quote Originally Posted by Talonman View Post
    I posted once that I thought OpenCL would probably be slower than CUDA and got my head bit off!
    i doubt a significant difference will exist, even if nv implements opencl on top of cuda. got the link to the thread?

  20. #95
    Xtreme Enthusiast
    Join Date
    Dec 2007
    Posts
    816
    Quote Originally Posted by Talonman View Post
    I believe you...

    I posted once that I thought OpenCL would probably be slower than CUDA and got my head bit off!

    The thread was locked in the end.
    Please read totally before replying
    Hehe ... CUDA is like OpenCL ... a thick layer between the programmer and the hardware.

    For the next few months it may be fine, but we can't have the PC fragmented across different APIs ... Don't get me wrong, I was a huge fan of GLIDE, but it did not survive "standardization" ... I guess CUDA is in the same position. OpenCL and DirectX Compute are in a better position to become the standard ... but long term, integration will do its work.
    Network cards and sound cards used to be separate; they are now on the motherboard, and with system-on-chip they will very soon be in the CPU.
    This is the nature of this industry: when the CPU catches up from the bottom, it starts doing custom functions with generic transistors. The CPU goes through a transformation to do this, and we are at the beginning of general-purpose x86 cores getting into GFX.

    The soft drivers for sound cards were never as good as the Creative Labs 48-bit cards, but they caught up on the whole required feature set. They are now the vast majority of what ships.

    Let's all share opinions without beating on each other. I may be right, I may be wrong ... It is just what I think, and you are free to disagree.

    Francois
    Last edited by Drwho?; 08-16-2009 at 12:02 PM.

  21. #96
    Xtreme Enthusiast
    Join Date
    Dec 2007
    Posts
    816
    Quote Originally Posted by W1zzard View Post
    i haven't seen anyone write assembler code on x86 in many years. even drivers are coded in a high level language. compilers are fairly smart, they produce good code, better than 90% of asm coders. asm code is a nightmare to maintain.
    what do you code in asm at intel?
    I mostly use compilers for the soft architecture ... but most of the code that needs performance on PCs IS using ASM.

    * Most video codecs do. Check the x264 code, or VTune DivX or Windows Media ...
    * Most 3D rendering software, like 3ds Max, uses intrinsics, which are almost a 1:1 translation to ASM.
    * Most audio programs have their MP3 parts using MMX.
    * Most drivers, including NV and ATI, use ASM to manage cache pollution properly when sending data over PCI Express ... MOVNT and so on ...

    The list is very long; you just don't see it because we don't explain many details ...

    As soon as performance can give you a competitive advantage over your software competitor, people use ASM.

    I help a lot of those guys.

    Francois

  22. #97
    Xtreme Mentor
    Join Date
    Sep 2007
    Location
    Ohio
    Posts
    2,977
    Quote Originally Posted by Drwho? View Post
    Let's all share opinions without beating on each other. I may be right, I may be wrong ... It is just what I think, and you are free to disagree.

    Francois
    I am with you 100%. I was kind of fired up yesterday, and might have gotten out of line toward you. Sorry about that.

    I think we can all learn from each other.

  23. #98
    Xtreme Addict
    Join Date
    Apr 2007
    Posts
    1,870
    Quote Originally Posted by W1zzard View Post
    true, but back then there wasnt a competing technology available that was more mature in every regard except that it came with 1 potential order of magnitude execution time reduction promise.
    I hope you don't believe x86 to be the pinnacle of human innovation in ISAs

    Quote Originally Posted by Solus Corvus View Post
    Being faster is the only thing that comes to mind at the moment.
    And the only thing that matters. It's not like x86 is any more parallel friendly than other platforms.

    But x86 at its start wasn't faster, easier to develop for, or more mature than some of the other ISAs of the time. IMO it was low cost and ubiquity that led to its early success.
    Exactly.

  24. #99
    Xtreme Addict
    Join Date
    Dec 2008
    Location
    Sweden, Linköping
    Posts
    2,034
    Quote Originally Posted by Drwho? View Post
    Network cards and sound cards used to be separate; they are now on the motherboard, and with system-on-chip they will very soon be in the CPU.
    This is the nature of this industry: when the CPU catches up from the bottom, it starts doing custom functions with generic transistors. The CPU goes through a transformation to do this, and we are at the beginning of general-purpose x86 cores getting into GFX.
    So going a bit off-topic perhaps... How far do you think System on Chip will develop within the coming years? I'd like your personal opinion on this

    It is obvious that the 3 major PC vendors, Intel, AMD and Nvidia, have a large interest in this, and each has a solution or is working on one that will be released very soon. Looking 10 years ahead, will anything have changed with our traditional computers, with a motherboard, a CPU and the option of a stronger GPU if we want one? Or will we in 10 years have started the transition where the difference between a CPU and a GPU has become minimal if not non-existent, and all PCs consist of a single chip on a PCB instead of the classical motherboard, CPU, GPU etc.?

    We have seen SoC making its way into phones and smaller handheld products, and soon Qualcomm will release their smartbooks: not performance monsters, but it's an SoC in a small package which I can carry around and use as a PC together with an OS (not Windows ofc.).

    So my question really is, will this pattern follow into the PC market?
    SweClockers.com

    CPU: Phenom II X4 955BE
    Clock: 4200MHz 1.4375v
    Memory: Dominator GT 2x2GB 1600MHz 6-6-6-20 1.65v
    Motherboard: ASUS Crosshair IV Formula
    GPU: HD 5770

  25. #100
    Xtreme Legend
    Join Date
    Jan 2003
    Location
    Stuttgart, Germany
    Posts
    929
    Quote Originally Posted by trinibwoy View Post
    I hope you don't believe x86 to be the pinnacle of human innovation in ISAs
    so which isa is more successful?
