
Thread: AMD does reverse GPGPU, announces OpenCL SDK for x86

  1. #26 · saaya (Xtreme X.I.P. · joined Nov 2002 · Shipai · 31,147 posts)
    nice post francois...
    but i think you exaggerate a bit on how difficult it is to code for gpus

    and talonman, i don't think this means amd has problems with opencl on their gpus... i agree with francois, if anything this shows amd is focusing on making opencl useful by delivering a cpu opencl interface... nvidia brags about having the first gpu opencl driver, yet as francois pointed out, how useful that actually is remains to be seen as it's rather limited afaik. you can do the same thing on a cpu and gpu with opencl i guess, more or less, but most things will be pretty slow on a gpu.

  2. #27 · trinibwoy (Xtreme Addict · joined Apr 2007 · 1,870 posts)
    Quote Originally Posted by saaya View Post
    you can do the same thing on a cpu and gpu with opencl i guess, more or less, but most things will be pretty slow on a gpu.
    You've got that backwards.

  3. #28 · Xtreme Addict (joined Apr 2006 · 2,462 posts)
    Quote Originally Posted by trinibwoy View Post
    You've got that backwards.
    Why? I'm a noob, but I'd guess that the possibilities to use GPUs are limited, for instance due to the need for highly parallelized workloads. Does that happen often?

  4. #29 · trinibwoy (Xtreme Addict · joined Apr 2007 · 1,870 posts)
    Ask yourself this: why would you need OpenCL on a CPU if you already have C/C++? Languages like OpenCL are designed to take advantage of highly parallel machines; they provide no benefit on a CPU over existing languages.

  5. #30 · zanzabar (I am Xtreme · joined Jul 2007 · SF bay area, CA · 15,871 posts)
    Quote Originally Posted by trinibwoy View Post
    Ask yourself this: why would you need OpenCL on a CPU if you already have C/C++? Languages like OpenCL are designed to take advantage of highly parallel machines; they provide no benefit on a CPU over existing languages.
    they do provide an advantage on x86. think of the new intel servers that are out or close to out with 32 threads. and there is no reason not to have opencl on x86: not only does it allow for ease of testing, it's not like people don't use other languages that already have competitors.

  6. #31 · trinibwoy (Xtreme Addict · joined Apr 2007 · 1,870 posts)
    Quote Originally Posted by zanzabar View Post
    they do provide an advantage on x86. think of the new intel servers that are out or close to out with 32 threads. and there is no reason not to have opencl on x86: not only does it allow for ease of testing, it's not like people don't use other languages that already have competitors.
    OpenCL doesn't give you any advantage over C++ no matter the number of threads. It's not magic, just a framework to help you formulate a solution. It provides an advantage on a CPU when you're testing/prototyping something you plan to run on a GPU, or when you want to run the serial part of an application on a CPU with the data-parallel part running on a GPU. CPU-only applications won't see any benefit from OpenCL over regular programming languages. Don't get me wrong, it's great and it's needed. I'm just clarifying that most OpenCL applications will most certainly not be targeted at CPUs, as there are already more powerful and flexible options for doing so.
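    (To make that prototyping flow concrete, a minimal host-side sketch, assuming the stock OpenCL C API with error handling mostly omitted; the only thing that changes between CPU prototyping and GPU deployment is the device-type constant:)

    ===================================
    #include <CL/cl.h>
    #include <cstdio>

    int main()
    {
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, nullptr);

        // Prototype on the CPU; swap in CL_DEVICE_TYPE_GPU later and the
        // same program source and kernels run unchanged on the GPU.
        cl_device_id device;
        if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, nullptr) != CL_SUCCESS)
        {
            std::printf("no CPU OpenCL device found\n");
            return 1;
        }

        char name[256];
        clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, nullptr);
        std::printf("running kernels on: %s\n", name);

        cl_int err;
        cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
        // ... build the program and enqueue kernels exactly as for a GPU ...
        clReleaseContext(ctx);
        return 0;
    }
    ===================================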
    Last edited by trinibwoy; 08-13-2009 at 05:05 PM.

  7. #32 · Xtreme Mentor (joined Sep 2007 · Ohio · 2,977 posts)
    +1

    It is a positive thing to have OpenCL able to run on a CPU, but for animal speed, OpenCL on a GPU would win.

    That is why I think that if Havok physics is able to run on GPUs via OpenCL, Havok will only give us better performance than it currently does running on CPUs.

  8. #33 · Drwho? (Xtreme Enthusiast · joined Dec 2007 · 816 posts)
    Quote Originally Posted by trinibwoy View Post
    Ask yourself this. Why would you need OpenCL on a CPU if you have C/C++? These languages are designed to take advantage of highly parallel machines, they provide no benefit on a CPU over existing languages.
    The answer to your question is simple ... OpenCL and DirectCompute are abstraction layers that facilitate vectorization and parallelization.

    When you feed the loops to a CPU, it is surprisingly fast on OpenCL.
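    (As an illustration of that abstraction, a hypothetical OpenCL kernel is just the body of a data-parallel loop; the runtime is free to vectorize and thread it across SSE lanes on a CPU or across stream processors on a GPU. The kernel name and wrapper below are invented for the example:)

    ===================================
    #include <string>

    // Hypothetical example kernel: one work-item per array element.
    // The OpenCL runtime decides how to vectorize and schedule this
    // loop body, whether the device is a CPU or a GPU.
    const std::string kSaxpySource = R"(
    __kernel void saxpy(float a,
                        __global const float* x,
                        __global float* y)
    {
        size_t i = get_global_id(0);  // this work-item's loop index
        y[i] = a * x[i] + y[i];
    }
    )";
    ===================================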
    The geniuses of video games almost all agree: they want many cores, with caches, fast interconnect, and fast RAM ... This architecture is closer to a CPU than a GPU: http://graphics.cs.williams.edu/arch...TimHPG2009.pdf
    Tim makes a lot of sense in his presentation ... the title is a little misleading, so move into the presentation and you'll understand his overall point ...
    GPUs will have to become so close to CPUs that instructions per clock will become the Holy Grail again ... and we all know who can build the best instructions-per-clock architecture ... humm hummm ;-) the x86 vendors ...
    IPC is what drives your power savings: good instructions per clock drives your power down, because you need fewer MHz to achieve the same performance ...
    Pentium 4 vs Athlon 64 and then Core 2 illustrate the path ... apply this to GPUs converging toward CPUs.

    Does that clarify it for you?

  9. #34 · trinibwoy (Xtreme Addict · joined Apr 2007 · 1,870 posts)
    Quote Originally Posted by Drwho? View Post
    The answer to your question is simple ... OpenCL and DirectCompute are abstraction layers that facilitate vectorization and parallelization. When you feed the loops to a CPU, it is surprisingly fast on OpenCL.
    Which you can already do in C++. Why go through some additional layer coded by a third party that will invariably make your code slower? The entire reason for the existence of OpenCL is to make GPU programming easier.

    The geniuses of video games almost all agree: they want many cores, with caches, fast interconnect, and fast RAM ... This architecture is closer to a CPU than a GPU: http://graphics.cs.williams.edu/arch...TimHPG2009.pdf
    Thanks, but I've already read that presentation thoroughly and posted it here. I'm not sure how you arrived at the arbitrary conclusion that the future of computing looks more like a CPU than a GPU. Lots of cores, fast interconnects, and fast RAM are all GPU qualities. So even by your own list, your statement is false.

    Does that clarify it for you?
    Not sure what point you're trying to make. x86 is far from the most efficient architecture in terms of IPC. And big caches are the exact opposite of what you want in a throughput architecture.

  10. #35 · Drwho? (Xtreme Enthusiast · joined Dec 2007 · 816 posts)
    Quote Originally Posted by trinibwoy View Post
    Which you can already do in C++. Why go through some additional layer coded by a third party that will invariably make your code slower? The entire reason for the existence of OpenCL is to make GPU programming easier.

    Are you a programmer? Try to make n! work on the GPU, or any function with a recurrence,
    like f(n) = G(f(n-1)) ...
    Then you'll figure out that if n is a vector, and G a complex function, the CPU is faster than the GPU at this kind of processing ...
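    (For illustration, a minimal C++ sketch of that recurrence, with G as an arbitrary stand-in for the "complex function"; every iteration needs the previous result, so the chain across n cannot be split up:)

    ===================================
    #include <cmath>
    #include <cstdio>

    // Arbitrary stand-in for a "complex function" G.
    double G(double x) { return std::cos(x) + 0.5 * x; }

    // f(n) = G(f(n-1)): each step consumes the previous result,
    // so there is no parallelism across the iterations themselves.
    double f(int n)
    {
        double x = 1.0;             // f(0), arbitrary seed
        for (int i = 1; i <= n; ++i)
            x = G(x);
        return x;
    }

    int main() { std::printf("f(1000) = %f\n", f(1000)); }
    ===================================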

    Try to write a ray tracer on a GPU ... you very quickly meet the problem that some rays share bounces for a while, but then take different paths ... so, when they share the path, you work GPU-like ... when they don't, you use the same OpenCL, DirectCompute, or Larrabee instruction stream, but with a single vector ...
    You need both, CPU-like and GPU-like ... your black-and-white vision is too simplistic ...
    Intel chose to have x86 on both paths, AMD wants an OpenCL layer ... and NV will be leaning heavily on Nehalem cores if they want to stay competitive ... lol.



    I am not talking about fake ray tracing ... it is supposed to look like this (thanks to the povray guys for their restless work!):

    [image: POV-Ray render, rendered on the CPU]
    you can render it yourself: http://www.oyonale.com/modeles.php?page=40 for the files, and http://povray.org/beta/ for the program ... it takes a very long time; that is the price to pay for 100% independent rays...
    Last edited by Drwho?; 08-15-2009 at 09:57 AM.

  11. #36 · trinibwoy (Xtreme Addict · joined Apr 2007 · 1,870 posts)
    Quote Originally Posted by Drwho? View Post
    Are you a programmer?
    Yes, by profession.

    A factorial is actually trivially parallelizable. It's a simple reduction that would be blazing fast on a GPU. It'll be similar to a merge sort: simply divide and conquer.
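    (A minimal CPU-side sketch of that divide-and-conquer reduction, with std::async standing in for GPU threads; the function name is invented for the example, and plain 64-bit integers overflow past 20!, so a serious version needs a big-number type:)

    ===================================
    #include <cstdint>
    #include <cstdio>
    #include <future>

    // Divide-and-conquer product of the integers in [lo, hi]:
    // the two halves are independent and can run in parallel.
    std::uint64_t rangeProduct(std::uint64_t lo, std::uint64_t hi)
    {
        if (hi - lo < 4)            // small range: just multiply serially
        {
            std::uint64_t p = 1;
            for (std::uint64_t i = lo; i <= hi; ++i)
                p *= i;
            return p;
        }
        std::uint64_t mid = lo + (hi - lo) / 2;
        auto left = std::async(std::launch::async, rangeProduct, lo, mid);
        return rangeProduct(mid + 1, hi) * left.get();
    }

    int main()
    {
        // 20! is the largest factorial that fits in 64 bits.
        std::printf("20! = %llu\n", (unsigned long long)rangeProduct(1, 20));
    }
    ===================================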

    Your comments on ray-tracing are also invalid. A GPU can produce raytraced renders that would match a CPU version pixel for pixel. The primary issue with GPU raytracing is the random memory access and divergent code paths when traversing the scene structure. This makes things slow, not impossible as you imply. And with techniques such as packet tracing and persistent threads GPUs are already attacking this problem and it will only get better with the upcoming crop of hardware.
    Last edited by trinibwoy; 08-15-2009 at 09:50 AM.

  12. #37 · Xtreme Member (joined Jun 2008 · Finland · 111 posts)
    [ot]
    Quote Originally Posted by Drwho? View Post
    it is supposed to look like this (thanks to the povray guys for their restless work!):
    Could you make the resolution a bit smaller please? Big pictures make the forum a lot harder to read for those of us at 1024 res.
    [/ot]

  13. #38 · trinibwoy (Xtreme Addict · joined Apr 2007 · 1,870 posts)
    Dr. Who, try doing 1,000,000! using Windows Calculator and tell me how long it takes.

  14. #39 · Drwho? (Xtreme Enthusiast · joined Dec 2007 · 816 posts)
    Quote Originally Posted by trinibwoy View Post
    Yes, by profession.

    A factorial is actually trivially parallelizable. It's a simple reduction that would be blazing fast on a GPU. It'll be similar to a merge sort: simply divide and conquer.

    Your comments on ray-tracing are also invalid. A GPU can produce raytraced renders that would match a CPU version pixel for pixel. The primary issue with GPU raytracing is the random memory access and divergent code paths when traversing the scene structure. This makes things slow, not impossible as you imply. And with techniques such as packet tracing and persistent threads GPUs are already attacking this problem and it will only get better with the upcoming crop of hardware.
    Hehehe ... we will see. I would not be so confident in your statement if I were you ...

    Please show me an n! faster on the GPU than on the CPU to start with ...
    You forgot a little issue in parallelizing the factorial ... the branching speed, isn't it? Each local thread needs it ...

    So, show us how professional you are: factorial is a very simple problem ... code it on the GPU for tomorrow ... I'll do it on the CPU and we compare?

    We can start from recursive or not ... I'll use the non-recursive ...



    ===================================
    // Recursive approach -- determineFactorialRecursive calls itself
    // until it reaches the base case.
    int determineFactorialRecursive(int a)
    {
        if (a == 1)
            return 1;               // factorial of 1 == 1
        else
            return a * determineFactorialRecursive(a - 1);
    }

    // Another way: iterative, consuming two factors per loop iteration.
    int DetermineFactorialOther(int myFact)
    {
        int value;
        int value2 = 1;

        for (int i = myFact; i > 1; i--)
        {
            value = (i * (i - 1));
            value2 *= value;
            i--;                    // skip the factor already used above
        }

        return value2;
    }
    ===================================


    here is some help for you:
    http://www.luschny.de/math/factorial/Benchmark.html
    The algorithm for parallel processing is there.

    We will use 64,000,000! as the goal, since it is fairly reasonable ...
    You'll post your code using your GPU tomorrow; I'll post mine too.

    Since it is trivial ... I don't see why you cannot do it for tomorrow.

    deal?

  15. #40 · W1zzard (Xtreme Legend · joined Jan 2003 · Stuttgart, Germany · 929 posts)
    good example, but there's an integer overflow in your sample code. looking forward to seeing your cpu implementation tomorrow. for the sake of the discussion the overflow shouldn't pose a problem though. adding arbitrary-length number support would complicate things even more because of the memory bandwidth and caches involved

    64! = 126886932185884164103433389335161480802865516174545192198801894375214704230400000000000000

    took me 0.34 seconds using an outside-of-the-box-thinking search algorithm
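    (For reference, a schoolbook big-number factorial on the CPU is only a few lines. This is a plain sketch with limbs in base 10^9, emphatically not W1zzard's search algorithm, but it reproduces the 64! digits above:)

    ===================================
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Arbitrary-precision factorial: digits stored little-endian in
    // base 10^9, each factor multiplied in with carry propagation.
    std::vector<std::uint32_t> bigFactorial(std::uint32_t n)
    {
        const std::uint64_t BASE = 1000000000;
        std::vector<std::uint32_t> limbs{1};
        for (std::uint32_t f = 2; f <= n; ++f)
        {
            std::uint64_t carry = 0;
            for (auto& limb : limbs)
            {
                std::uint64_t cur = (std::uint64_t)limb * f + carry;
                limb  = (std::uint32_t)(cur % BASE);
                carry = cur / BASE;
            }
            while (carry)
            {
                limbs.push_back((std::uint32_t)(carry % BASE));
                carry /= BASE;
            }
        }
        return limbs;
    }

    int main()
    {
        std::vector<std::uint32_t> d = bigFactorial(64);
        std::printf("64! = %u", d.back());              // top limb, no padding
        for (auto it = d.rbegin() + 1; it != d.rend(); ++it)
            std::printf("%09u", *it);                   // remaining limbs, zero-padded
        std::printf("\n");
    }
    ===================================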
    Last edited by W1zzard; 08-15-2009 at 10:27 AM.

  16. #41 · Drwho? (Xtreme Enthusiast · joined Dec 2007 · 816 posts)
    Quote Originally Posted by W1zzard View Post
    good example, but there's an integer overflow in your sample code. looking forward to seeing your cpu implementation tomorrow. for the sake of the discussion the overflow shouldn't pose a problem though
    Well, don't worry about the overflow ... n! is a hobby of mine, and I will not cheat; I'll create an app in MFC ...
    Done with the UI:
    Last edited by Drwho?; 08-15-2009 at 10:27 AM.

  17. #42 · W1zzard (Xtreme Legend · joined Jan 2003 · Stuttgart, Germany · 929 posts)
    i love mfc, but linux just prevails in some situations

    # time echo "define f(x) { if (x>1) { return (x*f(x-1)) }; return (1) }; f(16384)" | bc

    .. lots of numbers ..

    real 0m7.470s
    user 0m5.829s
    sys 0m0.090s

    simpler:

    # time seq -s \* 16384 | bc

    real 0m8.680s
    user 0m5.858s
    sys 0m0.088s
    Last edited by W1zzard; 08-15-2009 at 10:37 AM.

  18. #43 · Gautam (Admin · joined Dec 2003 · Hillsboro, OR · 5,225 posts)
    Quote Originally Posted by Drwho? View Post
    Hehehe ... we will see. I would not be so confident in your statement if I were you ...

    Please show me an n! faster on the GPU than on the CPU to start with ...
    You forgot a little issue in parallelizing the factorial ... the branching speed, isn't it? Each local thread needs it ...

    So, show us how professional you are: factorial is a very simple problem ... code it on the GPU for tomorrow ... I'll do it on the CPU and we compare?

    We can start from recursive or not ... I'll use the non-recursive ...
    Why'd you list such a crummy recursive algorithm?

  19. #44 · Drwho? (Xtreme Enthusiast · joined Dec 2007 · 816 posts)
    Quote Originally Posted by W1zzard View Post
    i love mfc, but linux just prevails in some situations

    # time echo "define f(x) { if (x>1) { return (x*f(x-1)) }; return (1) }; f(16384)" | bc

    .. lots of numbers ..

    real 0m7.470s
    user 0m5.829s
    sys 0m0.090s
    Well, I am using Hai Yi's simple code ... http://www.codeguru.com/Cpp/Cpp/algo...icle.php/c2041 ... the time to install Linux would have been more than opening MSVC ... heheheh ...
    Let's see if the CUDA or OpenCL version will come before next year
    And he said 1,000,000! ... so, he will have to provide support for numbers bigger than long long on the GPU ... I guess he will be able to file for a PhD after this ... lol

  20. #45 · W1zzard (Xtreme Legend · joined Jan 2003 · Stuttgart, Germany · 929 posts)
    Quote Originally Posted by Gautam View Post
    Why'd you list such a crummy recursive algorithm?
    it is a very primitive algorithm that should be really easy to implement on the gpu, no?

    Quote Originally Posted by Drwho? View Post
    Let's see if the CUDA or OpenCL version will come before next year
    don't think so... unless you talk to nvidia marketing: they fix up some MDF, get one of their devs to write it, give it to the OP and stick a 'the way it's meant to be played' logo on it

    Quote Originally Posted by Drwho? View Post
    Well, I am using Hai Yi's simple code ... http://www.codeguru.com/Cpp/Cpp/algo...icle.php/c2041 ...
    your thoughts on using strings to store the digits instead of a list? char* s++ / s-- seems cheaper than dereferencing list-node pointers?
    Last edited by W1zzard; 08-15-2009 at 10:48 AM.

  21. #46 · Drwho? (Xtreme Enthusiast · joined Dec 2007 · 816 posts)
    Quote Originally Posted by W1zzard View Post
    it is a very primitive algorithm that should be really easy to implement on the gpu, no?



    don't think so... unless you talk to nvidia marketing: they fix up some MDF, get one of their devs to write it, give it to the OP and stick a 'the way it's meant to be played' logo on it



    your thoughts on using strings to store the digits instead of a list? char* s++ / s-- seems cheaper than dereferencing list-node pointers?
    This method is shown to bachelor students ... as an exercise ... using the first result returned by Google :P Even this, a GPU will not beat ... no GPU yet can branch fast enough.

  22. #47 · W1zzard (Xtreme Legend · joined Jan 2003 · Stuttgart, Germany · 929 posts)
    Quote Originally Posted by Drwho? View Post
    Even this, a GPU will not beat ... no GPU yet can branch fast enough.
    agreed, but there are some problems where gpu computing does have benefits. people must just realize that this is only a very very very small subset of computational problems

  23. #48 · Drwho? (Xtreme Enthusiast · joined Dec 2007 · 816 posts)
    Quote Originally Posted by W1zzard View Post
    agreed, but there are some problems where gpu computing does have benefits. people must just realize that this is only a very very very small subset of computational problems
    Almost done coding it ... I want to go enjoy the California sun while it is running

  24. #49 · Drwho? (Xtreme Enthusiast · joined Dec 2007 · 816 posts)
    Quote Originally Posted by W1zzard View Post
    agreed, but there are some problems where gpu computing does have benefits. people must just realize that this is only a very very very small subset of computational problems
    Most of the computing world is iterative, and using the result of another process is the rule ... Too many people think the GPU can do it all; it is absolutely heretical thinking ...
    We did not put all of this hardware in the processor for free ... and whoever claimed in his Q2 2009 earnings call that the GPU can do it all is somebody who has lost contact with reality.

    Branching while processing is still the Holy Grail, then instructions per clock ...
    I am reinstalling quickly without the Intel Compiler ... just to prove that the CPU can do so much better ... even on a "trivially parallelizable" algorithm.

    If you want to notice the development-speed difference: I will have finished a system able to calculate n! over 1000 in a few minutes ... or an hour or so ...
    Doing it in CUDA would take days ...
    The propaganda about GPGPU always forgets those factors ... it only shows you one very little corner of the world.

    I added a CRichEditCtrl to display the results ... like this, I can copy and paste the results ... without even writing a line of code ... Do this with your GPU ... hahahahha

  25. #50 · trinibwoy (Xtreme Addict · joined Apr 2007 · 1,870 posts)
    Quote Originally Posted by Drwho? View Post
    Most of the computing world is iterative, and using the result of another process is the rule ... Too many people think the GPU can do it all; it is absolutely heretical thinking ...
    From day one, GPU computing has been about accelerating parallelizable problems, not "doing it all". Do you get some joy from spreading this nonsense?

    Quote Originally Posted by Drwho? View Post
    Since it is trivial ... I don't see why you cannot do it for tomorrow.
    Yeah, because I'm gonna buy Visual Studio, learn to program in CUDA, and implement a factorial algorithm all by tomorrow. Btw, how is your little algorithm up there dealing with precision? There's no way an "int" can store the result of 1,000,000!

    The entire premise of your argument is false. Factorial isn't a strictly iterative calculation, as the multiplications of every two numbers in the sequence can happen in parallel. For example, 10! using 4 threads:

    Clock #1
    Thread 1: 10*9
    Thread 2: 8*7
    Thread 3: 6*5
    Thread 4: 4*3
    Clock #2
    Thread 1: 90*56
    Thread 2: 30*12
    Clock #3
    Thread 1: 5040 * 360
    Clock #4
    Thread 1: 1814400 * 2

    Only 4 clocks on a theoretical 4-core processor. How is your naive iterative algorithm supposed to be faster when it takes 8 clocks? You really need to read up on parallel scans; there isn't even any branching in this algorithm, so I'm really curious where you're getting all of this bad info from.
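    (The same tree of multiplies can be sketched with an OpenMP reduction; this is hypothetical illustration code compiled with -fopenmp, not trinibwoy's own, and it stays within 64-bit range for 10!:)

    ===================================
    #include <cstdint>
    #include <cstdio>

    int main()
    {
        std::uint64_t fact = 1;

        // OpenMP splits the factors 2..10 across the threads and then
        // combines the partial products pairwise, much like the
        // clock-by-clock schedule above.
        #pragma omp parallel for reduction(*:fact)
        for (std::uint64_t i = 2; i <= 10; ++i)
            fact *= i;

        std::printf("10! = %llu\n", (unsigned long long)fact);
        return 0;
    }
    ===================================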
    Last edited by trinibwoy; 08-15-2009 at 12:11 PM.
