Page 6 of 7
Results 126 to 150 of 175

Thread: AMD does reverse GPGPU, announces OpenCL SDK for x86

  1. #126
    Xtreme Enthusiast
    Join Date
    Feb 2005
    Posts
    970
    Quote Originally Posted by Chad Boga View Post
    Try reading what I wrote again, you might be able to reach the correct conclusion this time.
    Fine, then you tell me. Or did you mean that it only matters to big spenders, and us little-fish enthusiasts should feel privileged to pay a little more if it has Intel branding??


    It never ceases to amaze me how Intel's rivals never fail to execute properly, never have botched products, are never behind on performance, it is always Intel's supposed monopoly status that holds back these other more deserving companies.
    It never ceases to amaze me how Intel sympathizers continue to use that failed argument when plain common sense, as well as 2 countries and the EU, say that Intel's monopoly has not only all but locked out the competition, but hurt consumers in the process.

  2. #127
    Banned
    Join Date
    Aug 2008
    Posts
    1,052
    Quote Originally Posted by flippin_waffles View Post
    Fine, then you tell me. Or did you mean that it only matters to big spenders, and us little-fish enthusiasts should feel privileged to pay a little more if it has Intel branding??
    I know that whenever Dr.Who posts, because he is an Intel employee, we are guaranteed a response from you like one of Pavlov's dogs turned rabid, but he was suggesting that when an Intel product costs 1/10th as much as an IBM product, the Intel product will be quite compelling on price/performance.

    How you managed to turn that into the illogical rant you did is no credit to you.

    It never ceases to amaze me how Intel sympathizers continue to use that failed argument when plain common sense, as well as 2 countries and the EU, say that Intel's monopoly has not only all but locked out the competition, but hurt consumers in the process.
    The EU's decision is the only one of any weight; the other countries only gave Intel a slap on the wrist.

    But what's more, now that Intel has been stopped from these nefarious practices, where is the increased marketshare for AMD that we have been told would certainly be there, if not for Intel using their rebates?

    Is it due to Intel rebates that AMD is so lacking in a decent mobile offering, the fastest growing segment of the CPU sector?

    And as for brand strength, I really don't see the Lynnfield generation as being that great an advance over the i7 offerings we have had for this year already, but despite that, it looks pretty clear to me that these Lynnfield processors are going to decimate AMD's marketshare in Q4 and Q1.

    If that does happen, what else does it point to, other than that brand recognition/strength matters?

  3. #128
    Xtreme Enthusiast
    Join Date
    Dec 2007
    Posts
    816
    Quote Originally Posted by [XC] riptide View Post
    Who. Newsflash. This has been done on CUDA for a while now.





    http://forums.overclockers.ru/viewto...0339&start=120

    But I'll concede, from my limited knowledge of algorithms and my good knowledge of maths, that beyond 1,000,000! the GPU would soon hit a memory problem. A Tesla card would help.
    Amazing, the code I did (in 3 hours) is amazingly faster than what was used as the reference for the CPU... hehehehe... thanks for making my point even better... the comparison is flawed. Let's get this binary and run a real contest.
    I kept the threaded version in my back pocket just in case... ;-)
    Get the binary and compare!
    Last edited by Drwho?; 08-29-2009 at 10:13 PM.
    DrWho, The last of the time lords, setting up the Clock.

  4. #129
    Registered User
    Join Date
    Aug 2008
    Posts
    8
    Quote Originally Posted by Drwho? View Post
    Amazing, the code I did (in 3 hours) is amazingly faster than what was used as the reference for the CPU... hehehehe... thanks for making my point even better... the comparison is flawed. Let's get this binary and run a real contest.
    I kept the threaded version in my back pocket just in case... ;-)
    Get the binary and compare!
    Are you sure about that? Your code takes 43735260ms for 1000000!, and the time taken is roughly proportional to N^2 (the growth factor is a little larger due to a log N term, but it's fairly irrelevant).

    So that gives an approximate time of 2733453ms for 250000!, which is actually much worse than any of the CPU times quoted here.
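    As a rough sketch of that scaling estimate (assuming run time grows purely as N^2 and ignoring the log N term; the function name estimate_ms is made up for illustration):

```cpp
// Scale a measured run time to a different N, assuming time ~ N^2.
// estimate_ms is a hypothetical helper, not part of anyone's posted code.
double estimate_ms(double measured_ms, double measured_n, double target_n)
{
    double ratio = target_n / measured_n;   // e.g. 250000 / 1000000 = 0.25
    return measured_ms * ratio * ratio;     // quadratic scaling
}
```

    Calling estimate_ms(43735260, 1000000, 250000) gives the figure used here (2733453.75 before truncation).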

    Without wanting to cause offense, the code you've provided does seem a lot slower than it ought to be - here's my "5 minute hack":

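    Code:
    #include <cstdio>
    #include <vector>

    int main()
    {
        const int N = 250000;           // compute N!
        const int m = 1000000;          // base: each element holds 6 decimal digits
        std::vector<int> a(1000000);    // least significant element first
        a[0] = 1;
        int L = 1;                      // number of elements in use
        for (int i = 2; i <= N; i++)
        {
            if (i % 1000 == 0)
                printf("%d  \r", i);    // progress indicator
            int C = 0;                  // carry into the next element
            long long I = i, M = m;     // 64-bit so a[j] * I cannot overflow
            for (int j = 0; j < L; j++)
            {
                long long V = a[j] * I + C;
                C = (int)(V / M);
                a[j] = (int)(V - M * C);
            }
            if (C)
                a[L++] = C;             // the number grew by one element
        }
        printf("%d", a[L - 1]);         // top element: no zero padding
        for (int i = L - 2; i >= 0; i--)
            printf("%06d", a[i]);
        printf("\n");
        return 0;
    }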
    which takes only about 550000 ms for 250000! (so under 3 hours for 1000000!). This is on a stock i7 920.

    I know you're much more of an x86 expert than I am, so maybe I'm missing something?
    Last edited by David F; 08-30-2009 at 03:05 AM. Reason: fixed tabs in [code]

  5. #130
    I am Xtreme
    Join Date
    Jul 2007
    Location
    Austria
    Posts
    5,485
    lol chad why do you even bother to reply to a flamebait?

  6. #131
    Moderator
    Join Date
    Mar 2006
    Posts
    8,556
    Quote Originally Posted by Drwho? View Post
    Amazing, the code I did (in 3 hours) is amazingly faster than what was used as the reference for the CPU... hehehehe... thanks for making my point even better... the comparison is flawed. Let's get this binary and run a real contest.
    I kept the threaded version in my back pocket just in case... ;-)
    Get the binary and compare!
    Who. Your code is bad. On my 3.99 GHz E8500, with 2 threads, I did 1000000! in 6m 50.236s. It took you 12 hours, and I presume that was on an i7. You were digging your own hole from the moment you started the argument. You should have started out with the latest multithreaded algorithms instead of cooking up your own basic recursive one from a search. That recursive algorithm is the first lesson in any C++ tutorial on recursion. Now that we already have a GPU-accelerated app for factorials, this is becoming a little silly. I see goal posts being moved in the following discussion. BTW, in the chart I listed, the "," is the decimal point, as is often used in non-English continental Europe. The good news (for you) is that Intel seems to be ahead of AMD CPUs on this one.

    Last edited by [XC] riptide; 08-30-2009 at 04:35 AM.

  7. #132
    Moderator
    Join Date
    Mar 2006
    Posts
    8,556
    Chad and Waffles.... be cool.

  8. #133
    Registered User
    Join Date
    Aug 2008
    Posts
    8
    Quote Originally Posted by [XC] riptide View Post
    Who. Your code is bad. On my 3.99 GHz E8500, with 2 threads, I did 1000000! in 6m 50.236s. It took you 12 hours, and I presume that was on an i7. You were digging your own hole from the moment you started the argument. You should have started out with the latest multithreaded algorithms instead of cooking up your own basic recursive one from a search.
    Good advice. Have you seen this page: http://www.luschny.de/math/factorial/Benchmark.html ?

    1000000! in 5.1 seconds on an AMD-64! I'm guessing that could go sub 1 second on a fast i7.

    I don't know how hard it would be to produce a CUDA implementation, but my gut feeling is it would be tough (you'd need to implement true bignum multiplication, I'm sure).

    Edit: looking at the gains got from 2 threads, sub 1 second might be a bit optimistic.
    Last edited by David F; 08-30-2009 at 09:54 AM.

  9. #134
    Xtreme Enthusiast
    Join Date
    Dec 2007
    Posts
    816
    Well, guys, when you play poker, do you show your cards on the first turn? lol...
    So, who wants to race?
    Now that I have the CUDA camp bragging, let's try.

    So, this is the best shot for the GPU, everybody agrees?
    Last edited by Drwho?; 08-30-2009 at 12:04 PM.
    DrWho, The last of the time lords, setting up the Clock.

  10. #135
    Moderator
    Join Date
    Mar 2006
    Posts
    8,556
    Quote Originally Posted by Drwho? View Post
    Well, guys, when you play poker, do you show your cards on the first turn? lol...
    So, who wants to race?
    Now that I have the CUDA camp bragging, let's try.

    So, this is the best shot for the GPU, everybody agrees?
    There is no 'CUDA' camp. There's you taking 12 hours to do 1000000!... and that's all, really.

    I also do a bit of work on the Collatz conjecture (the HOTPO problem) with Jon Sonntag's project 'Collatz at Home', if you're familiar with the Collatz conjecture: http://en.wikipedia.org/wiki/Collatz_conjecture http://boinc.thesonntags.com/collatz/index.php

    The same work takes several times longer on an x86 processor than on either ATI's Stream or NVIDIA's CUDA, and it has minimal need for the CPU at all, unlike the factorial problem.
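    For anyone not familiar with it, the HOTPO ("half or triple plus one") rule is easy to sketch. This is only a minimal illustration of the rule itself, not the code the Collatz at Home project runs; the name collatz_steps is made up here, and it assumes the sequence stays within 64 bits:

```cpp
#include <cstdint>

// Count the HOTPO steps needed to reach 1 from n (n must be >= 1).
// Assumes intermediate values fit in 64 bits, which holds for small n.
std::uint64_t collatz_steps(std::uint64_t n)
{
    std::uint64_t steps = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;  // half, or triple plus one
        ++steps;
    }
    return steps;
}
```

    Each starting value can be checked independently of every other one, which is presumably why the work spreads so well across GPU threads. As a small check, 6 goes 6, 3, 10, 5, 16, 8, 4, 2, 1, so collatz_steps(6) is 8.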
    Last edited by [XC] riptide; 08-30-2009 at 12:26 PM.

  11. #136
    Xtreme Enthusiast
    Join Date
    Dec 2007
    Posts
    816
    Quote Originally Posted by [XC] riptide View Post
    There is no 'CUDA' camp. There's you taking 12 hours to do 1000000!... and that's all, really.
    Yep... code done in 3 hours, copied and pasted into an MFC wrapper off the internet... so now, let's see if you can beat my optimized code...

    So, this is the best shot for factorial on CUDA? Everybody agrees?
    DrWho, The last of the time lords, setting up the Clock.

  12. #137
    Moderator
    Join Date
    Mar 2006
    Posts
    8,556
    Quote Originally Posted by Drwho? View Post

    So, this is the best shot for factorial on CUDA? Everybody agrees?
    What is the best shot?

  13. #138
    Registered User
    Join Date
    Aug 2008
    Posts
    8
    Quote Originally Posted by Drwho? View Post
    Yep... code done in 3 hours, copied and pasted into an MFC wrapper off the internet... so now, let's see if you can beat my optimized code...
    So, what algorithm are you using?

    So, this is the best shot for factorial on CUDA? Everybody agrees?
    Best shot for code currently available, possibly. They're fairly obviously not using one of the efficient algorithms - so the question is whether it's even possible to implement one on CUDA.

    Having now researched some algorithms, a key question, I think, is whether there's a fast "BigInt" multiply available for CUDA? Googling, it looks like quite a few people have tried, but I don't see any evidence any have succeeded.

  14. #139
    Xtreme Addict
    Join Date
    Apr 2008
    Location
    Texas
    Posts
    1,663
    Does CUDA factorial run slower on Intel CPUs for some reason?

    Core i7 2600K@4.6Ghz| 16GB G.Skill@2133Mhz 9-11-10-28-38 1.65v| ASUS P8Z77-V PRO | Corsair 750i PSU | ASUS GTX 980 OC | Xonar DSX | Samsung 840 Pro 128GB |A bunch of HDDs and terabytes | Oculus Rift w/ touch | ASUS 24" 144Hz G-sync monitor

    Quote Originally Posted by phelan1777 View Post
    Hail fellow warrior albeit a surat Mercenary. I Hail to you from the Clans, Ghost Bear that is (Yes freebirth we still do and shall always view mercenaries with great disdain!) I have long been an honorable warrior of the mighty Warden Clan Ghost Bear the honorable Bekker surname. I salute your tenacity to show your freebirth sibkin their ignorance!

  15. #140
    Registered User
    Join Date
    Aug 2008
    Posts
    8
    Quote Originally Posted by Mechromancer View Post
    Does CUDA factorial run slower on Intel CPUs for some reason?
    Well, it certainly runs faster when you only ask for 100000! instead of 1000000!...

  16. #141
    Moderator
    Join Date
    Mar 2006
    Posts
    8,556
    Quote Originally Posted by Mechromancer View Post
    Does CUDA factorial run slower on Intel CPUs for some reason?

    Buddy. Try 4 threads next time and something higher than 100,000

  17. #142
    Xtreme Mentor
    Join Date
    Nov 2005
    Location
    Devon
    Posts
    3,437
    Quote Originally Posted by Mechromancer View Post
    Does CUDA factorial run slower on Intel CPUs for some reason?

    Psss! You're missing a ZERO there
    RiG1: Ryzen 7 1700 @4.0GHz 1.39V, Asus X370 Prime, G.Skill RipJaws 2x8GB 3200MHz CL14 Samsung B-die, TuL Vega 56 Stock, Samsung SS805 100GB SLC SDD (OS Drive) + 512GB Evo 850 SSD (2nd OS Drive) + 3TB Seagate + 1TB Seagate, BeQuiet PowerZone 1000W

    RiG2: HTPC AMD A10-7850K APU, 2x8GB Kingstone HyperX 2400C12, AsRock FM2A88M Extreme4+, 128GB SSD + 640GB Samsung 7200, LG Blu-ray Recorder, Thermaltake BACH, Hiper 4M880 880W PSU

    SmartPhone Samsung Galaxy S7 EDGE
    XBONE paired with 55'' Samsung LED 3D TV

  18. #143
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    OMG, 1,000,000! on both CPU and GPU. Shaders are actually at 1450 and it's a 192.

  19. #144
    Xtreme Enthusiast
    Join Date
    Dec 2007
    Posts
    816
    I use the same algo as what I posted, except that I use 64-bit numbers instead of 8-bit ones, with a 64-bit carry too, and it is all threaded.
    I was trying to see if I could use the SSE2 registers to process 2 x 64 bits, with the super shuffle engine forwarding the carry up the stack. That one is a bad idea... I am finishing the SSE2 version, then let's compare... I will be able to claim X times faster than CUDA... I'll probably have it finished in a few days...
    I was still unhappy, as I was only using 64 bits out of the 128... so here is the interesting part, when you are calculating the 8 partials:
    for (cpu=0;cpu<8;cpu++)
    {
        _param->TM=1+cpu*N/8;
        _param->TN=(cpu+1)*N/8;
        _param->ps=ps[cpu];
        T[cpu]=AfxBeginThread((AFX_THREADPROC)Thread1,_param,THREAD_PRIORITY_NORMAL);
    }
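    To make the split concrete, here is a minimal sketch of how those bounds carve 1..N into 8 contiguous ranges, one partial product each. It is not the MFC code itself: chunk_lo, chunk_hi, partial_product and factorial_from_partials are hypothetical names, the partials are computed one after another rather than on threads, and plain 64-bit integers stand in for the bignum lists, so it only works up to 20!:

```cpp
#include <cstdint>

const int kParts = 8;  // one partial product per logical CPU, as in the loop above

// Same bounds as TM and TN in the loop above: chunk `cpu` covers lo..hi inclusive.
int chunk_lo(int cpu, int n) { return 1 + cpu * n / kParts; }
int chunk_hi(int cpu, int n) { return (cpu + 1) * n / kParts; }

// Product of lo..hi (1 if the range is empty). Each call is independent of
// the others, so each one could be handed to its own worker thread.
std::uint64_t partial_product(int lo, int hi)
{
    std::uint64_t p = 1;
    for (int i = lo; i <= hi; i++)
        p *= static_cast<std::uint64_t>(i);
    return p;
}

// Combine the 8 partials. Valid only while n! fits in 64 bits (n <= 20).
std::uint64_t factorial_from_partials(int n)
{
    std::uint64_t result = 1;
    for (int cpu = 0; cpu < kParts; cpu++)
        result *= partial_product(chunk_lo(cpu, n), chunk_hi(cpu, n));
    return result;
}
```

    With N = 1000000 the first chunk is 1..125000, the second 125001..250000, and so on; the 8 partial results then still have to be multiplied together, which is where a bignum multiply comes in.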
    The last optimization I am trying to get in is the function:
    int Multiply(Node *B,Node *A,Node *C)
    I believe I can write code that uses the low 64 bits to multiply 2 lists at a time, making full use of the 128-bit registers and the dual execution units of the CPU core...

    Who wants to put some $ on the table ...

    I am tightening up some parts with _ASM{}, just for the hell of it...
    PS: I added David's version to it as a reference, one-button MFC...

    Francois
    Last edited by Drwho?; 08-30-2009 at 01:38 PM.
    DrWho, The last of the time lords, setting up the Clock.

  20. #145
    Moderator
    Join Date
    Mar 2006
    Posts
    8,556
    Quote Originally Posted by Drwho? View Post

    Who wants to put some $ on the table ...
    Poker talk eh? Your hand was already down at post 86! And all the posts before it.

  21. #146
    Registered User
    Join Date
    Aug 2008
    Posts
    8
    Quote Originally Posted by Drwho? View Post
    I use the same algo as what I posted, except that I use 64-bit numbers instead of 8-bit ones, with a 64-bit carry too, and it is all threaded.
    Did you see the site I linked to?

    1e6! in 5 seconds (several thousand times faster than your first attempt). To cap it all, it's written in Java!

    Trying to optimize the naive iterative method (which is what I coded as well, don't get me wrong) to beat that is not likely to succeed.

    My immediate thought was "you can't really do better than the obvious iterative solution", but of course you can. A simple "divide and conquer" does better because, although you still have 1000000 multiplies to do, you don't have to do every one of them against a single number with millions of digits.

    But to combine the partial results, you need a fast "bignum" multiply (that can multiply two numbers with 1e6 digits quickly). Lots available for x86. To the best of my knowledge, none available for CUDA.
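    A minimal sketch of that divide-and-conquer structure, to show where the bignum multiply sits. Everything here is illustrative: Big, multiply, range_product, factorial and to_decimal are made-up names, and multiply is the plain schoolbook method, which is NOT fast. With a schoolbook multiply the tree is no quicker than the simple loop; the gain only appears once a sub-quadratic multiply (Karatsuba, FFT-based) is swapped in.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Little-endian limbs in base 10^9 (chosen only so printing is easy).
using Big = std::vector<std::uint32_t>;
constexpr std::uint64_t kBase = 1000000000ULL;

Big from_u32(std::uint32_t v)
{
    Big out{static_cast<std::uint32_t>(v % kBase)};
    if (v >= kBase) out.push_back(static_cast<std::uint32_t>(v / kBase));
    return out;
}

// Schoolbook O(len(a) * len(b)) multiply: a placeholder for a fast one.
Big multiply(const Big& a, const Big& b)
{
    Big out(a.size() + b.size(), 0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        std::uint64_t carry = 0;
        for (std::size_t j = 0; j < b.size(); ++j) {
            std::uint64_t cur = out[i + j] + std::uint64_t(a[i]) * b[j] + carry;
            out[i + j] = static_cast<std::uint32_t>(cur % kBase);
            carry = cur / kBase;
        }
        out[i + b.size()] = static_cast<std::uint32_t>(carry);
    }
    while (out.size() > 1 && out.back() == 0) out.pop_back();  // trim leading zeros
    return out;
}

// lo * (lo+1) * ... * hi, splitting the range in half so that both
// operands of each multiply are of similar size.
Big range_product(std::uint32_t lo, std::uint32_t hi)
{
    if (lo > hi) return Big{1};
    if (lo == hi) return from_u32(lo);
    std::uint32_t mid = lo + (hi - lo) / 2;
    return multiply(range_product(lo, mid), range_product(mid + 1, hi));
}

Big factorial(std::uint32_t n) { return range_product(1, n); }

std::string to_decimal(const Big& x)
{
    std::string s = std::to_string(x.back());       // top limb, no padding
    for (std::size_t i = x.size() - 1; i-- > 0;) {
        std::string limb = std::to_string(x[i]);
        s += std::string(9 - limb.size(), '0') + limb;  // pad to 9 digits
    }
    return s;
}
```

    For example, to_decimal(factorial(20)) gives "2432902008176640000". The two halves of each split are independent, so they could also be computed in parallel before the final multiply.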

    In which case I suspect x86 wins big here. Probably not forever though.

    As is so often the case, choice of algorithm is more important than processing speed. And this is where ease/flexibility of programming comes in - it makes it easier to choose the right algorithm as opposed to thinking "write a bignum multiply for CUDA - youch!".

    [But if someone actually manages to do that for CUDA, I expect it would beat x86].

  22. #147
    Moderator
    Join Date
    Mar 2006
    Posts
    8,556
    Quote Originally Posted by Chumbucket843 View Post
    OMG, 1,000,000! on both CPU and GPU. Shaders are actually at 1450 and it's a 192.
    I'm gonna start a thread in the benchmark section. I'd like to see what some of the guys get there.

  23. #148
    Xtreme Addict
    Join Date
    Apr 2008
    Location
    Texas
    Posts
    1,663
    Quote Originally Posted by [XC] riptide View Post
    Buddy. Try 4 threads next time and something higher than 100,000
    Ah, I was thinking that 1,000,000 was the 100,000 preset. Never mind!

    Last edited by Mechromancer; 08-30-2009 at 02:04 PM.

  24. #149
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by [XC] riptide View Post
    I'm gonna start a thread in the benchmark section. I'd like to see what some of the guys get there.
    I posted that because no one had done a CPU to GPU comparison and I was curious to see what the results would be.

  25. #150
    Xtreme Enthusiast
    Join Date
    Dec 2007
    Posts
    816
    David, in post http://www.xtremesystems.org/forums/...6&postcount=39 I already pointed this page out; I am very aware of this method ;-) ... This is the next point I wanted to make...
    I have half of it in C++ already ... some parts are tricky to convert back to C++.

    Francois
    DrWho, The last of the time lords, setting up the Clock.
