AMD does reverse GPGPU, announces OpenCL SDK for x86

**flippin_waffles** · 08-29-2009, 07:47 PM

Originally Posted by Chad Boga

Try reading what I wrote again, you might be able to reach the correct conclusion this time.

Fine, then you tell me. Or did you mean that it only matters to big spenders, and us little fish enthusiasts should be privileged to pay a little more, if it has Intel branding??

It never ceases to amaze me how Intel's rivals never fail to execute properly, never have botched products, are never behind on performance, it is always Intel's supposed monopoly status that holds back these other more deserving companies.

It never ceases to amaze me how Intel sympathizers continue to use that failed argument when plain common sense, as well as 2 countries and the EU, says that Intel's monopoly has not only all but locked out the competition, but hurt consumers in the process.

**~~Chad Boga~~** · 08-29-2009, 09:37 PM

Originally Posted by flippin_waffles

Fine, then you tell me. Or did you mean that it only matters to big spenders, and us little fish enthusiasts should be privileged to pay a little more, if it has Intel branding??

I know that when Dr.Who posts, because he is an Intel employee, we are guaranteed to get a response from you like one of Pavlov's dogs turned rabid, but he was suggesting that when an Intel product is 1/10th the price of an IBM product, the Intel product will be quite compelling on the basis of price performance.

How you managed to turn that into the illogical rant you did, is no credit to you.

It never ceases to amaze me how Intel sympathizers continue to use that failed argument when plain common sense, as well as 2 countries and the EU, says that Intel's monopoly has not only all but locked out the competition, but hurt consumers in the process.

The EU's decision is the only one of any weight, the other countries gave Intel a slap on the wrist only.

But what's more, now that Intel has been stopped from these nefarious practices, where is the increased marketshare for AMD that we have been told would certainly be there, if not for Intel using their rebates?

Is it due to Intel rebates that AMD is so lacking in a decent mobile offering, the fastest growing segment of the CPU sector?

And as for brand strength, I really don't see the Lynnfield generation as being that great an advance over the i7 offerings we have had for this year already, but despite that, it looks pretty clear to me that these Lynnfield processors are going to decimate AMD's marketshare in Q4 and Q1.

If that does happen, what else does that point to other than brand recognition/strength matters?

**Drwho?** · 08-29-2009, 09:46 PM

Originally Posted by [XC] riptide

Who. Newsflash. This has been done on CUDA for a while now.

http://forums.overclockers.ru/viewto...0339&start=120

But I'll concede that for >1,000,000!, from my limited knowledge of algorithms and my good knowledge of maths, the GPU would soon hit a memory problem. A Tesla card would help.

Amazing, the code I did (in 3 hours) is amazingly faster than what was used as the reference for the CPU... hehehehe... thanks for making my point even better... the comparison is flawed. Let's get this binary and run a real contest.

I kept the threaded version in my back pocket just in case ... ;-)
Get the binary and compare!

**David F** · 08-30-2009, 01:11 AM

Originally Posted by Drwho?

Amazing, the code I did (in 3 hours) is amazingly faster than what was used as the reference for the CPU... hehehehe... thanks for making my point even better... the comparison is flawed. Let's get this binary and run a real contest.

I kept the threaded version in my back pocket just in case ... ;-)
Get the binary and compare!

Are you sure about that? Your code takes 43735260ms for 1000000!, and the time taken is roughly proportional to N^2 (the growth factor is a little larger due to a log N term, but it's fairly irrelevant).

So that gives an approximate time of 2733453ms for 250000!, which is actually much worse than any of the CPU times quoted here.
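The extrapolation above can be checked with a few lines of C++. This is only a sketch: `estimate_ms` is a hypothetical helper name, and it assumes pure N^2 growth, ignoring the log N term mentioned above.

```cpp
#include <cassert>

// Extrapolate a run time under the assumption that time grows as N^2.
// The log N factor is deliberately ignored, so this is a rough estimate.
double estimate_ms(double measured_ms, double n_measured, double n_target)
{
    assert(measured_ms >= 0 && n_measured > 0 && n_target > 0);
    double ratio = n_target / n_measured;
    return measured_ms * ratio * ratio;
}
```

Calling `estimate_ms(43735260.0, 1000000.0, 250000.0)` scales the measured time by (1/4)^2 = 1/16, which is where the 2733453 ms figure comes from.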

Without wanting to cause offense, the code you've provided does seem a lot slower than it ought to be - here's my "5 minute hack":

Code:

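// Naive iterative factorial. The running product is stored in a[],
// least significant element first, six decimal digits per element.
int *a = new int[1000000];
int m = 1000000;	// base: each element holds a value below 10^6
a[0] = 1;
int L = 1;	// number of elements of a[] currently in use
for (int i=2;i<=250000;i++)
{
	if (i%1000 == 0)
		printf("%d  \r", i);	// progress indicator
	int C = 0;	// carry into the next element
	__int64 I = i, M = m;	// 64-bit copies so the multiply below cannot overflow
	for (int j=0;j<L;j++)
	{
		__int64 V = a[j] * I + C;
		C = V / M;
		a[j] = V - M * C;
	}
	if (C)
		a[L++]=C;
}
printf("%d", a[L-1]);	// most significant element: no zero padding
for (int i=L-2;i>=0;i--)
	printf("%06d", a[i]);
printf("\n");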

which only takes about 550000 ms for 250000! (so under 3 hours for 1000000!). This is on a stock i7 920.

I know you're much more of an x86 expert than I am, so maybe I'm missing something?

**Hornet331** · 08-30-2009, 01:39 AM

lol chad why do you even bother to reply to a flamebait?

**[XC] riptide** · 08-30-2009, 04:20 AM

Originally Posted by Drwho?

Amazing, the code I did (in 3 hours) is amazingly faster than what was used as the reference for the CPU... hehehehe... thanks for making my point even better... the comparison is flawed. Let's get this binary and run a real contest.

I kept the threaded version in my back pocket just in case ... ;-)
Get the binary and compare!

Who. Your code is bad. On my 3.99GHz E8500 with 2 threads, I have done 1000000! in 6m 50.236 seconds. It took you 12 hours, and I presume it was on an i7. And you were digging your own hole from the moment you started the argument. You should have started out the argument with the latest multithreaded algorithms instead of cooking up your own searched-for basic recursive one. That recursive algorithm is the first lesson in any C++ tutorial on recursion.

Now that we already have a GPU-accelerated app for factorials, this is becoming a little silly. I see goal posts being moved in the following discussion. BTW, in the chart I listed, the "," is the decimal point, as is often used in non-English continental Europe. The good news (for you) is that Intel seems to be ahead of AMD CPUs on this one.

**[XC] riptide** · 08-30-2009, 04:22 AM

Chad and Waffles.... be cool.

**David F** · 08-30-2009, 09:29 AM

Originally Posted by [XC] riptide

Who. Your code is bad. On my 3.99GHz E8500 with 2 threads, I have done 1000000! in 6m 50.236 seconds. It took you 12 hours, and I presume it was on an i7. And you were digging your own hole from the moment you started the argument. You should have started out the argument with the latest multithreaded algorithms instead of cooking up your own searched-for basic recursive one.

Good advice. Have you seen this page: http://www.luschny.de/math/factorial/Benchmark.html ?

1000000! in 5.1 seconds on an AMD-64! I'm guessing that could go sub 1 second on a fast i7.

I don't know how hard it would be to produce a CUDA implementation, but my gut feeling is it would be tough (you'd need to implement true bignum multiplication, I'm sure).

Edit: looking at the gains got from 2 threads, sub 1 second might be a bit optimistic.

**Drwho?** · 08-30-2009, 12:00 PM

Well, guys, when you play poker ... do you show your cards on the 1st turn?

lol ...
So, who wants to race?
Now that I have the CUDA camp bragging, let's try.

So, this is the best shot for the GPU, everybody agrees?

**[XC] riptide** · 08-30-2009, 12:15 PM

Originally Posted by Drwho?

Well, guys, when you play poker ... do you show your cards on the 1st turn?

lol ...
So, who wants to race?
Now that I have the CUDA camp bragging, let's try.

So, this is the best shot for the GPU, everybody agrees?

There is no 'CUDA' camp. There's you taking 12 hours to do 1000000!... and that's all really.

I also do a bit of work on the Collatz conjecture (HOTPO problem) with Jon Sonntag's project 'Collatz at Home'. If you're familiar with the Collatz conjecture... http://en.wikipedia.org/wiki/Collatz_conjecture http://boinc.thesonntags.com/collatz/index.php

The same work takes several times longer on an x86 processor than on either ATI's Stream or NVidia's CUDA. It also has minimal need for the CPU at all, unlike the factorial problem.
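For readers unfamiliar with HOTPO ("half or triple plus one"), here is a minimal C++ sketch of the step count for one starting value. The name `collatz_steps` is hypothetical and this is not taken from the Collatz at Home code; it only illustrates the rule. Each starting value is processed independently of every other one.

```cpp
#include <cassert>
#include <cstdint>

// Count HOTPO steps until n reaches 1: halve if even, else triple and add one.
// No overflow guard: 3n+1 is assumed to stay within 64 bits.
unsigned collatz_steps(uint64_t n)
{
    assert(n >= 1);
    unsigned steps = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        steps++;
    }
    return steps;
}
```

For example, starting from 6 the sequence is 6, 3, 10, 5, 16, 8, 4, 2, 1, so `collatz_steps(6)` returns 8.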

**Drwho?** · 08-30-2009, 12:25 PM

Originally Posted by [XC] riptide

There is no 'CUDA' camp. There's you taking 12 hours to do 1000000!... and that's all really.

Yep ... code done in 3 hours, copied and pasted off the internet into an MFC wrapper ... so now let's see if you can beat my optimized code ...

So, this is the best shot for factorial on CUDA? Does everybody agree?

**[XC] riptide** · 08-30-2009, 12:33 PM

Originally Posted by Drwho?

So, this is the best shot for factorial on CUDA? Does everybody agree?

What is the best shot?

**David F** · 08-30-2009, 12:41 PM

Originally Posted by Drwho?

Yep ... code done in 3 hours, copied and pasted off the internet into an MFC wrapper ... so now let's see if you can beat my optimized code ...

So, what algorithm are you using?

So, this is the best shot for factorial on CUDA? Does everybody agree?

Best shot for code currently available, possibly. They're fairly obviously not using one of the efficient algorithms - so the question is whether it's even possible to implement one on CUDA.

Having now researched some algorithms, a key question, I think, is whether there's a fast "BigInt" multiply available for CUDA? Googling, it looks like quite a few people have tried, but I don't see any evidence any have succeeded.

**Mechromancer** · 08-30-2009, 12:58 PM

Does CUDA factorial run slower on Intel CPUs for some reason?

**David F** · 08-30-2009, 01:12 PM

Originally Posted by Mechromancer

Does CUDA factorial run slower on Intel CPUs for some reason?

Well, it certainly runs faster when you only ask for 100000! instead of 1000000!...

**[XC] riptide** · 08-30-2009, 01:16 PM

Originally Posted by Mechromancer

Does CUDA factorial run slower on Intel CPUs for some reason?

Buddy. Try 4 threads next time and something higher than 100,000

**Lightman** · 08-30-2009, 01:21 PM

Originally Posted by Mechromancer

Does CUDA factorial run slower on Intel CPUs for some reason?

PIC

Psss! You're missing a ZERO there

**Chumbucket843** · 08-30-2009, 01:32 PM

Omg 1,000,000! on both cpu and gpu. Shaders are actually at 1450 and it's a 192.

**Drwho?** · 08-30-2009, 01:32 PM

I use the same algo as what I posted, except that I use 64-bit numbers instead of 8-bit ones, with a 64-bit carry too, and it is all threaded.

I was trying to see if I could use the SSE2 registers to process 2 x 64 bits, with the supershuffle engine forwarding the carry up into the stack. This one is a bad idea ... I am finishing the SSE2 version, then let's compare ... I will be able to claim X times faster than CUDA ... I'll probably have it finished in a very few days ...
I was still unhappy, as I was only using 64 bits out of the 128 bits ... so, here is the interesting part ... when you are calculating the 8 partials:

for (cpu=0;cpu<8;cpu++)
{
_param->TM=1+cpu*N/8;
_param->TN=(cpu+1)*N/8;
_param->ps=ps[cpu];
T[cpu]=AfxBeginThread((AFX_THREADPROC)Thread1,_param,THREAD_PRIORITY_NORMAL);
}
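The range split used in the loop above can be looked at in isolation. The sketch below assumes that `TM` and `TN` are the inclusive first and last factors handled by one thread, which the post does not state; `split_range` is a hypothetical name, and no threads are started here.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Split 1..n into `parts` contiguous ranges using the same bounds as the
// loop above: range k runs from 1 + k*n/parts to (k+1)*n/parts inclusive.
// 64-bit arithmetic keeps k*n from overflowing for large n.
std::vector<std::pair<long long, long long> > split_range(long long n, int parts)
{
    assert(n >= 0 && parts > 0);
    std::vector<std::pair<long long, long long> > out;
    for (int k = 0; k < parts; k++) {
        long long first = 1 + k * n / parts;
        long long last = (k + 1) * n / parts;
        out.push_back(std::make_pair(first, last));
    }
    return out;
}
```

With n = 1000000 and 8 parts, the first range is 1 to 125000 and the last is 875001 to 1000000, and each range starts one past the end of the previous one. The ranges have equal length but not equal cost, since the factors in the later ranges are larger.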

The last optimization I am trying to get in, is the function:

int Multiply(Node *B,Node *A,Node *C)

I believe that I can write code that will use the low 64 bits for multiplying 2 lists at a time, fully using the 128-bit registers and the dual execution units of the CPU core ...

Who wants to put some $ on the table ...

I am tightening up some parts with _ASM{}, just for the hell of it ...
PS: I added the version of David into it, as reference, One button MFC ...

Francois

**[XC] riptide** · 08-30-2009, 01:42 PM

Originally Posted by Drwho?

Who wants to put some $ on the table ...

Poker talk eh? Your hand was already down at post 86! And all the posts before it.

**David F** · 08-30-2009, 01:44 PM

Originally Posted by Drwho?

I use the same algo as what I posted, except that I use 64-bit numbers instead of 8-bit ones, with a 64-bit carry too, and it is all threaded.

Did you see the site I linked to?

1e6! in 5 seconds (several thousand times faster than your first attempt). To cap it all, it's written in Java!

Trying to optimize the naive iterative method (which is what I coded as well, don't get me wrong) to beat that is not likely to succeed.

My immediate thought was "you can't really do better than the obvious iterative solution", but of course you can. A simple "divide and conquer" does better, because although you still have 1000000 multiplies to do, you're not having to do them all with one number having millions of digits.

But to combine the partial results, you need a fast "bignum" multiply (that can multiply two numbers with 1e6 digits quickly). Lots available for x86. To the best of my knowledge, none available for CUDA.
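A minimal C++ sketch of the divide-and-conquer structure described above, under stated assumptions: all names (`BigNum`, `multiply`, `range_product`, `factorial`, `to_decimal`) are hypothetical, and `multiply` is a plain schoolbook routine used for brevity. It therefore shows the shape of the product tree, not the speedup; the gain would come from swapping in a fast bignum multiply at that one point.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Little-endian big integer, nine decimal digits per limb (base 10^9).
typedef std::vector<uint32_t> BigNum;
static const uint64_t BASE = 1000000000ULL;

// Schoolbook O(n*m) multiply. A fast multiply would replace this function.
BigNum multiply(const BigNum &x, const BigNum &y)
{
    std::vector<uint64_t> acc(x.size() + y.size(), 0);
    for (size_t i = 0; i < x.size(); i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < y.size(); j++) {
            uint64_t cur = acc[i + j] + (uint64_t)x[i] * y[j] + carry;
            acc[i + j] = cur % BASE;
            carry = cur / BASE;
        }
        acc[i + y.size()] += carry;
    }
    BigNum out(acc.begin(), acc.end());
    while (out.size() > 1 && out.back() == 0)
        out.pop_back();                      // drop leading zero limbs
    return out;
}

// Product lo * (lo+1) * ... * hi, halving the range at each level so that
// the multiplies near the top combine operands of similar size.
BigNum range_product(uint32_t lo, uint32_t hi)
{
    if (lo > hi) return BigNum(1, 1u);       // empty product is 1
    if (lo == hi) return BigNum(1, lo);
    uint32_t mid = lo + (hi - lo) / 2;
    return multiply(range_product(lo, mid), range_product(mid + 1, hi));
}

BigNum factorial(uint32_t n)
{
    assert(n < BASE);                        // each factor must fit in one limb
    return range_product(1, n);
}

// Decimal string: top limb unpadded, the rest zero-padded to nine digits.
std::string to_decimal(const BigNum &n)
{
    char buf[16];
    std::snprintf(buf, sizeof buf, "%u", (unsigned)n.back());
    std::string s(buf);
    for (size_t i = n.size() - 1; i-- > 0; ) {
        std::snprintf(buf, sizeof buf, "%09u", (unsigned)n[i]);
        s += buf;
    }
    return s;
}
```

For example, `to_decimal(factorial(20))` yields `2432902008176640000`. The two halves of each range are independent, so they could also be handed to separate threads.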

In which case I suspect x86 wins big here. Probably not forever though.

As is so often the case, choice of algorithm is more important than processing speed. And this is where ease/flexibility of programming comes in - it makes it easier to choose the right algorithm as opposed to thinking "write a bignum multiply for CUDA - youch!".

[But if someone actually manages to do that for CUDA, I expect it would beat x86].

**[XC] riptide** · 08-30-2009, 01:45 PM

Originally Posted by Chumbucket843

Omg 1,000,000! on both cpu and gpu. Shaders are actually at 1450 and it's a 192.

I'm gonna start a thread in the benchmark section. I'd like to see what some of the guys get there.

**Mechromancer** · 08-30-2009, 01:53 PM

Originally Posted by [XC] riptide

Buddy. Try 4 threads next time and something higher than 100,000

Ah, I was thinking that 1,000,000 was the 100,000 preset. Never mind!

**Chumbucket843** · 08-30-2009, 01:56 PM

Originally Posted by [XC] riptide

I'm gonna start a thread in the benchmark section. I'd like to see what some of the guys get there.

i posted that because no one did a cpu to gpu comparison and i was curious to see what the results would be.

**Drwho?** · 08-30-2009, 02:00 PM

David, I already pointed this page out in post http://www.xtremesystems.org/forums/...6&postcount=39; I am very aware of this method ;-) ... This is the next point I wanted to make ...
I have half of it in C++ already ... some parts are tricky to convert back to C++.

Francois
