
Thread: AMD does reverse GPGPU, announces OpenCL SDK for x86


  1. #11
    Xtreme Enthusiast
    Join Date
    Dec 2007
    Posts
    816
    Quote Originally Posted by Talonman View Post
    AMD does reverse GPGPU, announces OpenCL SDK for x86

    By Jon Stokes | Last updated August 6, 2009 6:15 AM CT

    http://arstechnica.com/hardware/news...dk-for-x86.ars

    "AMD has announced the release of the first OpenCL SDK for x86 CPUs, and it will enable developers to target x86 processors with the kind of OpenCL code that's normally written for GPUs. In a way, this is a reverse of the normal "GPGPU" trend, in which programs that run on a CPU are modified to run in whole or in part on a GPU.

    Why would you want to run GPU programs on a CPU? Debugging is one reason, if you don't have access to an OpenCL-compliant GPU. And for now, that's essentially what developers will be doing, since the new SDK doesn't appear to be able to target GPUs yet. But eventually, developers will be able to write in OpenCL and target multicore x86 CPUs alongside GPUs from NVIDIA, AMD, and Intel. Of course, when you can write once and target a variety of parallel hardware types, the fact that Larrabee runs x86 will be irrelevant; so Intel had better be able to scale up Larrabee's performance, because its x86 support will not be a selling point (at least for Larrabee as a GPU, though an HPC coprocessor might be a different story).

    Note that you can already write once, run anywhere for GPUs and multicore x86, but you'd have to use RapidMind's proprietary middleware layer. Because it's more than just an API (the middleware does just-in-time compilation targeting whatever hardware is in the system, dynamic load-balancing, and real-time optimization), an OpenCL vs. RapidMind comparison is a little bit apples-to-oranges, but only just a bit.

    In reality, few workloads are such that you can break them up in the design phase into parallel chunks so that a middleware layer can dynamically map them to hardware resources at run-time. Certainly there are some problem domains that this works for—finance is one that comes to mind at the moment—but these are very specialized (though profitable) niches. Most of the stuff that ordinary developers will want to do with GPGPU in the medium-term is more mundane and application-specific, like using the GPU to speed up some specific part of a common application in order to give a performance boost vs. the CPU alone. In other words, these common apps don't solve data-parallel, compute-intensive problems—rather, they have specific parts that need acceleration, and if there's a capable GPU available then they can use OpenCL to hand off that part to it.

    Note that Snow Leopard will come with an OpenCL implementation that works on both CPUs and GPUs. Ars will have a review when it launches, so stay tuned."
    __________________________________________________________

    I find this odd...

    Could this mean that ATI is having issues getting OpenCL to run on their GPUs?

    Note the last comment...
    "Snow Leopard will come with an OpenCL implementation that works on both CPUs and GPUs."
    It is because of the GPGPU vs. CPU paradox ... I tried to explain it in my blog ...
    At OpenCL: http://www.khronos.org/opencl/
    When you look at the OpenCL API, you'll understand very quickly that the CPU has some skills today that the GPU does not have yet. With many cores, you need a lot of cache to store the data between the steps of your OpenCL procedure calls; if you go off the socket, or off your GPU, your performance sucks. The tricks that NVIDIA uses with CUDA only work if you touch your data once: NV uses the thread scheduler to freeze a thread when it stalls on a memory access, move on to the next thread, and come back to the first one when the memory request is done. They use a hardware scheduler.
    This works when your algorithm is not f(n-1), i.e. when loop iteration n has no dependence on iteration (n-1).
    Whatever you have seen come out of CUDA so far is a bunch of corner-case algorithms. To see that it is not generic: you cannot get any version of SPECint or SPECfp to run on those GPUs, while some parts of SPEC are "very parallelizable".
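    The f(n-1) point above can be sketched in plain C (function names are illustrative): the first loop's iterations are independent and map naturally onto GPU work-items, while the second carries a result from iteration to iteration and cannot be split up the same way.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Independent iterations: out[i] depends only on in[i].
     * Each iteration could become one GPU work-item. */
    void scale(const float *in, float *out, size_t n, float k) {
        for (size_t i = 0; i < n; i++)
            out[i] = k * in[i];
    }

    /* Loop-carried dependency: iteration i needs the result of
     * iteration i-1 (a running sum), so the loop is inherently
     * serial without a restructured (e.g. tree-based) algorithm. */
    void prefix_sum(const float *in, float *out, size_t n) {
        float acc = 0.0f;
        for (size_t i = 0; i < n; i++) {
            acc += in[i];
            out[i] = acc;
        }
    }

    int main(void) {
        float in[4] = {1, 2, 3, 4}, a[4], b[4];
        scale(in, a, 4, 2.0f);
        prefix_sum(in, b, 4);
        assert(a[3] == 8.0f);   /* 2 * 4 */
        assert(b[3] == 10.0f);  /* 1+2+3+4 */
        return 0;
    }
    ```

    The hardware scheduler trick described above hides latency in the first shape of loop; it does nothing for the second, which is why "very iterative" workloads stay on the CPU.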

    AMD found that x86 is very flexible for 128-bit vectors, with many loops using each other's results ... and that is the base of x86 (via SSE2)! I am sure they will add more support for the GPU when it makes sense, when the workload is massively parallel and takes advantage of GPU acceleration. With AVX coming, it will be a really big challenge to beat the processor at 256-bit float and double processing.

    The other part is the memory size limit. The GPUs today are seriously limited, and the places where you need TFLOPS require a lot of memory ... right now, the GPGPUs are using their GDDR5 as a cache to the main memory ... PCI Express is a very poor cache protocol ...

    The GPUs are in a world that requires the programmer to make the code 100% perfect: loads have to be 100% aligned, stores too, and vectorization has to be done perfectly ... It is a world of pain, and if you don't have a programmer with a PhD in parallelism or SIMD, you will not see the end of the project ... If you want OpenCL to perform well on a GPU, you still have to plan for all of those tricks.
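    The alignment pain exists on the CPU's SIMD side too, just in milder form. A sketch of the constraint, assuming a C11 compiler on x86 (the function name is illustrative): the fast aligned load/store intrinsics demand 16-byte-aligned pointers, and handing them anything else is undefined behavior.

    ```c
    #include <assert.h>
    #include <emmintrin.h>
    #include <stdlib.h>

    /* Scale n floats (n a multiple of 4) by k using the *aligned*
     * load/store forms; buf must be 16-byte aligned, or this is
     * undefined behavior -- exactly the kind of constraint the
     * post is complaining about. */
    void scale_aligned(float *buf, size_t n, float k) {
        __m128 vk = _mm_set1_ps(k);
        for (size_t i = 0; i < n; i += 4) {
            __m128 v = _mm_load_ps(buf + i);   /* requires 16-byte alignment */
            _mm_store_ps(buf + i, _mm_mul_ps(v, vk));
        }
    }

    int main(void) {
        /* aligned_alloc (C11) guarantees the 16-byte alignment. */
        float *buf = aligned_alloc(16, 8 * sizeof(float));
        for (int i = 0; i < 8; i++) buf[i] = (float)i;
        scale_aligned(buf, 8, 3.0f);
        assert(buf[3] == 9.0f);  /* 3.0f * 3 */
        free(buf);
        return 0;
    }
    ```

    On a GPU of this era the equivalent rules (coalesced, aligned accesses) decide whether you get the advertised bandwidth at all, which is why the post calls it a world of pain.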

    So, AMD will probably increase their support for GPGPU over time; they have their own stable starting point, and it is always x86.
    You will see the CPU winning all the very iterative workloads on OpenCL.

    Francois
    Last edited by Drwho?; 08-10-2009 at 07:44 AM.
    DrWho, The last of the time lords, setting up the Clock.
