View Full Version : Nvidia releases OpenCL GPGPU driver
saaya
04-21-2009, 03:01 AM
http://www.xbitlabs.com/news/video/display/20090420170635_Nvidia_Enables_Non_Proprietary_GPGPU_by_Releasing_OpenCL_Drivers_to_Developers.html
Today’s news-release is an announcement of beta OpenCL-supporting drivers that are due to be available later this year. At present, Nvidia is providing its OpenCL driver and software development kit (SDK) to solicit early feedback in advance of a beta release which will be made available to all GPU Computing Registered Developers in the coming months. So, while software developers are unlikely to start creating commercial software using OpenCL API, they will get access to tools necessary to start development.
Seems they are preparing for OpenCL in case PhysX fails :D
zanzabar
04-21-2009, 03:06 AM
Seems they are preparing for OpenCL in case PhysX fails :D
Or they realized that if Havok does OpenCL, PhysX will be dead, so they are making it do OpenCL. And Apple supports OpenCL and not CUDA in OS X, so with the 48xx in the Mac Pro, NV doesn't want to be left out.
saaya
04-21-2009, 03:28 AM
Sorry, don't get it... can you fix the multiple fractures this sentence suffered by falling down the stairs please? :D
posted: http://www.xtremesystems.org/forums/showthread.php?t=222461
Only for developers ... but there aren't any OpenCL developers today ... :) It is music of the far future ...
zanzabar
04-21-2009, 03:39 AM
i think that i fixed it.
1) If Havok runs on OpenCL and Intel and AMD back OpenCL, Havok has an advantage over PhysX, but if PhysX and NV cards do OpenCL then they are back on equal ground for the card and the API.
2) Apple has had job postings for GPGPU programmers recently and they only support OpenCL. But NV is used in most Apples as of now, and to keep that going NV needed OpenCL. Then, with Apple offering the 48xx in the new i7 Mac Pros, if something major comes out that does OpenCL, NV will be left out.
In summation: Apple backs OpenCL, NV likes having Apple use them as the main GPU platform, and Apple wants to use OpenCL, so NV has to add it or Apple has to change GPU suppliers.
if that makes any sense.
trinibwoy
04-21-2009, 04:49 AM
lol...wut?
hollo
04-22-2009, 05:03 AM
HD4870 - 800 stream processors
GTX285 - 240 stream processors
larrabee - 16 hyperthreaded cores (64 virtual cores)
FIGHT
Shintai
04-22-2009, 05:07 AM
HD4870 - 800 stream processors
GTX285 - 240 stream processors
larrabee - 16 hyperthreaded cores (64 virtual cores)
FIGHT
That's not entirely right.
HD4870 has 160 processors capable of 5 each = 800.
GTX285 has 240 capable of 2 each = 480.
Larrabee has 32 cores (128 virtual). Each core has a 512-bit SIMD unit (16 FP32 lanes), so 32 x 16 = 512.
800 vs 480 vs 512. But all 3 have different clocks.
trinibwoy
04-22-2009, 08:59 AM
HD4870 - 800 stream processors
GTX285 - 240 stream processors
larrabee - 16 hyperthreaded cores (64 virtual cores)
FIGHT
What's a virtual core? I think what you meant was:
HD4870 - 10 cores (800 ALUs)
GTX285 - 30 cores (240 ALUs)
larrabee - 16 cores (256 ALUs)
Hyperthreading is meaningless. GPUs are way more "hyperthreaded" than Larrabee, e.g. GT200 supports 1024 threads per core compared to Larrabee's 4.
deeperblue
04-22-2009, 09:50 AM
What's a virtual core? I think what you meant was:
HD4870 - 10 cores (800 ALUs)
GTX285 - 30 cores (240 ALUs)
larrabee - 16 cores (256 ALUs)
Hyperthreading is meaningless. GPUs are way more "hyperthreaded" than Larrabee, e.g. GT200 supports 1024 threads per core compared to Larrabee's 4.
I don't know what core count Larrabee really has, but let's assume it has 32 cores and runs at 2 GHz:
Pure max FP32 FLOPS:
HD4890: 10 clusters * 16 SP * 10 (5 FP32 MADD = 10 FLOP) * 850 MHz = 1360 GFLOPS
GTX285: 10 clusters * 3 SM * 8 SP * 3 (1 FP32 MADD + 1 FP32 MUL = 3 FLOP) * 1476 MHz = 1062 GFLOPS
Larrabee: 32 cores * 32 (512-bit SIMD => 16 FP32 MADD = 32 FLOP) * 2000 MHz = 2048 GFLOPS
Worst-case FP32 FLOPS scenario: totally non-parallel code, which means all SIMD units are effectively reduced to SISD units:
HD4890: 10 clusters * 1 SP * 10 (5 FP32 MADD = 10 FLOP) * 850 MHz = 85 GFLOPS
GTX285: 10 clusters * 3 SM * 1 SP * 3 (1 FP32 MADD + 1 FP32 MUL = 3 FLOP) * 1476 MHz = 132 GFLOPS
Larrabee: 32 cores * 2 (512-bit SIMD => 1 FP32 MADD = 2 FLOP) * 2000 MHz = 128 GFLOPS
We can see that, due to their internal layout, the ATI/AMD and Intel designs take a deeper dip from max performance than Nvidia's (down to 1/16 for Intel and AMD, 1/8 for Nvidia).
That makes AMD and Intel much more dependent on driver/software optimization than Nvidia.
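To make the arithmetic easy to re-run with other guesses, here is a minimal Python sketch of the same peak and worst-case calculation. The unit counts and clocks are the ones quoted in this post; the Larrabee core count and clock are the poster's assumption, and the names `gflops`, `chips` and `retained` are made up for illustration.

```python
# Peak vs. worst-case FP32 throughput, using the figures from the post.
# Larrabee's 32 cores at 2 GHz is the poster's guess, not a published spec.

def gflops(groups, lanes, flops_per_lane, clock_mhz):
    """Theoretical GFLOPS for `groups` SIMD groups of `lanes` lanes each."""
    return groups * lanes * flops_per_lane * clock_mhz / 1000

# name: (SIMD groups, lanes per group, FLOPs per lane per clock, clock in MHz)
chips = {
    "HD4890":   (10, 16, 10, 850),   # 10 clusters x 16 SPs, 5 MADDs per SP
    "GTX285":   (30,  8,  3, 1476),  # 10 clusters x 3 SMs, 8 SPs, MADD + MUL
    "Larrabee": (32, 16,  2, 2000),  # 32 cores, 16-wide SIMD MADD (assumed)
}

retained = {}
for name, (groups, lanes, flops, mhz) in chips.items():
    peak = gflops(groups, lanes, flops, mhz)
    worst = gflops(groups, 1, flops, mhz)  # only one active lane per group
    retained[name] = worst / peak
    print(f"{name}: {worst:.0f} of {peak:.0f} GFLOPS (1/{lanes} of peak)")
```

The retained fraction is simply one over the SIMD width, which is why the 16-wide designs drop to 1/16 and the 8-wide SM drops to 1/8.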
For sure it will be very interesting when the first OpenCL applications appear, and maybe Folding@home and others release an OpenCL version so we can compare :)
trinibwoy
04-22-2009, 11:13 AM
GTX285: 10 clusters * 3 SM * 1 SP * 3 (1 FP32 MADD + 1 FP32 MUL = 3 FLOP) * 1476 MHz = 132 GFLOPS
You're right, Nvidia's corner cases aren't as bad as the other guys'. But I'm not sure there's any value in looking at numbers for single-threaded serial code since that stuff is never run on these architectures. The most relevant corner case would be where there's TLP but no ILP.
AMD: 10C * 16SP * 2F * 850 MHz = 272 GFLOPS
NVDA: 10C * 3SM * 8SP * 2F * 1476 MHz = 708 GFLOPS
LRB: 32C * 16SP * 2F * 2000 MHz = 2048 GFLOPS
Both Nvidia and Intel have gone the route where they do not suffer as badly from a lack of ILP. Intel has taken it a step further than Nvidia since they don't have to worry about issuing to the SFU. But these are just theoretical MADD numbers so their usefulness is limited. For example, Larrabee's actual throughput is gonna fall dramatically when you consider how many flops will be burned just emulating fixed function stuff that the other guys run on dedicated hardware.
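A rough Python sketch of the "TLP but no ILP" case, for anyone who wants to plug in other numbers: every lane stays busy, but each thread issues only one MADD (2 FLOPs) per clock. The lane counts, peak FLOPs per lane and clocks are the ones quoted in this thread (the Larrabee figures are guesses from the posts above), and the variable names are illustrative only.

```python
# "TLP but no ILP": all lanes are busy, one MADD (2 FLOPs) per lane per clock.
# Figures come from the thread; the Larrabee numbers are the posters' guesses.

MADD_FLOPS = 2

# name: (total lanes, peak FLOPs per lane per clock, clock in MHz)
chips = {
    "AMD":  (10 * 16,     10, 850),
    "NVDA": (10 * 3 * 8,   3, 1476),
    "LRB":  (32 * 16,      2, 2000),
}

fraction_of_peak = {}
for name, (lanes, peak_flops, mhz) in chips.items():
    no_ilp = lanes * MADD_FLOPS * mhz / 1000   # GFLOPS with one MADD per lane
    peak = lanes * peak_flops * mhz / 1000     # GFLOPS with every slot filled
    fraction_of_peak[name] = no_ilp / peak
    print(f"{name}: {no_ilp:.0f} GFLOPS, {no_ilp / peak:.0%} of peak")
```

Under these assumptions AMD keeps a fifth of its peak, Nvidia two thirds, and Larrabee all of it, which is the point being made about who depends most on ILP.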
Hyperthreading is meaningless. GPUs are way more "hyperthreaded" than Larrabee, e.g. GT200 supports 1024 threads per core compared to Larrabee's 4.
Actually that doesn't make sense. If you consider hyperthreading to mean the number of independent threads the hardware can track at once, a Larrabee thread would be more akin to Nvidia's warp of which GT200 supports 32 per SM/core.
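The 32 figure follows from the thread count quoted above, given that an Nvidia warp is 32 threads wide. A tiny Python check (the constant names are made up for illustration):

```python
WARP_SIZE = 32                   # threads per Nvidia warp
GT200_THREADS_PER_SM = 1024      # figure quoted earlier in the thread
LARRABEE_THREADS_PER_CORE = 4    # figure quoted earlier in the thread

# Independently scheduled units per core: warps on GT200, threads on Larrabee.
warps_per_sm = GT200_THREADS_PER_SM // WARP_SIZE
print(f"GT200: {warps_per_sm} warps per SM, "
      f"Larrabee: {LARRABEE_THREADS_PER_CORE} threads per core")
```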
LordEC911
04-22-2009, 11:20 AM
AMD: 10C * 16SP * 2F * 850 MHz = 272 GFLOPS
NVDA: 10C * 3SM * 8SP * 2F * 1476 MHz = 708 GFLOPS
LRB: 32C * 16SP * 2F * 2000 MHz = 2048 GFLOPS
Ummm... are you sure your Nvidia numbers are right??
trinibwoy
04-22-2009, 11:23 AM
Ummm... are you sure your Nvidia numbers are right??
Yep.
LordEC911
04-22-2009, 11:55 AM
Yep.
Yeah, my bad.
I got mixed up as to why you had 3SM x 8SP and instead of working it out my own way, I just thought it looked wrong.
deeperblue
04-22-2009, 12:38 PM
AMD: 10C * 16SP * 2F * 850 MHz = 272 GFLOPS
NVDA: 10C * 3SM * 8SP * 2F * 1476 MHz = 708 GFLOPS
LRB: 32C * 16SP * 2F * 2000 MHz = 2048 GFLOPS
You're right, that's probably the more important extreme case. It makes it even clearer why AMD needs to optimize their drivers more than Nvidia to extract as much computational power as possible.
And I agree, Larrabee looks like a monster on paper. I'm sure it's nice for number-crunching tasks, but for 3D graphics? We will see :)
I'm looking forward much more to seeing what AMD is working on, and of course to OpenCL drivers from all vendors, to have some coding fun.
W1zzard
04-23-2009, 11:24 AM
anyone got those drivers? send me a pm
saaya
04-23-2009, 06:18 PM
thx for the insight guys :toast:
interesting to learn a bit more about larrabee compared to other gpus.
so basically Nvidia's GPUs and Larrabee are ideal for dumb code, while ATI has a lot of raw power which needs a lot of optimization?
optimizations as in optimized for ATI specifically, or optimizations in general that would make the same code run faster on NV and Intel hw as well?
nsegative
04-23-2009, 06:50 PM
Interesting. Hope to see it running in-game sometime soon on ATI :D
trinibwoy
04-23-2009, 09:31 PM
so basically Nvidia's GPUs and Larrabee are ideal for dumb code, while ATI has a lot of raw power which needs a lot of optimization?
optimizations as in optimized for ATI specifically, or optimizations in general that would make the same code run faster on NV and Intel hw as well?
It's not about dumb code vs smart code. I don't know why the Inquirer started that bs in the first place. It's about having enough ILP to fill a VLIW processor. In some cases it's just not there and you have to jump through hoops to get good utilization from the hardware.
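A toy Python model of what "enough ILP to fill a VLIW processor" means: pack one thread's operations into 5-wide bundles, where an operation can only issue after everything it depends on has issued in an earlier bundle. This is a simplified greedy packer for illustration, not how AMD's shader compiler actually works, and `pack` and `utilization` are made-up names.

```python
# Toy VLIW packing: how full are 5-wide bundles for a given dependency graph?
# Illustrative only; real compilers also deal with slot types, latencies, etc.

def pack(deps, width=5):
    """deps maps op name -> set of ops it depends on. Returns list of bundles."""
    done, bundles = set(), []
    pending = list(deps)
    while pending:
        # Ops whose inputs were all produced by earlier bundles, up to `width`.
        ready = [op for op in pending if deps[op] <= done][:width]
        if not ready:
            raise ValueError("unsatisfiable dependency")
        bundles.append(ready)
        done.update(ready)
        pending = [op for op in pending if op not in done]
    return bundles

def utilization(deps, width=5):
    """Fraction of issue slots that hold a real operation."""
    return len(deps) / (len(pack(deps, width)) * width)

# Five independent MADDs: they all fit in one bundle.
independent = {f"m{i}": set() for i in range(5)}
# Five MADDs where each needs the previous result: one op per bundle.
chain = {f"m{i}": ({f"m{i-1}"} if i else set()) for i in range(5)}

print(f"independent ops: {utilization(independent):.0%} of slots used")
print(f"dependent chain: {utilization(chain):.0%} of slots used")
```

The dependent chain fills one slot in five, the same fraction as the 272 vs 1360 GFLOPS figures quoted for AMD earlier in the thread.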