
Thread: ratGPU OpenCL raytracing benchmark

  1. #76
    Registered User
    Join Date
    May 2006
    Posts
    67
    @joshy,

    You have PM.

  2. #77
    Xtreme Member
    Join Date
    Aug 2007
    Posts
    282
    I just uploaded 0.4.6 for Windows:

    v0.4.6 Alpha
    - Extended the alpha to June 1, 2011.
    - Patched a problem with ATI cards that was causing some system hangs. Unfortunately, it has been “patched” and not really fixed: the patch avoids the hang, but at the cost of some rendering speed.
    - Removed the version number from the 3dsmax plug-in filename because it was causing some duplicate GUID messages. Newer versions will just overwrite the file.
    - Fixed a color mismatch between 3dsmax and the standalone renderer.
    - The 3dsmax plug-in now uses 3dsmax's gamma settings.
    - The gamma slider has been removed from ratGPU's config dialog.
    - Recompiled using the latest libraries.
    Last edited by jogshy; 01-04-2011 at 07:07 PM.

  3. #78
    Xtreme Member
    Join Date
    Aug 2007
    Posts
    282
    Ok! Version 0.4.7 has been released!
    ATI cards should now run at the same speed as previous versions ( but without hanging ).

    Go test it with your new 6970s, VIA/S3 eH1s and Sandy Bridges, pls
    Last edited by jogshy; 01-04-2011 at 07:11 PM.

  4. #79
    Registered User
    Join Date
    Dec 2010
    Posts
    12
    downloading....

    Any plans to add CUDA and MS DX11 DirectCompute implementations to compare against this OpenCL one?
    Hmm, would the CPU render be faster using Intel C++ with Intel MKL?
    It's nice to have more rendering benchmarks for work [not games], like ratGPU and Cinebench.
    Last edited by realbabilu; 01-06-2011 at 02:41 AM.

  5. #80
    Banned
    Join Date
    Oct 2006
    Posts
    963
    excellent work dude.... dl'ing....

  6. #81
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    Here's what I see on my 4GHz Thuban with ATI Stream (the 5870 is not being used, since it dies under even modest loads - crappy gen, terrible QC). Not sure why it shows both devices, since I checked the "Use this device" checkbox.

    --Matt
    Attached Images
    Last edited by mattkosem; 01-06-2011 at 04:41 AM.

  7. #82
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    Woohoo, 5th try's a charm. Cypress @ stock.

    --Matt
    Attached Images

  8. #83
    Banned
    Join Date
    Oct 2006
    Posts
    963
    @mattkosem - which driver are you using... the 10.9s?

    this bench locks up for me when I try to run it using 10.12...

    @jogshy - so do I need to install the 10.9s until ATI updates the Stream SDK?

    thanks again for the effort it must have taken to produce this bench...

  9. #84
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    Quote Originally Posted by purecain View Post
    @mattkosem - which driver are you using... the 10.9s?

    this bench locks up for me when I try to run it using 10.12...

    @jogshy - so do I need to install the 10.9s until ATI updates the Stream SDK?

    thanks again for the effort it must have taken to produce this bench...
    I'm using the 10.11s and Stream SDK 2.3. I was getting the same crashing on the 0.4.5 build. Are you running the latest one?

    --Matt

  10. #85
    Xtreme Member
    Join Date
    Aug 2007
    Posts
    282
    Not sure why it shows both devices, since I checked the "Use this device" checkbox.
    You can use multiple devices: just change the device using the combo box and check "Use this device" for each device you want to render with.
    For example, you can use hybrid rendering with CPU+GPU... or only the GPU... or only the CPU... or 3 GPUs ( yes, the test supports SLI/CrossFire ).
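    Host-side, that boils down to one command queue per checked device. A minimal OpenCL sketch ( illustrative only, not ratGPU's actual source ):
    Code:
    // Illustrative OpenCL host setup: one context spanning the selected
    // devices and one command queue per device, so each device can be
    // handed its own tiles. Not ratGPU's actual source.
    #include <CL/cl.h>
    #include <stdio.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id   devices[8];
        cl_uint        numDevices = 0;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &numDevices);

        // One context spanning the selected devices...
        cl_context ctx = clCreateContext(NULL, numDevices, devices, NULL, NULL, NULL);

        // ...and one queue per device: each queue gets its own tiles.
        cl_command_queue queues[8];
        for (cl_uint i = 0; i < numDevices; ++i)
            queues[i] = clCreateCommandQueue(ctx, devices[i], 0, NULL);

        printf("rendering on %u device(s)\n", numDevices);
        /* ...enqueue tile kernels on each queue... */

        for (cl_uint i = 0; i < numDevices; ++i)
            clReleaseCommandQueue(queues[i]);
        clReleaseContext(ctx);
        return 0;
    }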

    Quote Originally Posted by realbabilu View Post
    Any plans to add CUDA and MS DX11 DirectCompute implementations to compare against this OpenCL one?
    Yep, I plan to write CUDA and DirectCompute implementations too for comparison... but currently I'm focusing on OpenCL because it's portable and multivendor.

    Hmm, would the CPU render be faster using Intel C++ with Intel MKL?
    Maybe.
    The "ratGPU C++" implementation is currently compiled with VS 2008, which optimizes reasonably for both AMD and Intel CPUs, and it uses TBB; but it's really there just so the program can run on computers without OpenCL.
    For all other cases, the OpenCL implementation itself will optimize properly for the platform ( for instance, the Intel OpenCL SDK Alpha uses TBB 2.2, SSE3 intrinsics and the latest Intel compiler... NVIDIA uses an onion layer based on CUDA... and ATI uses an onion layer based on CAL, etc. ).

    so do I need to install the 10.9s until ATI updates the Stream SDK?
    Versions prior to the newest 0.4.7 may hang with ATI cards, but I think I solved the problem in 0.4.7 ( at least it does on the 5750 I use to develop ). So 0.4.7 should work ok with Catalyst 10.12 APP ( no need for the Stream SDK; the APP version already includes it ). AFAIK, Catalyst 10.11 is a bit buggy for OpenCL, so better use the 10.12 APP if you can.
    If ratGPU hangs, then please try Catalyst 10.9 and the Stream SDK v2.3 ( or the old 2.2 ).
    And, yep, OpenCL drivers are still a bit unstable... 8(

    Btw, anybody out there with an S3/Via eH1 card to test, pls?
    Last edited by jogshy; 01-06-2011 at 11:16 AM.

  11. #86
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    Quote Originally Posted by jogshy View Post
    You can use multiple devices: just change the device using the combo box and check "Use this device" for each device you want to render with.
    For example, you can use hybrid rendering with CPU+GPU... or only the GPU... or only the CPU... or 3 GPUs ( yes, the test supports SLI/CrossFire ).
    That much is clear. I'm not home now, but I was only intending to run with the CPU on that bench (mostly because my janky GPU crashes my PC constantly). If it really did use both, it's odd that it was so much slower that way. One would think that more processing power would yield a better score, not worse. That is, unless it is evenly dividing the workload across both of them. Does this app not load balance based on performance, like SmallptGPU?

    --Matt

  12. #87
    Xtreme Member
    Join Date
    Aug 2007
    Posts
    282
    Quote Originally Posted by mattkosem View Post
    If it really did use both, it's odd that it was so much slower that way. One would think that more processing power would yield a better score, not worse. That is, unless it is evenly dividing the workload across both of them. Does this app not load balance based on performance?
    Hybrid rendering ( like many heterogeneous tasks ) has a problem: a chain is only as strong as its weakest link. That's why, for instance, supercomputers based on MPICH2 don't like heterogeneous clusters.

    If the GPU is rendering very fast but the CPU is not, then the CPU will just slow down the process. On the other hand, with the CPU under heavy workload, the asynchronous kernel calls ( which are issued from a CPU thread ) and the PCI transfers ( DMA is a myth ) will slow down too.
    The program balances the workload, but if the image is not very large then the CPU will make everything slower ( because the last tile will be in a wait state almost all the time ). For big HD images it could be worth the effort, though.
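    A toy C++ simulation of that weakest-link effect ( the per-tile times are made-up numbers, not measurements ): a static half/half split is gated by the slow device, while pulling tiles from a shared queue lets the fast one absorb the slack.
    Code:
    #include <algorithm>
    #include <cstdio>

    int main()
    {
        const int tiles = 16;
        const double gpuTileMs = 10.0, cpuTileMs = 80.0; // hypothetical per-tile times

        // Static 50/50 split: each device renders half the tiles on its own,
        // so the frame takes as long as the slowest half.
        double staticMs = std::max((tiles / 2) * gpuTileMs, (tiles / 2) * cpuTileMs);

        // Shared queue: each tile goes to whichever device would finish it first.
        int cpuTiles = 0, gpuTiles = 0;
        double cpuBusy = 0, gpuBusy = 0;
        for (int t = 0; t < tiles; ++t) {
            if (gpuBusy + gpuTileMs <= cpuBusy + cpuTileMs) { gpuBusy += gpuTileMs; ++gpuTiles; }
            else                                            { cpuBusy += cpuTileMs; ++cpuTiles; }
        }
        double dynamicMs = std::max(cpuBusy, gpuBusy);

        // Prints: static split: 640 ms, shared queue: 150 ms (gpu 15 / cpu 1 tiles)
        std::printf("static split: %.0f ms, shared queue: %.0f ms (gpu %d / cpu %d tiles)\n",
                    staticMs, dynamicMs, gpuTiles, cpuTiles);
        return 0;
    }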
    Last edited by jogshy; 01-06-2011 at 02:19 PM.

  13. #88
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    Quote Originally Posted by jogshy View Post
    Hybrid rendering ( like many heterogeneous tasks ) has a problem: a chain is only as strong as its weakest link. That's why, for instance, supercomputers based on MPICH2 don't like heterogeneous clusters.

    If the GPU is rendering very fast but the CPU is not, then the CPU will just slow down the process. On the other hand, with the CPU under heavy workload, the asynchronous kernel calls ( which are issued from a CPU thread ) and the PCI transfers ( DMA is a myth ) will slow down too.
    The program balances the workload, but if the image is not very large then the CPU will make everything slower ( because the last tile will be in a wait state almost all the time ). For big HD images it could be worth the effort, though.
    Actually, it doesn't seem to be as efficient as you describe. It seems to spool off the tiles in pairs, but the next pair doesn't start rendering until both finish: one sits at 99% while the other finishes. If the tiles were smaller and it didn't wait like that, it would probably not be such a big slowdown.

    Could this be related to using the CPU and GPU through Stream? I haven't tried using the built-in CPU renderer with the Stream GPU renderer. I'll give it a try when I have a chance.

    --Matt

  14. #89
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    Yeah, it definitely seems to be related to running with the CPU and GPU via Stream. If I separate the CPU off with the C++ renderer, I don't have the problem and the bench finishes drastically faster. Unfortunately, the Stream CPU renderer is much faster than the C++ renderer, so there's some efficiency loss. Smaller tiles would definitely still pick up the pace when running with non-matching processors, though: the processor would only have to complete a much smaller workload at a time, which would give the GPU more opportunities to pick up the slack.

    --Matt
    Attached Images
    Last edited by mattkosem; 01-06-2011 at 05:10 PM.

  15. #90
    Registered User
    Join Date
    Dec 2010
    Posts
    12
    Maybe adding GotoBLAS or ACML with VC 2008 could add more power to the CPU render, so the difference between CPU and GPU won't be so big.

  16. #91
    Xtreme Member
    Join Date
    Aug 2007
    Posts
    282
    Quote Originally Posted by mattkosem View Post
    If the tiles were smaller and it didn't wait like that, it would probably not be such a big slowdown.
    Well, the GPU tiles must be big or I won't get enough data to fill the stream processors properly, hehe.

    Consider that Fermi has 30 multiprocessors, and to get full occupancy you'd need 30 multiprocessors x 48 max active warps/multiprocessor x 32 threads/warp = 46,080 threads, which is quite close to the 256x256 = 65,536 threads of a tile.
    ( I must use a greater value for future architectures ).
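    Spelled out in code, using the figures from the post ( SM counts and warp limits vary by chip; these are just the numbers quoted above ):
    Code:
    #include <stdio.h>

    int main(void)
    {
        /* jogshy's occupancy figures; values vary by architecture. */
        const int multiprocessors = 30;
        const int warpsPerSM      = 48;   /* max active warps per multiprocessor */
        const int threadsPerWarp  = 32;

        const int fullOccupancy = multiprocessors * warpsPerSM * threadsPerWarp; /* 46080 */
        const int tileThreads   = 256 * 256;                                     /* 65536 */

        printf("threads for full occupancy: %d\n", fullOccupancy);
        printf("threads per 256x256 tile:   %d\n", tileThreads);
        /* 65536 >= 46080, so a single tile is enough work to saturate the GPU. */
        return 0;
    }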

    However, I could try to use heterogeneous tile sizes when hybrid rendering is enabled: smaller for CPUs and big for GPUs... I definitely must investigate that.

    Could this be related to using the CPU and GPU through Stream?
    It might be. I read something on AMD's forums about DRDMA not working properly ( and, therefore, blocking the kernel invocations ).
    Try with the evil green goblin / dark green side of the force and see

    the Stream CPU renderer is much faster than the C++ renderer, so there's some efficiency loss
    Yep, that's probably because VS2008 optimizes in a way that's good for both Intel and AMD ( probably something in the middle, like a Pentium 4 ). I bet AMD's Stream CPU implementation is optimized only for AMDs, just as Intel's OpenCL SDK is optimized for Intels.
    On the other hand, LLVM does a very good job optimizing, while VS2008's compiler is a bit old ( and not very good at x64 )... but I cannot recompile using VS2010 yet because some libraries I use aren't ready for it.
    I'm also using TBB for the C++ renderer, which is really an Intel library ( probably not really optimized for AMD ).

    Well, my C++ implementation was designed just to catch bugs and to provide a compatibility layer for people without OpenCL. I don't think it's worth the effort to optimize it, especially when both Intel and AMD are providing OpenCL implementations.

    Btw, somebody with a Via eH1 or one of those amazing mini-ITX Zacate APUs to test, pls? And whereeeeeeeee are those Sandy monsters !!!
    Last edited by jogshy; 01-06-2011 at 06:57 PM.

  17. #92
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    Can you thread the multiprocessors individually, giving each bank a smaller chunk of the task?

    --Matt

  18. #93
    Xtreme Member
    Join Date
    Aug 2007
    Posts
    282
    Quote Originally Posted by mattkosem View Post
    Can you thread the multiprocessors individually, giving each bank a smaller chunk of the task?
    --Matt
    I don't think so. The whole data grid is automatically distributed across all the GPU's multiprocessors. I can neither "device-fission" an AMD GPU ( apparently it only works for CPUs ) nor assign task priorities.
    Another alternative is to fire several "incomplete" kernels asynchronously to compensate for the low # of threads, but the problem is that Fermi only allows 2 kernels to run simultaneously ( in ATI's case, one ).
    There's also another problem: each kernel launch takes a lot of time, so firing a lot of small kernels takes much more time than firing one big data chunk ( probably due to how PCIe works ).
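    In OpenCL terms the overhead point looks like this ( a sketch, assuming a valid queue and kernel already exist ): the total work is identical, but the second variant pays the fixed submission cost 64 times.
    Code:
    #include <CL/cl.h>

    /* Sketch only: same total work, very different launch overhead. */
    void launch_comparison(cl_command_queue queue, cl_kernel kernel)
    {
        /* One big NDRange: one fixed submission cost. */
        size_t big = 256 * 256;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &big, NULL, 0, NULL, NULL);

        /* Same work as 64 small launches: 64x the fixed submission cost,
           and serialized if the device can't overlap kernel execution. */
        size_t small = 32 * 32;
        for (int i = 0; i < 64; ++i) {
            size_t offset = (size_t)i * small;
            clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &small, NULL, 0, NULL, NULL);
        }
        clFinish(queue);
    }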
    Last edited by jogshy; 01-06-2011 at 08:12 PM.

  19. #94
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    Quote Originally Posted by jogshy View Post
    I don't think so. The whole data grid is automatically distributed across all the GPU's multiprocessors. I can neither "device-fission" an AMD GPU ( apparently it only works for CPUs ) nor assign task priorities.
    Another alternative is to fire several "incomplete" kernels asynchronously to compensate for the low # of threads, but the problem is that Fermi only allows 2 kernels to run simultaneously ( in ATI's case, one ).
    There's also another problem: each kernel launch takes a lot of time, so firing a lot of small kernels takes much more time than firing one big data chunk ( probably due to how PCIe works ).
    Ah, hardware or API limitations; that stinks. Have you taken a look at SmallptGPU2 (http://davibu.interfree.it/opencl/sm...lptGPU2.html)? It does a brief pre-render with the workload spread evenly across all supported devices to assess their performance, then splits the workload accordingly based on its findings. Since scene complexity isn't consistent across the whole picture it can never be perfect, but it definitely adds performance instead of taking it away. Food for thought. He made a different balancing solution for SmallLuxGPU as well, but I think the approach there is too far from your type of workload to be applicable.
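    ( That balancing reduces to a proportional split; a toy C++ version, with the per-device speeds as assumed inputs from such a pre-render pass, purely illustrative: )
    Code:
    #include <cstdio>
    #include <vector>

    int main()
    {
        const int totalRows = 512;
        // Assumed outputs of a short pre-render benchmark, e.g. GPU and CPU.
        std::vector<double> samplesPerSec = { 90.0, 30.0 };

        double totalSpeed = 0;
        for (double s : samplesPerSec) totalSpeed += s;

        // Hand out rows in proportion to measured speed;
        // the last device takes the rounding remainder.
        int assigned = 0;
        for (size_t i = 0; i < samplesPerSec.size(); ++i) {
            int rows = (i + 1 == samplesPerSec.size())
                     ? totalRows - assigned
                     : int(totalRows * samplesPerSec[i] / totalSpeed);
            assigned += rows;
            std::printf("device %zu -> %d rows\n", i, rows); // 384 / 128 here
        }
        return 0;
    }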

    Breaking off smaller chunks for the CPU would probably bring the same benefits, as long as the GPU is a possible candidate for those chunks if the CPU falls behind. It would be neat to see something like Cinebench 10, where the rendering threads pick up remaining portions of the scene as they finish, but from the sound of it the overhead of spawning threads in that manner might make that impossible or detrimental.

    --Matt
    Last edited by mattkosem; 01-07-2011 at 04:41 AM.

  20. #95
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    GTX260 beats the 5870, with or without the CPU.

    --Matt
    Attached Images

  21. #96
    c[_]
    Join Date
    Nov 2002
    Location
    Alberta, Canada
    Posts
    18,728
    Dual 6c/12t Westmere @ 3.6GHz (19x190) gets 211.24
    Last edited by STEvil; 01-09-2011 at 02:54 AM.


  22. #97
    Xtreme Member
    Join Date
    Aug 2007
    Posts
    282
    Quote Originally Posted by mattkosem View Post
    GTX260 beats the 5870, with or without the CPU.
    Yep, I still must optimize a few things for ATI but, due to bugs and problems with the ATI Stream SDK's tools, I'm not able to profile the program correctly. For instance, my kernels hang their SKA tool. It's really hard to do this without proper tools.

    On the other hand, notice that ATI's shaders run at a much lower clock than NVIDIA's: Cypress at 850MHz vs the GTX260 at 1242MHz... so, proportionally, ATI's performance is not really that bad.

    Remember also that NVIDIA is a few years ahead of ATI in GPGPU thanks to CUDA, so their compilers and drivers are probably more optimized. In fact, many professional products like VRay-RT simply decided not to support ATI's OpenCL because of this. Others just use CUDA, which is much more mature and optimized ( Octane, Arion, iray, etc. ).

    In my opinion, ATI's VLIW registers are too wide: secondary rays will flush them too much on branching. I think NVIDIA's scalar architecture is better suited for ray tracing.

    Quote Originally Posted by STEvil View Post
    Dual 6c/12t Westmere @ 3.6GHz (19x190) gets 211.24
    I'm curious what a Sandy Bridge could do
    Last edited by jogshy; 01-09-2011 at 01:56 PM.

  23. #98
    Xtreme Member
    Join Date
    Jun 2005
    Location
    Bulgaria, Varna
    Posts
    447
    Quote Originally Posted by jogshy View Post
    In my opinion, ATI's VLIW registers are too wide: secondary rays will flush them too much on branching. I think NVIDIA's scalar architecture is better suited for ray tracing.
    AMD's architecture has a larger batch size -- 64 vs. 32 fragments for NV. That could be the reason for the slower dynamic branching performance.
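    A simplified OpenCL kernel fragment showing why the batch width matters ( Hit, trace_secondary and shade_diffuse are hypothetical names, not ratGPU's code ): if lanes within one batch disagree at the branch, the whole batch executes both paths with the inactive lanes masked off, and a 64-wide wavefront statistically catches mixed rays more often than a 32-wide warp.
    Code:
    /* Illustrative OpenCL C fragment; Hit, trace_secondary and
       shade_diffuse are hypothetical names, not ratGPU's code. */
    typedef struct { float4 p; float4 n; int reflective; } Hit;

    float4 trace_secondary(Hit h) { return h.n; } /* stub for illustration */
    float4 shade_diffuse(Hit h)   { return h.p; } /* stub for illustration */

    __kernel void shade(__global const Hit* hits, __global float4* out)
    {
        int i = get_global_id(0);

        /* Divergent branch: AMD executes this per 64-wide wavefront,
           NVIDIA per 32-wide warp. If even one lane in a batch takes
           the other side, the batch runs BOTH paths (lanes masked). */
        if (hits[i].reflective)
            out[i] = trace_secondary(hits[i]);
        else
            out[i] = shade_diffuse(hits[i]);
    }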

  24. #99
    Xtreme Member
    Join Date
    Mar 2005
    Posts
    310
    6950 CFX, 10.12a RC3 drivers, Stream 2.3 OpenCL... it hangs when I try to run it with ATI Stream GPU selected. Also, after talking to a driver modder about your app, he claims your software CPU module has bugs... might want to check that.

  25. #100
    Registered User
    Join Date
    Nov 2010
    Posts
    91
    Quote Originally Posted by mattkosem View Post
    GTX260 beats the 5870, with or without the CPU.

    --Matt
    This definitely seems to favor the NVIDIA GPUs. Pretty sure it's something to do with the compute units and/or shader speeds.

    I'll try the new version later when I have some time.
    v0.4.5e

