
Thread: ratGPU OpenCL raytracing benchmark

  1. #76
    Registered User
    Join Date
    May 2006
    Posts
    67
    @joshy,

    You have PM.

  2. #77
    Xtreme Member
    Join Date
    Aug 2007
    Posts
    282
    I just uploaded 0.4.6 for Windows:

    v0.4.6 Alpha
    - Extended the alpha to June 1, 2011.
    - Patched a problem with ATI cards that was causing some system hangs. Unfortunately, it has been “patched” and not really fixed: the patch avoids the hang, but at the cost of some rendering speed.
    - Removed the version number from the 3dsmax plug-in filename because it was causing some duplicate GUID messages. Newer versions will just overwrite the file.
    - Fixed a color mismatch between 3dsmax and the standalone renderer.
    - The 3dsmax plug-in now uses 3dsmax's gamma settings.
    - The gamma slider has been removed from ratGPU's config dialog.
    - Recompiled using the latest libraries.
    Last edited by jogshy; 01-04-2011 at 07:07 PM.

  3. #78
    Xtreme Member
    Join Date
    Aug 2007
    Posts
    282
    Ok! Version 0.4.7 has been released!
    ATI cards should now run at the same speed as previous versions ( but without hanging ).

    Go test it with your new 6970s, VIA/S3 eH1s and Sandy Bridges, pls
    Last edited by jogshy; 01-04-2011 at 07:11 PM.

  4. #79
    Registered User
    Join Date
    Dec 2010
    Posts
    12
    downloading....

    Any plans to add CUDA and MS DX11 DirectCompute implementations to compare against this OpenCL one?
    Hmm, would the CPU render be faster using Intel C++ with Intel MKL?
    It's nice to have more rendering benchmarks for work [not games], like ratGPU and Cinebench.
    Last edited by realbabilu; 01-06-2011 at 02:41 AM.

  5. #80
    Banned
    Join Date
    Oct 2006
    Posts
    963
    excellent work dude.... dl'ing....

  6. #81
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    Here's what I see on my 4GHz Thuban with ATI Stream (the 5870 is not being used, since it dies under even modest loads - crappy gen, terrible QC). Not sure why it shows both devices, since I checked the "Use this device" checkbox.

    --Matt
    Attached Images
    Last edited by mattkosem; 01-06-2011 at 04:41 AM.

  7. #82
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    Woohoo, 5th try's a charm. Cypress @ stock.

    --Matt
    Attached Images

  8. #83
    Banned
    Join Date
    Oct 2006
    Posts
    963
    @mattkosem - which driver are you using... the 10.9s?

    this bench locks up for me when I try to run it using 10.12...

    @jogshy - so do I need to install the 10.9s until ATI updates the Stream SDK?

    thanks again for the effort it must have taken to produce this bench...

  9. #84
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    Quote Originally Posted by purecain View Post
    @mattkosem - which driver are you using... the 10.9s?

    this bench locks up for me when I try to run it using 10.12...

    @jogshy - so do I need to install the 10.9s until ATI updates the Stream SDK?

    thanks again for the effort it must have taken to produce this bench...
    I'm using the 10.11s and Stream SDK 2.3. I was getting the same crashing on the 0.4.5 build. Are you running the latest one?

    --Matt

  10. #85
    Xtreme Member
    Join Date
    Aug 2007
    Posts
    282
    Not sure why it shows both devices, since I checked the "Use this device" checkbox.
    You can use multiple devices: just change the device using the combo box and check "Use this device" for each device you want to render with.
    For example, you can use hybrid rendering with CPU+GPU... or only the GPU... or only the CPU... or 3 GPUs ( yes, the test supports SLI/CrossFire ).
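    Host-side, that boils down to one command queue per checked device. A minimal OpenCL sketch ( illustrative only, not ratGPU's actual source ):
    Code:
    // Illustrative OpenCL host setup: one context spanning the selected
    // devices and one command queue per device, so each device can be
    // handed its own tiles. Not ratGPU's actual source.
    #include <CL/cl.h>
    #include <stdio.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id   devices[8];
        cl_uint        numDevices = 0;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &numDevices);

        // One context spanning the selected devices...
        cl_context ctx = clCreateContext(NULL, numDevices, devices, NULL, NULL, NULL);

        // ...and one queue per device: each queue gets its own tiles.
        cl_command_queue queues[8];
        for (cl_uint i = 0; i < numDevices; ++i)
            queues[i] = clCreateCommandQueue(ctx, devices[i], 0, NULL);

        printf("rendering on %u device(s)\n", numDevices);
        /* ...enqueue tile kernels on each queue... */

        for (cl_uint i = 0; i < numDevices; ++i)
            clReleaseCommandQueue(queues[i]);
        clReleaseContext(ctx);
        return 0;
    }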

    Quote Originally Posted by realbabilu View Post
    Any plans to add CUDA and MS DX11 DirectCompute implementations to compare against this OpenCL one?
    Yep, I plan to write CUDA and DirectCompute implementations too for comparison... but currently I'm focusing on OpenCL because it's portable and multivendor.

    Hmm, would the CPU render be faster using Intel C++ with Intel MKL?
    Maybe.
    The "ratGPU C++" implementation is currently compiled with VS 2008, which optimizes reasonably for both AMD and Intel CPUs, and it uses TBB; but it's really there just so the program can run on computers without OpenCL.
    For all other cases, the OpenCL implementation itself will optimize properly for the platform ( for instance, the Intel OpenCL SDK Alpha uses TBB 2.2, SSE3 intrinsics and the latest Intel compiler... NVIDIA uses an onion layer based on CUDA... and ATI uses an onion layer based on CAL, etc. ).

    so do I need to install the 10.9s until ATI updates the Stream SDK?
    Versions prior to the newest 0.4.7 may hang with ATI cards, but I think I solved the problem in 0.4.7 ( at least it does on the 5750 I use to develop ). So 0.4.7 should work ok with Catalyst 10.12 APP ( no need for the Stream SDK; the APP version already includes it ). AFAIK, Catalyst 10.11 is a bit buggy for OpenCL, so better use the 10.12 APP if you can.
    If ratGPU hangs, then please try Catalyst 10.9 and the Stream SDK v2.3 ( or the old 2.2 ).
    And, yep, OpenCL drivers are still a bit unstable... 8(

    Btw, anybody out there with an S3/Via eH1 card to test, pls?
    Last edited by jogshy; 01-06-2011 at 11:16 AM.

  11. #86
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    Quote Originally Posted by jogshy View Post
    You can use multiple devices: just change the device using the combo box and check "Use this device" for each device you want to render with.
    For example, you can use hybrid rendering with CPU+GPU... or only the GPU... or only the CPU... or 3 GPUs ( yes, the test supports SLI/CrossFire ).
    That much is clear. I'm not home now, but I was only intending to run with the CPU on that bench (mostly because my janky GPU crashes my PC constantly). If it really did use both, it's odd that it was so much slower that way. One would think that more processing power would yield a better score, not worse. That is, unless it is evenly dividing the workload across both of them. Does this app not load balance based on performance, like SmallptGPU?

    --Matt

  12. #87
    Xtreme Member
    Join Date
    Aug 2007
    Posts
    282
    Quote Originally Posted by mattkosem View Post
    If it really did use both, it's odd that it was so much slower that way. One would think that more processing power would yield a better score, not worse. That is, unless it is evenly dividing the workload across both of them. Does this app not load balance based on performance?
    Hybrid rendering ( like many heterogeneous tasks ) has a problem: a chain is only as strong as its weakest link. That's why, for instance, supercomputers based on MPICH2 don't like heterogeneous clusters.

    If the GPU is rendering very fast but the CPU is not, then the CPU will just slow down the process. On the other hand, with the CPU under heavy workload, the asynchronous kernel calls ( which are issued from a CPU thread ) and the PCI transfers ( DMA is a myth ) will slow down too.
    The program balances the workload, but if the image is not very large then the CPU will make everything slower ( because the last tile will be in a wait state almost all the time ). For big HD images it could be worth the effort, though.
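    A toy C++ simulation of that weakest-link effect ( the per-tile times are made-up numbers, not measurements ): a static half/half split is gated by the slow device, while pulling tiles from a shared queue lets the fast one absorb the slack.
    Code:
    #include <algorithm>
    #include <cstdio>

    int main()
    {
        const int tiles = 16;
        const double gpuTileMs = 10.0, cpuTileMs = 80.0; // hypothetical per-tile times

        // Static 50/50 split: each device renders half the tiles on its own,
        // so the frame takes as long as the slowest half.
        double staticMs = std::max((tiles / 2) * gpuTileMs, (tiles / 2) * cpuTileMs);

        // Shared queue: each tile goes to whichever device would finish it first.
        int cpuTiles = 0, gpuTiles = 0;
        double cpuBusy = 0, gpuBusy = 0;
        for (int t = 0; t < tiles; ++t) {
            if (gpuBusy + gpuTileMs <= cpuBusy + cpuTileMs) { gpuBusy += gpuTileMs; ++gpuTiles; }
            else                                            { cpuBusy += cpuTileMs; ++cpuTiles; }
        }
        double dynamicMs = std::max(cpuBusy, gpuBusy);

        // Prints: static split: 640 ms, shared queue: 150 ms (gpu 15 / cpu 1 tiles)
        std::printf("static split: %.0f ms, shared queue: %.0f ms (gpu %d / cpu %d tiles)\n",
                    staticMs, dynamicMs, gpuTiles, cpuTiles);
        return 0;
    }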
    Last edited by jogshy; 01-06-2011 at 02:19 PM.

  13. #88
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    Quote Originally Posted by jogshy View Post
    Hybrid rendering ( like many heterogeneous tasks ) has a problem: a chain is only as strong as its weakest link. That's why, for instance, supercomputers based on MPICH2 don't like heterogeneous clusters.

    If the GPU is rendering very fast but the CPU is not, then the CPU will just slow down the process. On the other hand, with the CPU under heavy workload, the asynchronous kernel calls ( which are issued from a CPU thread ) and the PCI transfers ( DMA is a myth ) will slow down too.
    The program balances the workload, but if the image is not very large then the CPU will make everything slower ( because the last tile will be in a wait state almost all the time ). For big HD images it could be worth the effort, though.
    Actually, it doesn't seem to be as efficient as you describe. It seems to spool off the tiles in pairs, but the next pair doesn't start rendering until both finish: one sits at 99% while the other finishes. If the tiles were smaller and it didn't wait like that, it would probably not be such a big slowdown.

    Could this be related to using the CPU and GPU through Stream? I haven't tried using the built-in CPU renderer with the Stream GPU renderer. I'll give it a try when I have a chance.

    --Matt

  14. #89
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    Yeah, it definitely seems to be related to running with the CPU and GPU via Stream. If I separate the CPU off with the C++ renderer, I don't have the problem and the bench finishes drastically faster. Unfortunately, the Stream CPU renderer is much faster than the C++ renderer, so there's some efficiency loss. Smaller tiles would definitely still pick up the pace when running with non-matching processors, though: the processor would only have to complete a much smaller workload at a time, which would give the GPU more opportunities to pick up the slack.

    --Matt
    Attached Images
    Last edited by mattkosem; 01-06-2011 at 05:10 PM.

  15. #90
    Registered User
    Join Date
    Dec 2010
    Posts
    12
    Maybe adding GotoBLAS or ACML with VC 2008 could add more power to the CPU render, so the difference between CPU and GPU won't be so big.

  16. #91
    Xtreme Member
    Join Date
    Aug 2007
    Posts
    282
    Quote Originally Posted by mattkosem View Post
    If the tiles were smaller and it didn't wait like that, it would probably not be such a big slowdown.
    Well, the GPU tiles must be big or I won't get enough data to fill the stream processors properly, hehe.

    Consider that Fermi has 30 multiprocessors, and to get full occupancy you'd need 30 multiprocessors x 48 max active warps/multiprocessor x 32 threads/warp = 46,080 threads, which is quite close to the 256x256 = 65,536 threads of a tile.
    ( I must use a greater value for future architectures ).
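    Spelled out in code, using the figures from the post ( SM counts and warp limits vary by chip; these are just the numbers quoted above ):
    Code:
    #include <stdio.h>

    int main(void)
    {
        /* jogshy's occupancy figures; values vary by architecture. */
        const int multiprocessors = 30;
        const int warpsPerSM      = 48;   /* max active warps per multiprocessor */
        const int threadsPerWarp  = 32;

        const int fullOccupancy = multiprocessors * warpsPerSM * threadsPerWarp; /* 46080 */
        const int tileThreads   = 256 * 256;                                     /* 65536 */

        printf("threads for full occupancy: %d\n", fullOccupancy);
        printf("threads per 256x256 tile:   %d\n", tileThreads);
        /* 65536 >= 46080, so a single tile is enough work to saturate the GPU. */
        return 0;
    }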

    However, I could try to use heterogeneous tile sizes when hybrid rendering is enabled: smaller for CPUs and big for GPUs... I definitely must investigate that.

    Could this be related to using the CPU and GPU through Stream?
    It might be. I read something on AMD's forums about DRDMA not working properly ( and, therefore, blocking the kernel invocations ).
    Try with the evil green goblin / dark green side of the force and see

    the Stream CPU renderer is much faster than the C++ renderer, so there's some efficiency loss
    Yep, that's probably because VS2008 optimizes in a way that's good for both Intel and AMD ( probably something in the middle, like a Pentium 4 ). I bet AMD's Stream CPU implementation is optimized only for AMDs, just as Intel's OpenCL SDK is optimized for Intels.
    On the other hand, LLVM does a very good job optimizing, while VS2008's compiler is a bit old ( and not very good at x64 )... but I cannot recompile using VS2010 yet because some libraries I use aren't ready for it.
    I'm also using TBB for the C++ renderer, which is really an Intel library ( probably not really optimized for AMD ).

    Well, my C++ implementation was designed just to catch bugs and to provide a compatibility layer for people without OpenCL. I don't think it's worth the effort to optimize it, especially when both Intel and AMD are providing OpenCL implementations.

    Btw, somebody with a Via eH1 or one of those amazing mini-ITX Zacate APUs to test, pls? And whereeeeeeeee are those Sandy monsters !!!
    Last edited by jogshy; 01-06-2011 at 06:57 PM.

  17. #92
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    Can you thread the multiprocessors individually, giving each bank a smaller chunk of the task?

    --Matt

  18. #93
    Xtreme Member
    Join Date
    Aug 2007
    Posts
    282
    Quote Originally Posted by mattkosem View Post
    Can you thread the multiprocessors individually, giving each bank a smaller chunk of the task?
    --Matt
    I don't think so. The whole data grid is automatically distributed across all the GPU's multiprocessors. I can neither "device-fission" an AMD GPU ( apparently it only works for CPUs ) nor assign task priorities.
    Another alternative is to fire several "incomplete" kernels asynchronously to compensate for the low # of threads, but the problem is that Fermi only allows 2 kernels to run simultaneously ( in ATI's case, one ).
    There's also another problem: each kernel launch takes a lot of time, so firing a lot of small kernels takes much more time than firing one big data chunk ( probably due to how PCIe works ).
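    In OpenCL terms the overhead point looks like this ( a sketch, assuming a valid queue and kernel already exist ): the total work is identical, but the second variant pays the fixed submission cost 64 times.
    Code:
    #include <CL/cl.h>

    /* Sketch only: same total work, very different launch overhead. */
    void launch_comparison(cl_command_queue queue, cl_kernel kernel)
    {
        /* One big NDRange: one fixed submission cost. */
        size_t big = 256 * 256;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &big, NULL, 0, NULL, NULL);

        /* Same work as 64 small launches: 64x the fixed submission cost,
           and serialized if the device can't overlap kernel execution. */
        size_t small = 32 * 32;
        for (int i = 0; i < 64; ++i) {
            size_t offset = (size_t)i * small;
            clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &small, NULL, 0, NULL, NULL);
        }
        clFinish(queue);
    }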
    Last edited by jogshy; 01-06-2011 at 08:12 PM.

  19. #94
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    Quote Originally Posted by jogshy View Post
    I don't think so. The whole data grid is automatically distributed across all the GPU's multiprocessors. I can neither "device-fission" an AMD GPU ( apparently it only works for CPUs ) nor assign task priorities.
    Another alternative is to fire several "incomplete" kernels asynchronously to compensate for the low # of threads, but the problem is that Fermi only allows 2 kernels to run simultaneously ( in ATI's case, one ).
    There's also another problem: each kernel launch takes a lot of time, so firing a lot of small kernels takes much more time than firing one big data chunk ( probably due to how PCIe works ).
    Ah, hardware or API limitations; that stinks. Have you taken a look at SmallptGPU2 (http://davibu.interfree.it/opencl/sm...lptGPU2.html)? It does a brief pre-render with the workload spread evenly across all supported devices to assess their performance, then splits the workload accordingly based on its findings. Since scene complexity isn't consistent across the whole picture it can never be perfect, but it definitely adds performance instead of taking it away. Food for thought. He made a different balancing solution for SmallLuxGPU as well, but I think the approach there is too far from your type of workload to be applicable.
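    ( That balancing reduces to a proportional split; a toy C++ version, with the per-device speeds as assumed inputs from such a pre-render pass, purely illustrative: )
    Code:
    #include <cstdio>
    #include <vector>

    int main()
    {
        const int totalRows = 512;
        // Assumed outputs of a short pre-render benchmark, e.g. GPU and CPU.
        std::vector<double> samplesPerSec = { 90.0, 30.0 };

        double totalSpeed = 0;
        for (double s : samplesPerSec) totalSpeed += s;

        // Hand out rows in proportion to measured speed;
        // the last device takes the rounding remainder.
        int assigned = 0;
        for (size_t i = 0; i < samplesPerSec.size(); ++i) {
            int rows = (i + 1 == samplesPerSec.size())
                     ? totalRows - assigned
                     : int(totalRows * samplesPerSec[i] / totalSpeed);
            assigned += rows;
            std::printf("device %zu -> %d rows\n", i, rows); // 384 / 128 here
        }
        return 0;
    }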

    Breaking off smaller chunks for the CPU would probably bring the same benefits, as long as the GPU is a possible candidate for those chunks if the CPU falls behind. It would be neat to see something like Cinebench 10, where the rendering threads pick up remaining portions of the scene as they finish, but from the sound of it the overhead of spawning threads in that manner might make that impossible or detrimental.

    --Matt
    Last edited by mattkosem; 01-07-2011 at 04:41 AM.

  20. #95
    Xtreme Addict
    Join Date
    Jul 2008
    Location
    US
    Posts
    1,379
    GTX260 beats the 5870, with or without the CPU.

    --Matt
    Attached Images

  21. #96
    c[_]
    Join Date
    Nov 2002
    Location
    Alberta, Canada
    Posts
    18,728
    Dual 6c/12t Westmere @ 3.6GHz (19x190) gets 211.24
    Last edited by STEvil; 01-09-2011 at 02:54 AM.


  22. #97
    Xtreme Member
    Join Date
    Aug 2007
    Posts
    282
    Quote Originally Posted by mattkosem View Post
    GTX260 beats the 5870, with or without the CPU.
    Yep, I still must optimize a few things for ATI but, due to bugs and problems with the ATI Stream SDK's tools, I'm not able to profile the program correctly. For instance, my kernels hang their SKA tool. It's really hard to do this without proper tools.

    On the other hand, notice that ATI's shaders run at a much lower clock than NVIDIA's: Cypress at 850MHz vs the GTX260 at 1242MHz... so, proportionally, ATI's performance is not really that bad.

    Remember also that NVIDIA is a few years ahead of ATI in GPGPU thanks to CUDA, so their compilers and drivers are probably more optimized. In fact, many professional products like VRay-RT simply decided not to support ATI's OpenCL because of this. Others just use CUDA, which is much more mature and optimized ( Octane, Arion, iray, etc. ).

    In my opinion, ATI's VLIW registers are too wide: secondary rays will flush them too much on branching. I think NVIDIA's scalar architecture is better suited for ray tracing.

    Quote Originally Posted by STEvil View Post
    Dual 6c/12t Westmere @ 3.6GHz (19x190) gets 211.24
    I'm curious what a Sandy Bridge could do
    Last edited by jogshy; 01-09-2011 at 01:56 PM.

  23. #98
    Xtreme Member
    Join Date
    Jun 2005
    Location
    Bulgaria, Varna
    Posts
    447
    Quote Originally Posted by jogshy View Post
    In my opinion, ATI's VLIW registers are too wide: secondary rays will flush them too much on branching. I think NVIDIA's scalar architecture is better suited for ray tracing.
    AMD's architecture has a larger batch size -- 64 vs. 32 fragments for NV. That could be the reason for the slower dynamic branching performance.
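    A simplified OpenCL kernel fragment showing why the batch width matters ( Hit, trace_secondary and shade_diffuse are hypothetical names, not ratGPU's code ): if lanes within one batch disagree at the branch, the whole batch executes both paths with the inactive lanes masked off, and a 64-wide wavefront statistically catches mixed rays more often than a 32-wide warp.
    Code:
    /* Illustrative OpenCL C fragment; Hit, trace_secondary and
       shade_diffuse are hypothetical names, not ratGPU's code. */
    typedef struct { float4 p; float4 n; int reflective; } Hit;

    float4 trace_secondary(Hit h) { return h.n; } /* stub for illustration */
    float4 shade_diffuse(Hit h)   { return h.p; } /* stub for illustration */

    __kernel void shade(__global const Hit* hits, __global float4* out)
    {
        int i = get_global_id(0);

        /* Divergent branch: AMD executes this per 64-wide wavefront,
           NVIDIA per 32-wide warp. If even one lane in a batch takes
           the other side, the batch runs BOTH paths (lanes masked). */
        if (hits[i].reflective)
            out[i] = trace_secondary(hits[i]);
        else
            out[i] = shade_diffuse(hits[i]);
    }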

  24. #99
    Xtreme Member
    Join Date
    Mar 2005
    Posts
    310
    6950 CFX, 10.12a RC3 drivers, Stream 2.3 OpenCL... it hangs when I try to run it with ATI Stream GPU selected. Also, after talking to a driver modder about your app, he claims your software CPU module has bugs... might want to check that.

  25. #100
    Registered User
    Join Date
    Nov 2010
    Posts
    91
    Quote Originally Posted by mattkosem View Post
    GTX260 beats the 5870, with or without the CPU.

    --Matt
    This definitely seems to favor the NVIDIA GPUs. Pretty sure it's something to do with the compute units and/or shader speeds.

    I'll try the new version later when I have some time.
    v0.4.5e

