
Thread: GPU bandwidth question

  1. #1
    Xtreme Cruncher
    Join Date
    Dec 2008
    Location
    Texas
    Posts
    5,152

    GPU bandwidth question

    Hey guys, so I came across a problem at work and figured you guys might know how to solve it.

We're currently porting a real-time application from specially designed hardware to GPUs (testing currently - hoping to move to more general-purpose hardware). The performance is there: we're testing on a 580 and getting more than enough power for everything; some data sets are seeing 3-4x improvements over real time!

The issue is getting all of that data on and off the card - PCIe 2.0 bandwidth is not up to par, and we can't move to PCIe 3.0 at this time. Anyone know any ways to boost the bandwidth by about 15%? That's the last piece of the puzzle.

    Thanks for the help guys!


    24 hour prime stable? Please, I'm 24/7/365 WCG stable!

    So you can do Furmark, Can you Grid???

  2. #2
    Xtreme crazy bastid
    Join Date
    Apr 2007
    Location
    On mah murder-sickle!
    Posts
    5,878
    I know this is an odd thing to say on this forum but can you bump the PCI clock a few notches?

Aside from that, what CUDA driver version are you using? A different version might have more efficient data throughput.

    This article has a discussion on factors affecting PCI bandwidth: http://www.design-reuse.com/articles...xpress-ip.html

    This pdf might have some helpful techniques as well: http://h20000.www2.hp.com/bc/docs/su.../c03045563.pdf


  3. #3
    c[_]
    Join Date
    Nov 2002
    Location
    Alberta, Canada
    Posts
    18,728
Only thing you can do is bump the PCI-E clock speed or move to PCI-E 3.0.

    All along the watchtower the watchmen watch the eternal return.

  4. #4
    Xtreme Cruncher
    Join Date
    Dec 2008
    Location
    Texas
    Posts
    5,152
Thanks guys! Bumping the PCI-E clock worked to a degree (or so I'm told - that's a different part of my project), but as it's running outside of spec they're reluctant to put it in a shipping product for tech-support reasons. But it works for testing!

(Sorry, lost internet connection for a few days.)



  5. #5
    Xtreme Addict
    Join Date
    Jan 2007
    Location
    Michigan
    Posts
    1,785
With bandwidth problems you basically have a few choices: compression, shaping, or just increasing the bandwidth somehow. Sounds like increasing it really isn't an option. Is there a way to prioritize the flow of your data? Even if you could compress it you would probably reach diminishing returns very quickly...
    Last edited by Vinas; 05-31-2012 at 04:29 AM.
Current: AMD Threadripper 1950X @ 4.2GHz / EK Supremacy / 360 EK Rad, EK-DBAY D5 PWM, 32GB G.Skill 3000MHz DDR4, AMD Vega 64 Wave, Samsung NVMe SSDs
Prior Build: Core i7 7700K @ 4.9GHz / Apogee XT/120.2 Magicool rad, 16GB G.Skill 3000MHz DDR4, AMD Sapphire RX 580 8GB, Samsung 850 Pro SSD

    Intel 4.5GHz LinX Stable Club

    Crunch with us, the XS WCG team

  6. #6
    Xtreme crazy bastid
    Join Date
    Apr 2007
    Location
    On mah murder-sickle!
    Posts
    5,878
Just a thought, but how is the bandwidth being used? Is it a continuous data stream or does it come in bursts? If it's in bursts, can you do some scheduling to even the data flow out? Would it be possible to page on-card memory out to system RAM more aggressively and then page in required data earlier to give it more time, or conversely start returning completed data from the card earlier, possibly even before the entire data set is finished?

I'm just spit-balling here, but is the GPU running at full 16x? The cards in question aren't choked down to 8x for some odd reason?


  7. #7
    Xtreme Cruncher
    Join Date
    Mar 2006
    Posts
    613
    OpenCL or CUDA?

    OpenCL allows pinned memory allocation, and then transfers happen via DMA. Allocate your device memory with the CL_MEM_ALLOC_HOST_PTR flag. I would presume CUDA provides similar functionality, but I've not used CUDA before.
    Last edited by joshd; 05-31-2012 at 07:16 AM.

  8. #8
    Xtreme Cruncher
    Join Date
    Dec 2008
    Location
    Texas
    Posts
    5,152
    Quote Originally Posted by Vinas View Post
With bandwidth problems you basically have a few choices: compression, shaping, or just increasing the bandwidth somehow. Sounds like increasing it really isn't an option. Is there a way to prioritize the flow of your data? Even if you could compress it you would probably reach diminishing returns very quickly...
That 15% short is after shaping... and the data is a continuously-generated, real-time stream. Think "trying to put image filters on 500 different HD TV channels at once... forever." Channels come in on the mobo, get passed to the GPU, the GPU does its work and passes it back...

Not the actual premise, but it has the same challenges... although ours is a bit less compressible...

    Quote Originally Posted by D_A View Post
Just a thought, but how is the bandwidth being used? Is it a continuous data stream or does it come in bursts? If it's in bursts, can you do some scheduling to even the data flow out? Would it be possible to page on-card memory out to system RAM more aggressively and then page in required data earlier to give it more time, or conversely start returning completed data from the card earlier, possibly even before the entire data set is finished?

I'm just spit-balling here, but is the GPU running at full 16x? The cards in question aren't choked down to 8x for some odd reason?
No, it's a full, continuous stream, so no clever scheduling is going to do it... and the card isn't reaching full "theoretical" 16x speeds, but I believe it's above 8x. I'll double-check on that tomorrow and make sure - I'd be surprised if he missed something like that; it's a pretty bright guy working on it, but a bunch of bright guys are better than one bright guy.

    Quote Originally Posted by joshd View Post
    OpenCL or CUDA?

    OpenCL allows pinned memory allocation, and then transfers happen via DMA. Allocate your device memory with the CL_MEM_ALLOC_HOST_PTR flag. I would presume CUDA provides similar functionality, but I've not used CUDA before.
    CUDA, but I'm sure we can find something similar if this works... how does that lower overall data usage?

    Oh, not that it matters (to my knowledge anyway), this is currently on an x58 board though...
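For reference, the CUDA-side equivalent of what joshd described appears to be cudaHostAlloc (or cudaMallocHost) plus cudaMemcpyAsync on a stream. A rough, untested sketch - the chunk size and everything around the kernel launch are placeholders:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

#define CHUNK (8 * 1024 * 1024)   /* made-up per-transfer size */

int main(void) {
    unsigned char *h_buf, *d_buf;
    cudaStream_t stream;

    /* Pinned (page-locked) host memory: the copy can run as a true DMA
       instead of being staged through a driver-internal bounce buffer. */
    cudaHostAlloc((void **)&h_buf, CHUNK, cudaHostAllocDefault);
    cudaMalloc((void **)&d_buf, CHUNK);
    cudaStreamCreate(&stream);

    /* Async copy from pinned memory; returns immediately so the host can
       keep filling the next buffer while the DMA engine works. */
    cudaMemcpyAsync(d_buf, h_buf, CHUNK, cudaMemcpyHostToDevice, stream);
    /* ... launch the kernel in the same stream, then copy results back ... */
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

To be clear, this doesn't shrink the data; plain cudaMemcpy from pageable memory gets staged through an internal pinned buffer first, so allocating pinned memory up front skips that extra copy and typically gets you closer to the bus's real limit.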



  9. #9
    Xtreme Cruncher
    Join Date
    Mar 2006
    Posts
    613
    Quote Originally Posted by Otis11 View Post
    CUDA, but I'm sure we can find something similar if this works... how does that lower overall data usage?

    Oh, not that it matters (to my knowledge anyway), this is currently on an x58 board though...
    It doesn't, using DMA simply allows you to maximise your use of the existing PCIe bandwidth. How much bandwidth do you see actually being used in profiling?

  10. #10
    Xtreme Cruncher
    Join Date
    Dec 2008
    Location
    Texas
    Posts
    5,152
I'll have to ask on Monday... I don't actually work on this project. I'm on a tangent project and just heard about the problems they were hitting, so I thought I'd look into it. Much more interesting than what I'm doing.


