GPU bandwidth question

**Otis11** · 05-24-2012, 03:29 PM

Hey guys, so I came across a problem at work and figured you guys might know how to solve it.

We're currently porting a real-time application from using specially designed hardware to GPUs (testing currently - hoping to use more general purpose hardware). The performance is there: we're testing on a 580 and getting more than enough power for everything, some data sets are seeing 3-4x improvements over real time!

The issue arises in getting all of that info on and off the card - PCI2.0 specs are not up to par and we cannot change to PCI3.0 at this time. Anyone know any ways to boost the bandwidth by about 15%? That's the last piece of the puzzle.

Thanks for the help guys!

**D_A** · 05-24-2012, 04:21 PM

I know this is an odd thing to say on this forum but can you bump the PCI clock a few notches?

Aside form that, what CUDA driver version are you using? A different version might have more efficient data throughput.

This article has a discussion on factors affecting PCI bandwidth: http://www.design-reuse.com/articles...xpress-ip.html

This pdf might have some helpful techniques as well: http://h20000.www2.hp.com/bc/docs/su.../c03045563.pdf

**STEvil** · 05-24-2012, 06:18 PM

only thing you can do is bump the PCI-E clock speed or move to PCI-E 3

**Otis11** · 05-30-2012, 02:41 PM

Thanks guys! Bumping the PCI-E clock worked to a degree (or that's what I'm told, that's a different part of my project), but as it's running outside of spec they are reluctant to put it in a shipping product for tech support reasons. But it works for testing!

(Sorry, lost internet connection for a few days.

)

**Vinas** · 05-31-2012, 04:26 AM

With bandwidth problems you basically have a couple of choices: compression, shape, or just increase the bandwidth somehow. Sounds like increasing it really isn't an option. Is there a way to prioritize the flow of your data? Even if you could compress it you would probably reach diminishing returns very quickly...

**D_A** · 05-31-2012, 05:02 AM

Just a thought but how is the bandwidth being used? it is a continuous data stream or is it bursts? If it's in bursts can you do some scheduling to even the data flow out? Would it be possible to page on-card memory out to system RAM more aggressively and then page in required data earlier to give it more time, or conversely start returning completed data from the card earlier, possibly even before the entire data-set is finished?

I'm just spit-balling here, but the GPU is running at full 16x? The cards in question aren't choked down to 8x for some odd reason?

**joshd** · 05-31-2012, 07:13 AM

OpenCL or CUDA?

OpenCL allows pinned memory allocation, and then transfers happen via DMA. Allocate your device memory with the CL_MEM_ALLOC_HOST_PTR flag. I would presume CUDA provides similar functionality, but I've not used CUDA before.

**Otis11** · 05-31-2012, 04:38 PM

Originally Posted by Vinas

With bandwidth problems you basically have a couple of choices: compression, shape, or just increase the bandwidth somehow. Sounds like increasing it really isn't an option. Is there a way to prioritize the flow of your data? Even if you could compress it you would probably reach diminishing returns very quickly...

That 15% short is after shaping... and the data is a continuously-generated, real-time stream... think "trying to put image filters on 500 different HD TV channels at once... forever. Channels come in on the mobo, get passed to the GPU, GPU does it's work and passes it back...

Not the actual premise, but has the same challenges... although ours is a bit less compressible...

Originally Posted by D_A

Just a thought but how is the bandwidth being used? it is a continuous data stream or is it bursts? If it's in bursts can you do some scheduling to even the data flow out? Would it be possible to page on-card memory out to system RAM more aggressively and then page in required data earlier to give it more time, or conversely start returning completed data from the card earlier, possibly even before the entire data-set is finished?

I'm just spit-balling here, but the GPU is running at full 16x? The cards in question aren't choked down to 8x for some odd reason?

No, it's a full, continuous stream, so no clever scheduling is going to do it... and the card isn't reaching full "theoretical" 16x speeds, but I believe it's above 8x. I'll double check on that tomorrow and make sure - but I'd be surprised if he missed something like that, pretty bright guy working on it, but a bunch of bright guys are better than one bright guy.

Originally Posted by joshd

OpenCL or CUDA?

OpenCL allows pinned memory allocation, and then transfers happen via DMA. Allocate your device memory with the CL_MEM_ALLOC_HOST_PTR flag. I would presume CUDA provides similar functionality, but I've not used CUDA before.

CUDA, but I'm sure we can find something similar if this works... how does that lower overall data usage?

Oh, not that it matters (to my knowledge anyway), this is currently on an x58 board though...

**joshd** · 06-01-2012, 02:52 AM

Originally Posted by Otis11

CUDA, but I'm sure we can find something similar if this works... how does that lower overall data usage?

Oh, not that it matters (to my knowledge anyway), this is currently on an x58 board though...

It doesn't, using DMA simply allows you to maximise your use of the existing PCIe bandwidth. How much bandwidth do you see actually being used in profiling?

**Otis11** · 06-02-2012, 01:31 PM

I'll have to ask on Monday... i don't actually work on this project. I'm on a tangent project and just heard about the problems they were hitting so thought I'd look into it. Much more interesting that what i'm doing.

Thread: GPU bandwidth question

Thread Tools

Search Thread

Rate This Thread

Display

GPU bandwidth question

Bookmarks

Bookmarks

Posting Permissions