This thread assumes you have already figured out how to run BOINC/GPUGrid on your system; it is not intended to be a BOINC setup guide. If you would like some help with the base setup, please check out the great "everything you needed to know" thread by Otis11:
http://www.xtremesystems.org/forums/...d.php?t=246583
Issue: Currently, the Compute Capability (CC) 2.1 cards (GTX460, GTX560 Ti) run at a much lower GPU utilization than CC 2.0 cards (GTX465, GTX470, GTX480, GTX570, GTX580, GTX590). Not that I pretend to understand CUDA programming, but the core performance bottleneck comes down to the CC 2.1 design of 48 shaders per multiprocessor versus 32 per multiprocessor on CC 2.0, which a single ACEMD task can't keep fully occupied. If your card is not listed above then this thread probably isn't going to help you.
Goal: Maximize output from a GTX460 running GPUGrid applications on Windows by running 2 instances of ACEMD on a single card at the same time via a custom app_info.xml configuration file (rough sketch below).
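For anyone who wants to peek ahead, here is a sketch of the kind of app_info.xml I'm talking about. The key line is <count>0.5</count> under <coproc>, which tells BOINC each task needs only half a GPU, so the client will schedule two on the card at once. Fair warning: the app name, executable file name, version number, and plan class below are just examples, not gospel. Copy the exact values out of your own client_state.xml, because a mismatched app_info.xml will trash your cached work.

<app_info>
    <app>
        <name>acemd2</name> <!-- example name; use the one from your client_state.xml -->
    </app>
    <file_info>
        <name>acemd2_6.52_windows_intelx86__cuda31.exe</name> <!-- example file name -->
        <executable/>
    </file_info>
    <app_version>
        <app_name>acemd2</app_name>
        <version_num>652</version_num> <!-- example version -->
        <plan_class>cuda31</plan_class> <!-- example plan class -->
        <avg_ncpus>0.5</avg_ncpus>
        <max_ncpus>1.0</max_ncpus>
        <coproc>
            <type>CUDA</type>
            <count>0.5</count> <!-- half a GPU per task = 2 tasks per card -->
        </coproc>
        <file_ref>
            <file_name>acemd2_6.52_windows_intelx86__cuda31.exe</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>

The file goes in the GPUGrid project folder under the BOINC data directory (projects\www.gpugrid.net), and you need to exit and restart BOINC for it to take effect.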
Factors: Other factors are how the OS handles drivers (Vista and Win7 slow, XP fast), CPU speed and availability, and, perhaps the most challenging factor to try to control, the fact that GPUGrid has multiple different types of Work Units (WUs) that run internally with different input parameters (the experiments themselves), which in turn directly affect the efficiency of the overall GPUGrid application, ACEMD. I have also seen conflicting information on whether PCIe slot bandwidth can be a performance factor, but most of the information points to it being a non-issue. If I have time I will try to address that topic, but seriously, I have a life and there's only so much I can do.
<aside>: The previously accepted methodology was to simply turn networking off, make a backup of your BOINC data folder, and when the first run was complete, turn BOINC off, restore the backup, reconfigure, and run again, rinse and repeat until you had all configurations accounted for. This doesn't seem to work anymore in Win7, likely because I just haven't figured out which file(s) in the BOINC program folder need to be copied as well. The upshot is that within the same WU type the total runtimes are within a few percent, which doesn't really matter as we are looking for substantial increases. Anything less than 10% is probably not worth the effort, and anyone who thinks it is has probably already done all this testing anyway :-)
Baseline: All that being out of the way, I want to see if we can increase the efficiency of perhaps the most common OS configuration, Win7 x64. I am going to start with what I have come to think of as the best-balance approach for an i7-920 system: running with HT on, leaving one thread free for GPUGrid, and using the SWAN_SYNC environment variable to dedicate that free thread to GPUGrid (quick reference below).
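Quick reference for that config, since I'll keep coming back to it: SWAN_SYNC is just a system environment variable. Set it through Control Panel > System > Advanced system settings > Environment Variables, or from an elevated command prompt with "setx SWAN_SYNC 0 /M", then restart BOINC so ACEMD picks it up. As for leaving a thread free, the usual route is the "use at most X% of the processors" computing preference (87.5% = 7 of 8 threads on an i7-920 with HT), but a cc_config.xml in the BOINC data directory works too. A minimal sketch:

<cc_config>
    <options>
        <!-- schedule CPU work on only 7 of the 8 HT threads,
             leaving one free for GPUGrid -->
        <ncpus>7</ncpus>
    </options>
</cc_config>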
Off to set this up, which is going to take a while ... move the PC to the basement, break it out of (gasp) its case, tear down the cable management, swap cards and PSU (crap, I have to break down the other PC to do this ... WTF was I thinking putting them in cases and making them all nice and neat???).
Three cheers go out to our GPUGrid teammate sponsoring the hardware for these tests!
----------------------------------------------------------------
Baseline Win7x64: CPU (20*200) w/ HT ON, 1 thread free, SWAN_SYNC=0, running 1 instance of ACEMD
WU Type: KASHIF_HIVPR_wo and KASHIF_HIVPR_maca_wo:
Across 10 runs, there was only a 0.0143% runtime difference between the shortest and longest, so I'm not gonna bother with a stddev here.
Avg: 40251.28 seconds per WU; 87-90% GPU utilization
Sidebar: there are two different subtypes of HIV WUs, "wo" and "so".
The stat above was based on the "wo" subtype only: KASHIF_HIVPR_wo and KASHIF_HIVPR_maca_wo. Back-of-the-napkin calcs tell me the "so" subtype would take about an hour and 7 minutes longer per WU on a 460.
----------------------------------------------------------------
Test 1: Win7x64, CPU (20*200) w/ HT ON, 2 threads free, SWAN_SYNC=0, running 2 instances of ACEMD
Only halfway through the first set of 2-instance runs I can already call this a "no win", as there is no substantive improvement. The per-WU time doubled almost exactly, so throughput is unchanged: two WUs completing together in roughly 80,500 seconds works out to the same ~40,250 seconds per WU as the baseline. While GPU usage did go up into the mid-to-high 90s, I believe internal resource contention inside the GPU caused by running 2 instances negated the almost 10% utilization increase, for a net 0% gain in overall efficiency.
----------------------------------------------------------------