
Thread: Power Issues = EUEs with Nvidia?

  1. #1
    Xtreme Cruncher
    Join Date
    May 2007
    Location
    CA
    Posts
    1,885

    Power Issues = EUEs with Nvidia?

    "MARVIN THE MARTIAN" suggested everyone read this in another thread.
    (Can't remember which one or I'd link the post.)

    It rates its own thread, as many avoid the F@H forums like the plague.

    Posted by Dr. Pande 11/7/08 in "News"

    "Culprit found in NV core v 1.15 issues for certain hardware?

    Engineers at NVIDIA (notably Scott LeGrand) have come up with a theory for the EUE's seen in core 1.15 (and a few others in the 1.15 to 1.18 range) on certain hardware. They found that this core had code optimizations that drove the GPU so hard that it would draw a lot more electricity (one sign of this was running hotter). In some boxes, this was too much electricity and this led to numerical instabilities. When the same machine was given a beefier power supply, the problem went away.

    We've been told that 8800's require 600W power supplies, but we're finding that even a little bigger (eg at least 650W) is important to leave some room for error. We are working to see if there is some way to detect this issue in software, but for now, if you're getting EUE's on the NV GPU client, this is something to consider.

    By the way, this will be very important for us to consider future code optimizations. NV core v1.19 removed some optimizations to solve this problem, but there are many cards which would run fine w/this more optimized code. If we can find a way to detect whether the card can draw enough power, we may be able to choose different code paths to allow for greater optimization for cards which can handle it.

    We're still looking into this. For now, if you're seeing issues with your card, please consider trying out a bigger power supply. We will continue to look to see if this is indeed the problem and what we can do to help the situation such that the code runs stably on all machines."
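
The sizing advice in the quoted announcement (a 600W requirement, with at least 650W suggested for margin) can be put in numbers. This is only a rough sketch: `headroom_percent` is a hypothetical helper name, and the wattages are the ones quoted in this thread, not measured draw.

```python
def headroom_percent(required_w: float, installed_w: float) -> float:
    """Spare PSU capacity as a percentage of the stated requirement."""
    return (installed_w - required_w) / required_w * 100


# 650W installed against the 600W requirement quoted for the 8800 series
print(round(headroom_percent(600, 650), 1))   # 8.3

# A 1000W unit against the same 600W requirement
print(round(headroom_percent(600, 1000), 1))  # 66.7
```

Nameplate wattage says nothing about rail quality or the card's own power circuitry, so a figure like this is a starting point at best.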

    Oddly, one of my machines that had EUE problems also had the best (80 AMP 12v single rail) PSU. Core 1.18 helped with the EUEs on one core of a 9800gx2 on that rig. All my PSUs are 600W or better. Posted as an FYI and a FWIW. Ummm.. credit to MTM for seeing and reporting this first. We all have our hot buttons and I don't want to push anyone's.
    Last edited by WFO; 11-07-2008 at 04:11 PM.
    Cooler Master HAF 942
    Sabertooth X79
    Win7 64
    3960X @ 4805 1.376 v-core
    32GB DDR3 1866 G.SKILL Ripjaws Z
    OCZ RevoDrive 3 series RVD3-FHPX4-120G PCI-E 120GB
    3 X 6T Raid 0 Hitachi Storage
    Thermaltake Tough Power 1200
    1 HD 7970

    F@H badge by xoqolat



  2. #2
    Xtreme Cruncher
    Join Date
    Jan 2008
    Posts
    1,169
    Quote Originally Posted by WFO View Post
    All my PSUs are 600 amp or better.
    OMG that would be 66,000 watt power supply!!
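
For anyone checking the arithmetic, power is volts times amps. A minimal sketch, assuming the 66,000 watt figure comes from multiplying 600 A by a 110 V mains voltage (the post does not say which voltage was used); `watts` is just an illustrative helper:

```python
def watts(volts: float, amps: float) -> float:
    """Electrical power in watts: P = V * I."""
    return volts * amps


# 600 A at an assumed 110 V mains voltage
print(watts(110, 600))  # 66000

# The same 600 A if it were delivered on a 12 V rail
print(watts(12, 600))   # 7200

# The 80 A single 12 V rail mentioned in the first post
print(watts(12, 80))    # 960
```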


    "[crunching is] a minor service to humanity as a side effect of our collective hardware fetish" - Blauhung

  3. #3
    Xtreme Cruncher
    Join Date
    May 2007
    Location
    CA
    Posts
    1,885
    Quote Originally Posted by coo-coo-clocker View Post
    OMG that would be 66,000 watt power supply!!


    LOL... Sorry. It should have read 600w. As posted in the race thread it's Friday and tonight's libations are kicking in.

    Edit: Why oh why do I have the feeling only Dak would understand my gaffe???? Original post edited for clarity.
    Last edited by WFO; 11-07-2008 at 04:12 PM.



  4. #4
    Attack Dachshund
    Join Date
    Jul 2007
    Location
    South Carolina USA
    Posts
    3,161
    yeah, that is definitely worthy. especially for those with a high eue rate...

    but I see the same small 10% eue rate on all mine, and my main pc has a HX520. it used to have a powerhog 2900pro in it and never had power issues.. can't imagine it has power issues with a single 9800gt and 2 hard drives with a low voltage oc on a Q6600....

    plus is there any way a card can exceed its tdp without volt modding... a piece of software like the folding client just can't magically cause a piece of hardware to enter the twilight zone and double its tdp.. that's pure fantasy... -maybe in the twilight zone I guess, but not on earth.. I mean I can understand this comment being directed to people that have cheap or under powered systems... but how many of us at XS have under powered systems on weak psu's...

    that is a relevant point. but it doesn't explain the average 10% eue rate they see across the population of folders... that's software based errors..

    and I know some people think they don't get eue's because they don't check their logs thoroughly enough. but if they ran Marvin's FAHWATCH program they'd see the light.
    here's some tdp's of common cards.
    http://en.wikipedia.org/wiki/GeForce_8 http://en.wikipedia.org/wiki/GeForce_9_Series http://en.wikipedia.org/wiki/GeForce_200_Series
    [Attached image: 1.jpg]
    Last edited by MikeB12; 11-07-2008 at 04:19 PM.

  5. #5
    Attack Dachshund
    Join Date
    Jul 2007
    Location
    South Carolina USA
    Posts
    3,161
    smells like another Pande excuse to me...

  6. #6
    Da Goose
    Join Date
    Oct 2005
    Location
    Chicago
    Posts
    4,913
    Good Point Mike...Let's get angra on this asap. I am tired of EUE's and 5748's.


    i7-860 Farm with nVidia GPU's

  7. #7
    Attack Dachshund
    Join Date
    Jul 2007
    Location
    South Carolina USA
    Posts
    3,161
    here's a little insider tip from the "subforum that shall not be named" at the bottom of the fah forums.....

    Q: Do you know why they have released a2 core for linux and not windows for smp?

    A: Because they can't get a2 stable in windows.



    a2 core includes optimizations for the smp client that allow it to more efficiently use the cpu cycles available to it. and they can't get the software stable in windows, yet it works in linux...
    this is why we get more ppd out of dual smp on intel quads, because we're throwing 2 threads at each core with a1... one thread of a1 per core is not powerful enough to get at all the cycles. it has nothing to do with power, but has everything to do with a weakness in the software... yet to fix it they need to update to a2 core, which chokes and pukes in windows.

    imo, this is the same general scenario that's going on with the gpu core optimizations. it's not about power... we still get the same eue rate at stock clocks. that my friend is a software inadequacy, not a hardware problem or power problem.

  8. #8
    Xtreme Cruncher
    Join Date
    May 2007
    Location
    CA
    Posts
    1,885
    Quote Originally Posted by MikeB12 View Post
    here's a little insider tip from the "subforum that shall not be named" at the bottom of the fah forums.....

    Q: Do you know why they have released a2 core for linux and not windows for smp?

    A: Because they can't get a2 stable in windows.



    a2 core includes optimizations for the smp client that allow it to more efficiently use the cpu cycles available to it. and they can't get the software stable in windows, yet it works in linux...
    this is why we get more ppd out of dual smp on intel quads, because we're throwing 2 threads at each core with a1... one thread of a1 per core is not powerful enough to get at all the cycles. it has nothing to do with power, but has everything to do with a weakness in the software... yet to fix it they need to update to a2 core, which chokes and pukes in windows.

    imo, this is the same general scenario that's going on with the gpu core optimizations. it's not about power... we still get the same eue rate at stock clocks. that my friend is a software inadequacy, not a hardware problem or power problem.



  9. #9
    Xtreme Cruncher
    Join Date
    Jun 2005
    Location
    Northern VA
    Posts
    1,285
    Mike, what are you trying to say?? that if i got my hands on a Linux client, like the free one from Ubuntu 64 bit, and installed it on 2 of my 4400x2, and my 5420, it would be a lot better than windows??

    will i also be able to gpu fold with linux as well? does the cuda and everything else work with it as well?
    Its not overkill if it works.


  10. #10
    Attack Dachshund
    Join Date
    Jul 2007
    Location
    South Carolina USA
    Posts
    3,161
    yes, you'll get more smp ppd in linux with a1 core and a good bit more with a2 core. my x2 4600 when I ran linux got about 1000ppd on a1 and 1700 on a2..

    but there is no gpu linux client... so you have to use wine. don't ask me how, I'm a linux noob.. all I know how to do is install the os, enable samba for fahmon, and install smp...
    wine is a creature I've never seen before... I think Shadow or one of the other guys knows, there's a thread around here somewhere..

    I run all Vista rigs, Q6600's-dual smp (3000ppd), and nvidia gpu (5000ppd).

    RR runs a linux smp folder on a 4ghz QX9650 and gets like 8000ppd out of the cpu alone.


    BUT, what I was demonstrating with my post above is PG and FAH have issues with getting their optimizations stable in windows for GPU and SMP.. I think that power excuse by Pande in the op is a reach for an excuse.
    Last edited by MikeB12; 11-07-2008 at 04:54 PM.

  11. #11
    Xtreme Cruncher
    Join Date
    May 2007
    Location
    Conroe, Texas
    Posts
    3,010
    Well I have a 1000 watt Enermax on my main rig and was still getting eue with the gpu client, probably the 10% mike is talking about. On Ubuntu 64 8.04 vs. windows SMP with a C2D linux gets about 800 more PPD with the a1s, it rocks with the a2s...


  12. #12
    Banned
    Join Date
    Feb 2006
    Location
    Hhw
    Posts
    4,036
    Quote Originally Posted by MikeB12 View Post
    yes, you'll get more smp ppd in linux with a1 core and a good bit more with a2 core. my x2 4600 when I ran linux got about 1000ppd on a1 and 1700 on a2..

    but there is no gpu linux client... so you have to use wine. don't ask me how, I'm a linux noob.. all I know how to do is install the os, enable samba for fahmon, and install smp...
    wine is a creature I've never seen before... I think Shadow or one of the other guys knows, there's a thread around here somewhere..

    I run all Vista rigs, Q6600's-dual smp (3000ppd), and nvidia gpu (5000ppd).

    RR runs a linux smp folder on a 4ghz QX9650 and gets like 8000ppd out of the cpu alone.


    BUT, what I was demonstrating with my post above is PG and FAH have issues with getting their optimizations stable in windows for GPU and SMP.. I think that power excuse by Pande in the op is a reach for an excuse.

    Easy to try. Get a pos 400w psu and run it :P

    Sorry, been busy all day. http://code.google.com/p/fahwatch/

    Edit: WFO thanks for that but I can't even recall mentioning this as I didn't spend a lot of time on the fah forums today, and when I did it was on a different mirror ( server went down, came up but moved location, it's been up for a while on 2 ip's ). So I made a bunch of posts, and poof the next minute they were gone
    Last edited by Marvin_The_Martian; 11-07-2008 at 05:10 PM.

  13. #13
    Attack Dachshund
    Join Date
    Jul 2007
    Location
    South Carolina USA
    Posts
    3,161
    lol Marvin. yeah, I've got a couple of those laying around.. actually I've got an old 350w if I could find it, and it's got like 5yrs capacitor aging on it. but don't think I'll smoke my hardware over a little test of a Pande excuse.

    that would be a waste of resources on the scale of GPU1..

  14. #14
    Banned
    Join Date
    Feb 2006
    Location
    Hhw
    Posts
    4,036
    That's a fact

    Though then again, they probably did learn things from it... like making sure they never never ever take a chance of that repeating itself.

    Btw I heard someone say there are no other gpu dc projects, but there are: gpugrid. If anyone really is sick of fah, check that out but I'll stay right here if you don't mind.. now I haven't drunk anything in a year or so and I'm going to make me a glass of Krupnik.. anyone know why

  15. #15
    Attack Dachshund
    Join Date
    Jul 2007
    Location
    South Carolina USA
    Posts
    3,161
    ok, you piqued my curiosity...
    http://en.wikipedia.org/wiki/Krupnik
    Krupnik, or Krupnikas as it is known in Lithuanian, is a traditional sweet vodka, similar to a liqueur, based on grain spirit and honey, popular in Poland and Lithuania. It consists of 40%-50% (80-100 proof) alcohol, honey and up to 50 different herbs. It originated in the territories of present day Lithuania.

    It is a distant relative of the mieducha, a honey-made spirit popular in all Slavic countries.

    Legend has it that the recipe was created by the Benedictine monks at a monastery in Niaśviž which was founded by Mikołaj Krzysztof "Sierotka" Radziwiłł. Known in Poland and Lithuania at least since 16th century, it soon became popular among the szlachta of the Polish-Lithuanian Commonwealth. There are numerous recipes preserved to our times in countless szlachta diaries. Krupnikas was also used as a common medicinal disinfectant to Polish soldiers in World War II.

    At times, spicy seasonings and herbs are added to flavour. The brand of the honey and the ratio of seasonings are key points for final taste of krupnik. It is either served hot or chilled. A specific sort of krupnik which contains more herbs and less honey is brewed by Karaims.

    Krupnik is also the Polish name of a barley soup.
    sounds pretty good..

  16. #16
    Banned
    Join Date
    Feb 2006
    Location
    Hhw
    Posts
    4,036
    Tastes good yes
    [Attached images: IMAGE_029.jpg, IMAGE_031.jpg]

  17. #17
    Attack Dachshund
    Join Date
    Jul 2007
    Location
    South Carolina USA
    Posts
    3,161
    Marvin must have been pretty buzzed, he's taking pictures laying down..

    btw: here's the thread at fah on this latest finger pointing episode by pande... http://foldingforum.org/viewtopic.php?f=52&t=6761 I still can't believe people have bought into this over there.. Pande even brought up that this is testable like Marvin said, but I don't see any test results... know why? because it won't yield any verifiable results. they'll still get a 10% eue rate with a HX1000 feeding a single card, just like they did when it was on an Antec EW430.
    Last edited by MikeB12; 11-08-2008 at 12:06 AM.

  18. #18
    Xtreme Addict
    Join Date
    Jul 2007
    Location
    Germany
    Posts
    1,592
    IF Pande wants to be so nice and put all the blame on nV instead of just saying that they didn't get GPUv2 to run stable with optimisations enabled, and IF it's really power issues, which I don't believe, maybe it's the power circuitry on the cards themselves that isn't made for the stress that fah is putting them under, i.e. it's irrelevant what PSU you use, the card will still choke?
    The XS Folding@Home team needs your help! Join us and help fight diseases with your CPU and GPU!!


  19. #19
    Attack Dachshund
    Join Date
    Jul 2007
    Location
    South Carolina USA
    Posts
    3,161
    yup p2501, which leads us right back to it's a software issue... unless PG thinks they have the influential power to revamp Nvidia's mfr process on the fly.. and reproduce all G92 core cards... ---and then the opium wore off...

  20. #20
    Attack Dachshund
    Join Date
    Jul 2007
    Location
    South Carolina USA
    Posts
    3,161
    a little fah forum drama for those that don't read fah forum... from that thread.

    Quote Originally Posted by Ivoshiee
    Quote Originally Posted by mikeb12
    Quote Originally Posted by Xilikon
    I believe this is bogus. A quality PSU like Antec, Seasonic or Corsair can run it well. I ran 2x8800GT with a Corsair VX550W without issues and right now, 2 of my GPU computers have just a 500w PSU without tossing an EUE. If it EUEs even with a 1000W psu, something else is causing issues, not an insufficient PSU.
    I'm so with you brutha... I still can't believe we have people in this thread that have been around hardware and the overclocking community for so long, buying into this story. this story sounds like a stab in the dark to explain the core optimization problems.

    face the facts people. core 1.15-1.18 was not ready for the masses across all cuda hardware.. period.. I know it's a hard pill to swallow. just buck up and accept we'll be going back to the previous core speed level in a later revision...
    I do not understand why it is so hard to believe that stated reasons can be at least one cause of the issue.
    A modern IC is not just a block of silicon and local power issues are not unimaginable things.

    Who else to believe than nVIDIA itself?

    TDP (http://en.wikipedia.org/wiki/Thermal_Design_Power) and actual IC power draw is not the same.

    Quote Originally Posted by mikeb12
    OK Ivoshiee, fair enough...

    Then let's say it is the IC.

    What does the fah project plan to do about it?
    A. revamp nvidia's mfr process to improve the IC...
    B. correct the client software so that it doesn't cause this...

    are you an unrealistic dreamer (A) or a logical realist (B)?

    it's about the solution here...
    Even if the idea/theory is coming straight from the mfr, it's kind of ridiculous to suggest people start upgrading their psu's, when many of us are already running overpowered psu rigs on +$100 quality psu's... and that is what the OP is suggesting and how readers interpret it... it's a core problem and it needs to be corrected as such.... pointing fingers at IC's, mfr's, and psu's does not solve the underlying issue.

  21. #21
    Banned
    Join Date
    Feb 2006
    Location
    Hhw
    Posts
    4,036
    Quote Originally Posted by MikeB12 View Post
    a little fah forum drama for those that don't read fah forum... from that thread.
    Idk Mike, there are things there which make me wonder. Ok, they haven't linked us to the data they said they gathered in conjunction with Nvidia, which I think they felt they don't need to. It's like BenchZowner here and his degradation thread: no pics, no movies, nothing, yet we believe him because he is a trusted community member.

    Like you said it does not have to be the psu alone, it could be the psu in combination with a certain card in combination with other hw which causes the amps&volts to become unstable causing the resulting instabilities. To me that sounds very plausible.

    Just woke up, need to check the fah forums to see if they've got more already than your quote. If not I'm going to ask them to ask the Nvidia rep. who came to this conclusion to post on the fah forums and offer their testing data. It's only a fair request given the controversy, and I think PG would not object to it.

  22. #22
    Attack Dachshund
    Join Date
    Jul 2007
    Location
    South Carolina USA
    Posts
    3,161
    Marvin's awake at 11:46AM after a night of Krupnik

    yeah I know the theory is plausible, but the request for users to upgrade psu's to fix it is not nice or cool or even logical to the user community. if the theory of nvidia hardware not being able to handle the core optimizations is true, then the core needs to be corrected even if that means rolling the speed back... the way it is being handled is they are pointing fingers at everything external to the core itself.. it's pure fantasy for them to expect nvidia to redesign their mfr process to incorporate fah's new core or for users to go out and spend money on new psu's to try and fix an obvious core issue..

    Are you going out to buy a new psu this morning? I know I'm not, and neither are most fah users.
    Last edited by MikeB12; 11-08-2008 at 02:52 AM.

  23. #23
    Banned
    Join Date
    Feb 2006
    Location
    Hhw
    Posts
    4,036
    Quote Originally Posted by me
    I think it's not that implausible, could be the psu as main guilty component but even then you have so many factors outside the psu alone. The power in your house: I can tell you I lived in an apartment where the light dimmed when I turned the vacuum cleaner on, and I couldn't leave my pc on there all day because I would get bsod's all the time ( didn't stay there long of course ). Then there is the power draw from other components, and the quality of the power circuitry on the card itself.

    Remember we're talking about numerical precision here, which goes deeper than graphical stability. Remember the client has an integrity check which gets triggered when any data seems suspicious; there only has to be one moment where the card can not draw its power and blam.. eue. Where other programs might be much more forgiving, Folding is not.

    I'm of course pulling this from a dark place, as I don't have their testing data, just some trust in what they say, so I would like to request that the community is shown the testing data from Nvidia. Get Scott who tested it to show his data, and have trust in this community to be able to take the data for what it's worth.
    I got a 570w ( pos trust psu ) with a single low end card. I'm quite sure my psu is not to blame for eue's I had with the larger proteins.. If it were the psu, wouldn't it mean it would break with smaller wu's as well? Though maybe 576 atoms are not enough to utilize the full core and that's why it draws less power ( not maybe, been confirmed by people with Kill A Watt meters )...

    Edit: And they haven't said people should upgrade psu's, they're not forcing anyone, at least I don't see it that way. They point out something they think is the reason the issues appear not to be bound to a single type of card(s), though the lower end class have even worse onboard power circuitry on them, explaining why 8600's would be that much more susceptible?

    Mike you should look it up but I posted this some time ago as a suggestion in the beta forums and I think I made a better point back then than now, has to start having effect
    Last edited by Marvin_The_Martian; 11-08-2008 at 03:03 AM.

  24. #24
    Attack Dachshund
    Join Date
    Jul 2007
    Location
    South Carolina USA
    Posts
    3,161
    yeah, someone made a valid point about the purpose of this announcement that kinda slipped by me on its interpretation...

    Quote Originally Posted by mikeb12
    Quote Originally Posted by P5-133XL
    The proper attitude should be:

    Vijay, thank you for telling us this. Now we can understand and deal with the issue.
    if this is the case then I apologize for coming down on the OP.


    What I'm hearing below is a suggestion to upgrade the psu on my machines to fix the small eue rate experienced on my 9600gso, 9800gt, and 2 8800gt's, when they are already running beefy psu's. not that I would consider doing that.. my gpu's turn in 10-20 back to back 0-100% wu's, then kick out an unstable eue... it's hard to imagine they can run through 10-20 consecutive successful wu's if power was an issue.
    Quote Originally Posted by VijayPande
    Engineers at NVIDIA (notably Scott LeGrand) have come up with a theory for the EUE's seen in core 1.15 (and a few others in the 1.15 to 1.18 range) on certain hardware. They found that this core had code optimizations that drove the GPU so hard that it would draw a lot more electricity (one sign of this was running hotter). In some boxes, this was too much electricity and this led to numerical instabilities. When the same machine was given a beefier power supply, the problem went away.

    We've been told that 8800's require 600W power supplies, but we're finding that even a little bigger (eg at least 650W) is important to leave some room for error. We are working to see if there is some way to detect this issue in software, but for now, if you're getting EUE's on the NV GPU client, this is something to consider.

    By the way, this will be very important for us to consider future code optimizations. NV core v1.19 removed some optimizations to solve this problem, but there are many cards which would run fine w/this more optimized code. If we can find a way to detect whether the card can draw enough power, we may be able to choose different code paths to allow for greater optimization for cards which can handle it.

    We're still looking into this. For now, if you're seeing issues with your card, please consider trying out a bigger power supply. We will continue to look to see if this is indeed the problem and what we can do to help the situation such that the code runs stably on all machines.
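
The pattern described in this post (10-20 clean work units, then one EUE) can be turned into a rough failure rate for comparison with the roughly 10% figure mentioned earlier in the thread. A minimal sketch, assuming each run of clean work units ends in exactly one EUE; `eue_rate_percent` is a hypothetical helper, not part of any F@H tool:

```python
def eue_rate_percent(clean_wus: int, eues: int = 1) -> float:
    """EUEs as a percentage of all work units attempted."""
    return eues / (clean_wus + eues) * 100


# 10 clean work units followed by one EUE
print(round(eue_rate_percent(10), 1))  # 9.1

# 20 clean work units followed by one EUE
print(round(eue_rate_percent(20), 1))  # 4.8
```

So that pattern works out to somewhere between about 5% and 9% of work units, which is in the same ballpark as the 10% quoted above.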

  25. #25
    Banned
    Join Date
    Feb 2006
    Location
    Hhw
    Posts
    4,036
    I don't understand your above post? What are you saying.. need more
