XtremeSystems Forums

Old 05-19-2009, 11:31 AM   #1
Tetedeiench
Xtreme Member
 
Join Date: Dec 2006
Posts: 211
OCCT 3.1.0 shows HD4870/4890 design flaw - they can't handle the new GPU test !

**UPDATE 05/23/2009**

The problem has been confirmed by professional testing done by French websites in their labs. (I'll let you use Google translation to understand what's going on.)
http://www.canardpc.com/news-36049-g...et_4890__.html
http://www.pcinpact.com/actu/news_multi/51011.htm

It is *NOT* related to temperature, but to a weakness in the power supply stages of the reference-design HD4870/4890 cards from AMD. We don't know yet whether the VRMs themselves are at fault or whether the OCP is triggering.

The problem occurs with the OCCT GPU:3D test at stock frequencies. You can also trigger it with Furmark (with the exe renamed to bypass the ATI/AMD limitations) by bumping the vGPU a little; Furmark is a tad less effective than GPU:3D on those cards, which is why the extra voltage is needed.

It seems the problem occurs when you reach around 82A/83A on the VRM of those cards. Use RivaTuner with the appropriate plugins for monitoring.
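
For anyone logging VRM current while trying to reproduce this, here is a minimal sketch of how one might look for the trip point in a log. The log format (a list of `(seconds, amps)` samples), the function name and the 82.0 A constant are assumptions for illustration only; RivaTuner's actual log layout is not reproduced here.

```python
from typing import Iterable, Optional, Tuple

# Assumed trip level, taken from the ~82/83 A figure reported above.
TRIP_AMPS = 82.0

def first_trip(samples: Iterable[Tuple[float, float]],
               limit: float = TRIP_AMPS) -> Optional[Tuple[float, float]]:
    """Return the first (time_s, amps) sample at or above `limit`, else None."""
    for time_s, amps in samples:
        if amps >= limit:
            return (time_s, amps)
    return None

# Hypothetical samples, NOT real measurements from any card.
example_log = [(0.0, 41.5), (1.0, 63.0), (2.0, 79.8), (3.0, 82.4), (4.0, 12.0)]

hit = first_trip(example_log)
if hit is None:
    print("limit never reached in the logged samples")
else:
    print(f"limit reached at t={hit[0]}s with {hit[1]}A")
```

Keep in mind that the spike may be too brief for the polling interval to capture (see the protection-mode screenshot description in the original post), so a `None` result does not prove the limit was never crossed.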



Original Post :





First of all, if you want to reproduce the crash, be sure to read the "Is there a specific test configuration to reproduce it ?" section !

Hello guys,

Who are you ?
I'm the guy behind the well-known program OCCT, which has recently become an all-round stability-check program (CPU, GPU, power supply). English is not my native language, so please excuse the mistakes.

What's the point ?
Recently, I've been working a lot on the RC1 version, and especially on the GPU:3D test. It has been improved dramatically, and this new test revealed a hardware design flaw in the new Radeon HD4870/4890 cards that follow the ATI/AMD reference design. Cards like the PowerColor 4870 PCS+ are NOT affected, as they use a custom design.

They basically crash because they can't handle the load. Early testing shows the VRM cannot supply enough power to the card. That's all I could dig up so far, with limited testing means. They seem limited to 82/83A. Please help us dig deeper!

For a 3D developer, that means: "Do not optimize your code too much for ATI cards, or you could reach the limits and crash them".

What's this test doing ?
The new test is still a furry donut, but it sports, among other new features, a shader complexity parameter which, as its name states, sets how complex the shader will be, i.e. the amount of work the graphics card has to do in one pass. The highest value is not always the best. For HD4XXX cards, 3 is the best value.

Let me stress the following points :
  • This test uses DirectX 9, which is updated SEPARATELY from DirectX 10 and 11. Install it from the Microsoft website.
  • I do NOT use anything other than DirectX's basic functions. Really, nothing fancy. Shader Model 3 shaders, a lot of alpha blending... and that's it!

Is there a specific test configuration to reproduce it ?
Yes. First download OCCT by going to the official website and grabbing the RC1: http://www.ocbase.com/forum/viewtopic.php?f=5&t=68

Next, the goal is simple: maximise the GPU load. Here is the way to do so on those cards. Be sure to use these settings!
  • Enable Fullscreen Mode
  • Disable Errorcheck Mode (comparing images is NOT effective)
  • Use a high resolution, preferably the native resolution of your screen (e.g. 1680x1050 for a 22" LCD)
  • Set Shader Complexity to 3 for HD4XXX cards
Click "Go" and watch your screen go black. Your card has gone into protection mode. Frequencies have dropped to 200MHz. A reboot is needed.

How many cards are affected ?
Right now, we've successfully crashed about 10 different cards using this test, with a lot of different power supplies (ranging from a 550W Antec to a 1500W ToughPower (!!!); we also had Seasonic, Corsair, etc.).

Why are you so sure your crappy test is not at fault here ?
Hey, very good question. Here is why :
  • Underclocking the card makes the test run fine! So my code is supported by the GPU and drivers.
  • The test runs on HD3XXX cards.
  • The test runs when you lower the load somehow, for example by lowering the resolution drastically or lowering the Shader Complexity parameter to 0. (Enabling V-Sync does NOT lower the load, as EVERY FRAME is still calculated, just not sent to the screen.)
  • We had an HD4870 using a PowerColor-specific design that did NOT suffer from the symptoms described here.

Are you 100% sure of what you're saying ?
No. 95% sure. I have limited testing hardware: in fact, just my own computers and my beta-testers'. I couldn't reach ATI/AMD; they haven't answered my mails yet. I tried to contact professional websites, so far with no luck. So I thought I'd start by communicating the info to people who already supported OCCT and knew the program, and see what happens and whether it is of interest.

Do you have recordings that show the cards crashing ?
Yes. You'll see the RivaTuner screenshots from one of my testers. He used 2 HD4870s :
The first one is a PowerColor design (GPU0). Its VRM is 4-phase instead of 3-phase as in the reference design. More robust.
The second one is the reference design (GPU1).

The first picture shows that everything is fine: the tests increase the load on the cards' power stages by changing the cards' frequencies, everything runs fine, and he doesn't hit the 82/83A borderline on the GPU1 (reference design) card :

For those who wonder, here are the frequencies that were used :
* 500/500 for both
* 500/600 for both
* 500/700 for both
* 800/500 GPU0 750/500 GPU1
* 800/600 GPU0 750/600 GPU1
* 800/700 GPU0 750/700 GPU1

This picture shows the test going into protection mode. The card goes to 200MHz. The current spike is too brief to be recorded, unfortunately :


That's it, you know everything about the problem. Please help us dig deeper into it, and if I'm wrong, please, oh please, correct me! But I really do think we have something there...

If a hardware guru, or someone else, could conduct more advanced testing and confirm all this...

Last edited by Tetedeiench; 05-23-2009 at 04:39 AM.
Old 05-19-2009, 11:35 AM   #2
EniGmA1987
Xtreme Addict
 
Join Date: Oct 2006
Posts: 1,138
Sorry, but it just seems highly unlikely to me that, if modern GPUs sometimes pull 100 amps during full load in modern games, ATI would design a card that can only handle a max load of 83 amps.
__________________
Quote:
Originally Posted by Shintai
Imagine Microsoft said that Windows 7 would only work on Acer/HP/Dell/Lenovo machines.

Its just Apple thats full of BS
Tyan S2932G2NR-SI
AMD Opteron quads @ 2.6GHz (Shanghai) x2
16GB of Kingston DDR-800 ECC RAM
ATI 4890 1GB

ASUS M3A79-T Deluxe
Phenom II 940 @ 3.75GHz w/ 1.4v
NB and HTT @ 2.75GHz w/ 1.4v & 1.3v
G.Skill PI @ DDR1150 w/ 2.1v
ATI 4890 @ 1022MHz core
Old 05-19-2009, 11:38 AM   #3
Reznik Akime
We swam on!
 
Join Date: Sep 2004
Posts: 1,289
What drivers were you using? Possible driver hiccup? If this is the case I would imagine that someone would have experienced this before outside of OCCT.
Old 05-19-2009, 11:39 AM   #4
Manicdan
Xtreme Mentor
 
Join Date: Dec 2007
Posts: 2,560
Why does the memory clock jump to almost 3200MHz when the core clock drops to 200 in that last picture showing it going into protection mode?
Old 05-19-2009, 11:39 AM   #5
I34z1k
Xtreme Member
 
Join Date: Oct 2006
Location: South Africa
Posts: 386
Very interesting!

Nice work i think
__________________
920 c0
6gig Mushkin 1600 6-7-6-18 (Many thanks to TheGoatEater)
285gtx
Dfi x58 t3eh6
Old 05-19-2009, 11:40 AM   #6
ajaidev
Xtreme Addict
 
Join Date: Jul 2008
Location: Shimla , India
Posts: 1,129
I can confirm that the VRMs used on some of the 4850s were a joke; they got really hot under load. Maybe your continuous load caused some wear and tear (due to heat) and made them lose some power for a fraction of a second when the core needed it the most. Heat does lower efficiency.

Try putting a heatsink on the VRM and see if that solves anything!!
Old 05-19-2009, 11:42 AM   #7
Ch@pS
Xtreme Member
 
Join Date: Mar 2008
Posts: 106
Unlikely or not, that's what the tests are showing. I'm getting a 4890 later this week and have a 4870 sitting in a friend's PC; I will test both within the next week and get back to you about this.
__________________
Mother used to say, if you want you'll find a way, but mother never danced through fire showers.
Old 05-19-2009, 11:46 AM   #8
Manicdan
Xtreme Mentor
 
Join Date: Dec 2007
Posts: 2,560
I'm also not sure if you noticed, but the amps being put out are different for each card: GPU1 is ahead by 2-5A across the board, except in the last test where it's lower by almost 5A, and then the failure happens after you've hit the next jump.

I would adjust the program to give it a slightly more complex render each frame. I really don't know if you could or not, but try to make the increase very slight, and OC the card so you know that it will hit the cap of 82/83A before your program gives it the most complex image to render.
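
The fine-grained ramp suggested here could be sketched roughly as follows. This is only an illustration under stated assumptions: the continuous load knob, the `read_amps` callback and the `fake_sensor` function are hypothetical stand-ins, not anything OCCT or RivaTuner actually exposes.

```python
from typing import Callable, List, Tuple

def ramp_until_limit(read_amps: Callable[[float], float],
                     limit_amps: float = 82.0,
                     step: float = 0.05,
                     max_load: float = 3.0) -> List[Tuple[float, float]]:
    """Raise a load parameter in small steps, recording (load, amps),
    and stop at the first step where the reading reaches the limit."""
    trace: List[Tuple[float, float]] = []
    n_steps = int(round(max_load / step))
    for i in range(n_steps + 1):
        load = i * step
        amps = read_amps(load)
        trace.append((load, amps))
        if amps >= limit_amps:
            break  # first reading at or above the limit: stop ramping
    return trace

# Hypothetical stand-in for a real sensor: current rises linearly with load.
def fake_sensor(load: float) -> float:
    return 40.0 + 15.0 * load

trace = ramp_until_limit(fake_sensor)
last_load, last_amps = trace[-1]
print(f"stopped at load {last_load:.2f} with {last_amps:.1f}A after {len(trace)} steps")
```

The point of the small step is that the last recorded reading below the limit and the first one at or above it bracket the trip point tightly, instead of jumping straight past it.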
Old 05-19-2009, 11:48 AM   #9
Tetedeiench
Xtreme Member
 
Join Date: Dec 2006
Posts: 211
Quote:
Originally Posted by EniGmA1987 View Post
Sorry, but it just seems highly unlikely to me that, if modern GPUs sometimes pull 100 amps during full load in modern games, ATI would design a card that can only handle a max load of 83 amps.
Well, we had an overclocked 4870 PCS+ (Asus) that could pull 106A, flawlessly, with the very same test configuration.

This value was reached with a heavy overclock.

My test pulls about 87A (measured on the non-reference design) at stock frequencies. That's an estimate.
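
As a back-of-the-envelope illustration of why underclocking would make the test pass, here is a small sketch. It assumes that VRM current scales linearly with core clock at fixed voltage (a rough first-order model that ignores memory clock and leakage) and takes 750 MHz as the reference card's stock core clock, as in the frequency list in the first post. The function name is made up for this example.

```python
def max_clock_under_limit(stock_mhz: float, amps_at_stock: float,
                          limit_amps: float) -> float:
    """Highest core clock that stays at or below the current limit,
    ASSUMING current is proportional to core clock at fixed voltage."""
    if amps_at_stock <= 0:
        raise ValueError("amps_at_stock must be positive")
    return stock_mhz * limit_amps / amps_at_stock

# Figures from the thread: ~87A at stock (an estimate), ~82A limit.
estimate = max_clock_under_limit(stock_mhz=750.0, amps_at_stock=87.0,
                                 limit_amps=82.0)
print(f"estimated ceiling: {estimate:.0f} MHz")  # roughly 707 MHz
```

Under these assumptions, a drop of only about 40 MHz on the core would be enough to get back under the limit, which would be consistent with the observation that underclocking makes the test run.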
Old 05-19-2009, 11:49 AM   #10
Tetedeiench
Xtreme Member
 
Join Date: Dec 2006
Posts: 211
Quote:
Originally Posted by Reznik Akime View Post
What drivers were you using? Possible driver hiccup? If this is the case I would imagine that someone would have experienced this before outside of OCCT.
We tried about 4 driver versions. As underclocking makes the test work, a driver version problem is ruled out, especially since the Asus design works with the very same driver version.

My test pulls about 30-40% more power on the VRM than Crysis. That's why you don't see this outside OCCT.
Old 05-19-2009, 11:52 AM   #11
Tetedeiench
Xtreme Member
 
Join Date: Dec 2006
Posts: 211
Quote:
Originally Posted by Manicdan View Post
I'm also not sure if you noticed, but the amps being put out are different for each card: GPU1 is ahead by 2-5A across the board, except in the last test where it's lower by almost 5A, and then the failure happens after you've hit the next jump.

I would adjust the program to give it a slightly more complex render each frame. I really don't know if you could or not, but try to make the increase very slight, and OC the card so you know that it will hit the cap of 82/83A before your program gives it the most complex image to render.
It is doable, but it would only tell you, programmatically, that it would fail at complexity 3.

We did a lot of testing (using the frequencies, as that is the easiest way to increase the load on the VRM), and we came to this conclusion: if you go through the 82A barrier, your card goes into protection mode.

The problem is: a simple app like OCCT GPU can make this happen. And believe me, I'm not a 3D guru.

It is as if a CPU could not handle Linpack, and crashed.
Old 05-19-2009, 11:54 AM   #12
Eastcoasthandle
Xtreme Guru
 
Join Date: Jul 2005
Posts: 4,372
Having done this, in which games does the problem shown by this new test of yours appear?
Old 05-19-2009, 11:54 AM   #13
kemo
Xtreme Addict
 
Join Date: Apr 2006
Location: Cairo
Posts: 2,280
My HD4850 went past 115C with an older version of the test. It is a non-reference HD4850 with Zalman cooling, so I won't be surprised if some cards fail.
__________________
Intel Core I7 920 @ 3.8GHZ 1.28V (Core Contact Freezer)
Asus X58 P6T
6GB OCZ Gold DDR3-1600MHZ 8-8-8-24
XFX HD5870
WD 1TB Black HD
Corsair 850TX
Cooler Master HAF 922
Old 05-19-2009, 11:55 AM   #14
Tetedeiench
Xtreme Member
 
Join Date: Dec 2006
Posts: 211
UPDATE : the 4870 in the test is a 4870 PCS+ from PowerColor, whose VRM is a 4-phase digital VRM instead of a 3-phase one, and that's why it is not crashing.

Sorry, I thought it was an Asus design. It is not. My mistake.
Old 05-19-2009, 11:59 AM   #15
Eastcoasthandle
Xtreme Guru
 
Join Date: Jul 2005
Posts: 4,372
Quote:
Originally Posted by Tetedeiench View Post
UPDATE : the 4870 in the test is a 4870 PCS+ from PowerColor, whose VRM is a 4-phase digital VRM instead of a 3-phase one, and that's why it is not crashing.

Sorry, I thought it was an Asus design. It is not. My mistake.
So you basically designed a new test which is designed for a 4-phase digital VRM instead of a 3-phase one?

Last edited by Eastcoasthandle; 05-19-2009 at 04:17 PM.
Old 05-19-2009, 11:59 AM   #16
Reznik Akime
We swam on!
 
Join Date: Sep 2004
Posts: 1,289
Quote:
Originally Posted by Tetedeiench View Post
We tried about 4 driver versions. As underclocking makes the test work, a driver version problem is ruled out, especially since the Asus design works with the very same driver version.

My test pulls about 30-40% more power on the VRM than Crysis. That's why you don't see this outside OCCT.
Ah, gotcha. Well, this kinda miffs me. A latent problem with almost all early 48xx series cards, save for the non-reference ones! I would test this on my own, but I've got transfers going and can't do a reboot.
Old 05-19-2009, 12:00 PM   #17
ChinStrap
.
 
Join Date: Dec 2007
Location: CR:IA
Posts: 268
Seems to me you're putting an unrealistic load on the GPU anyway.
Old 05-19-2009, 12:01 PM   #18
v_rr
Xtreme Addict
 
Join Date: Nov 2005
Posts: 1,082
Design flaw? lol

You have a bench that doesn't run on those cards. So what? What does it prove?
Yes, you can say that there are design flaws, but not this way.

By the way, every chip these days has design flaws. Look at the CPU errata from Intel and AMD.
__________________
Quote:
Originally Posted by Shintai View Post
And AMD is only a CPU manufactor due to stolen technology and making clones.
Old 05-19-2009, 12:05 PM   #19
NapalmV5
LOVE=MC²
 
Join Date: Dec 2002
Posts: 3,444
crashes? turn up the fan

fan settings have been plaguing the 3000/4000 series from the beginning

the cards that crash ^ need a fan bios settings fix or manual fan settings
__________________
NapalmV5 YouTube vidz
Old 05-19-2009, 12:06 PM   #20
Aberration
Xtreme Enthusiast
 
Join Date: May 2008
Posts: 622
IIRC the reference design has over-current and over-voltage protection.

My guess is you are hitting this. So I don't think this is a design flaw, but an intended result of the design.
__________________
Quote:
Originally Posted by alacheesu View Post
If you were consistently able to put two pieces of lego together when you were a kid, you should have no trouble replacing the pump top.
Old 05-19-2009, 12:09 PM   #21
Manicdan
Xtreme Mentor
 
Join Date: Dec 2007
Posts: 2,560
Quote:
Originally Posted by Tetedeiench View Post
It is doable, but it would only tell you, programmatically, that it would fail at complexity 3.

We did a lot of testing (using the frequencies, as that is the easiest way to increase the load on the VRM), and we came to this conclusion: if you go through the 82A barrier, your card goes into protection mode.

The problem is: a simple app like OCCT GPU can make this happen. And believe me, I'm not a 3D guru.

It is as if a CPU could not handle Linpack, and crashed.
Going from complexity 0 to 3 happens in steps that are too large. The goal would be to have it increase in extremely small amounts and watch the power load over time.
It seems like the card that failed was increasing the amps by a much larger amount, then on reaching 80A gave in and couldn't provide any more until it reached the failure point.

Also, could you show the total PC watts when running a game and when running your benchmark, so we can see the load difference it is able to put onto the cards?

Sounds like this is another Furmark, a program on which ATI purposefully made the cards run at a lower frequency to prevent damage.
Old 05-19-2009, 12:09 PM   #22
NapalmV5
LOVE=MC²
 
Join Date: Dec 2002
Posts: 3,444
Quote:
Originally Posted by Aberration View Post
IIRC the reference design has over-current and over-voltage protection.

My guess is you are hitting this. So I don't think this is a design flaw, but an intended result of the design.

that's a possibility.. then again it may just be bad fan settings or a fan not working properly ^
__________________
NapalmV5 YouTube vidz
Old 05-19-2009, 12:16 PM   #23
xoqolatl
Thrusters on full!
 
Join Date: Dec 2006
Location: Poland
Posts: 2,245
OCCT is a waste of time.....it's not ranked on HWBot

So the card has an OCP, and it's now considered a design flaw? Have you seen any game that puts that much load on a GPU? Have you at least created a special map for any game that would put that much load?
__________________
Quote:
Originally Posted by miahallen View Post
FurMark is a waste of time.....it's not ranked on HWBot
W3520 on Rampage II Gene | hardmounted TRUE + 2x 120mm push-pull | 3x 2gig of HCF0
Zombie GTX280 | look mom, i has no IHS | Thermalright HR-03 GT
Lian-Li PC-A70B | Brushed aluminium tank

MOA 2009 Poland #2

Test bench: Clarkdale, several AMD Confidentials
Old 05-19-2009, 12:18 PM   #24
Hornet331
Xtreme Mentor
 
Join Date: Jul 2007
Location: Austria
Posts: 3,469
Quote:
Originally Posted by xoqolatl View Post
OCCT is a waste of time.....it's not ranked on HWBot

So the card has an OCP, and it's now considered a design flaw? Have you seen any game that puts that much load on a GPU? Have you at least created a special map for any game that would put that much load?
Try Furmark and rename the exe; it's the same (or nearly the same).

If you rename the exe on Furmark, it puts so much load on the card that the VRMs can reach 130°C+, and if you let them run for a while -> *poof*
Old 05-19-2009, 12:29 PM   #25
marten_larsson
Xtreme Member
 
Join Date: Dec 2008
Location: Sweden
Posts: 286
Quote:
Originally Posted by Hornet331 View Post
Try Furmark and rename the exe; it's the same (or nearly the same).

If you rename the exe on Furmark, it puts so much load on the card that the VRMs can reach 130°C+, and if you let them run for a while -> *poof*
You play furmark often? :P

This seems like a new Furmark, so it wasn't that shocking to me. One thing I find interesting is that the ATI cards seem to have a lot more power "left" compared to Nvidia's, since power consumption goes through the roof with the former and not so much with the latter (please correct me if I'm using that wrong).
All times are GMT -8. The time now is 04:17 PM.

