Thread: X2 Bug Warning: Thermal throttling bug / power saving mode use could fry your CPU!

  1. #1
    Xtreme Member
    Join Date
    Jun 2003
    Location
    England, UK
    Posts
    162
    Also affects single-core A64 "DH-E6".
    Still using the ADA4400DAA6CD CCBWE 0517MPMW

  2. #2
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Quote Originally Posted by NickK
    Also affects single-core A64 "DH-E6".
    No, it affects dual cores, both Toledo/Opteron (JH-E6) and Manchester (BH-E6), at 2.2GHz and higher.

    No single cores.

    See page 11.

    But this is not likely to be triggered by any power saving stuff, as the problem only happens when going from FULL LOAD to STPCLK.

    And there's no mention of "frying your CPU".
    Last edited by terrace215; 06-21-2005 at 09:30 PM.

  3. #3
    Xtreme Member
    Join Date
    Apr 2005
    Posts
    369
    Looks like AMD is going the way of Intel. I'm sure their genius engineers will fix this in no time. I fully trust AMD.

  4. #4
    Registered User
    Join Date
    Mar 2005
    Location
    Sunnyvale 3453
    Posts
    21
    ...
    Hector Ruiz... circa 2011

  5. #5
    Xtreme Member
    Join Date
    Jun 2003
    Location
    England, UK
    Posts
    162
    Quote Originally Posted by terrace215
    No single cores.
    See page 11.
    You're quite right - perhaps reading the forums at 5:30am isn't such a good idea!
    Still using the ADA4400DAA6CD CCBWE 0517MPMW

  6. #6
    Xtreme Member
    Join Date
    Apr 2005
    Posts
    187

    It's not good.

    'Also effects single core A64 "DH-E6" too.' -- It affects
    BH-E4 (only X2's)
    and JH-E6 (X2's and Dual Core Opterons).

    "wonder if this could be related to p4 dualcores cooking mobos due to power transients when going into thermal protection modes (its the motherboards at fault really since power transients are caused by the vcore vregs not being able to cope with the power requirements of the CPU).
    Besides, if going from full load to all stop is that dangerous, then so is going from all stop to full load." --

    I agree, it seems likely.
    Personally I am very disappointed at the penny-pinching,
    quality-corner-cutting philosophy of consumer electronics design including
    motherboards. As a PCB designer I've always been inclined to select
    PCBs with full ground plane & power planes for each major voltage when
    designing boards a lot simpler and lower speed than 2GHz + PC motherboards!
    It really boggles my mind that they'd have such a push to try to get PCBs
    down to 4 layers for Socket 939 motherboards with PCI, AGP, DDR,
    etc. etc. that's all quite high speed and high pin count.
    I very much suspect that it's more of a matter of *luck* than *engineering*
    that they work as well as they do. Even though STPCLK / STPGNT type
    power throttling is an easy to generate extreme test case of power load
    fluctuation from ~ Zero power to full load, I suspect that many software
    situations involving CPU / I/O bursts do about as much variation of loading
    on the time scale of nanoseconds to microseconds. If the EMI design and
    power filtering, regulation, capacitors, PSU, etc. isn't up to the task of
    providing clean transitions from STPCLK to Full Run I suspect it's not REALLY
    stable running common software programs under stress / bursty conditions
    either, and that's probably a major reason for things crashing / flaking /
    burning out on many systems.

    What's needed is just better quality motherboards, 6+ layer PCBs, more
    conservative design margins of capacitors / power regulators, thicker PCB
    power traces, etc.

  7. #7
    Xtreme Member
    Join Date
    Apr 2005
    Posts
    187

    When this happens, speculations.

    I'm still trying to figure out all the common cases when this could happen.

    I think from what I've read that

    (a) STPCLK# can be set in chipset hardware semi-automatically
    (e.g. via CPU overheat detection & then resultant clock-throttling to protect
    the CPU's temperature but allow continued operation at a stuttered percentage
    of normal clock operating time).

    (b) STPCLK# signal is set by the chipset and can also be set via software
    by causing certain chipset registers to be used to enter the STPCLK state.

    (c) I think STPCLK# is the "hardware signal" (which can also be set by
    software) that causes the STPGNT state in the CPU as an effect of its
    presence. I think STPGNT is what the CPU does when it sees STPCLK# active.
    This is a little confusing since these seem to get referred to as
    if they're quasi-independent in some cases, but oftentimes there seems to
    be a causal link implied.

    (d) I've often seen it said that ACPI power state (C2) is what's effectively
    in operation when the AMD CPU is in STPCLK# STPGNT state. I think
    that's the usual case anyway from what I've read. I think the ACPI BIOS
    writers might be at liberty to define what C1, C2, C3, S1, S2 etc. actually
    do as long as they follow the overall rules of ACPI but I guess C2 is commonly
    chosen to be the "STPCLK, STPGNT" mode.

    (e) I've further seen it said that (S1) ACPI state is, for example, commonly
    implemented using the STPCLK / STPGNT mode which would really mean
    that in this case (S1) causes (C2) mode which is set up to work via STPCLK.
    e.g. ACPI specification 3.0, p. 414:

    "15.1.1.1 Example 1: S1 Sleeping State Implementation
    This example references an IA processor that supports the stop grant state through the assertion of the
    STPCLK# signal. When SLP_TYPx is programmed to the S1 value (the OEM chooses a value, which is
    then placed in the \_S1 object) and the SLP_ENx bit is subsequently set, the hardware can implement an S1
    state by asserting the STPCLK# signal to the processor, causing it to enter the stop grant state.
    In this case, the system clocks (PCI and CPU) are still running. Any enabled wake event causes the
    hardware to de-assert the STPCLK# signal to the processor whereby OSPM must first invalidate the CPU
    caches and then transition back into the working state."


    Given these findings it seems quite possible that, depending on the
    BIOS involved, one could enter this problematic STPCLK# state from
    "full speed" (>= 2000 MHz CPU clock) mode based on:

    (a) BIOS / chipset / utility software thermal monitoring detection of
    overheating on the CPU or other component.

    (b) Any kind of power management / ACPI / APM type BIOS function or utility
    that causes STPCLK# which might include (C2) or (S1) states depending
    on your utility / BIOS settings.

    I'm just not sure about "Cool N Quiet". I know there are tables of
    power / clock / VCore settings and transition timings that it uses to
    work. It's possible that some of those might use STPCLK to work; if
    so I'd guess it'd be the ones that operate the CPU at "almost full speed"
    but still reduced speed by some percentage like 88% or 75% of full speed.
    At least on Intel chips this kind of clock dividing does indeed use
    STPCLK to "burst" the clock between full speed and totally off in a given
    duty cycle to achieve an average effective percentage throttling.
    If AMD64 X2 CnQ uses this too then it's seemingly a risk for this
    bug also.
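The duty-cycle idea described above can be sketched with a little arithmetic. This is only an illustration of the averaging, assuming the clock is either fully on or fully stopped within each throttle period; the function name and the 2200 MHz figure are chosen for the example and do not come from any AMD or Intel document.

```python
def effective_clock_mhz(nominal_mhz: float, duty_cycle: float) -> float:
    """Average clock rate when the core runs at full speed for `duty_cycle`
    of each throttle period and is stopped for the remainder."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    return nominal_mhz * duty_cycle


# Hypothetical 2200 MHz part throttled to a few example duty cycles.
for duty in (1.0, 0.875, 0.75, 0.5):
    avg = effective_clock_mhz(2200, duty)
    print(f"{duty:6.1%} duty -> {avg:7.1f} MHz average")
```

The point for this thread is that the average drops while the instantaneous clock during each "on" window stays at the full nominal speed, so a full-speed-to-stopped transition would still happen on every throttle period if a scheme like this is in use.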

  8. #8
    Xtreme Member
    Join Date
    Apr 2005
    Posts
    187

    So what Socket 939 motherboards have the best Vcore regulators, most phases?

    What Socket 939 motherboards have the most robustly designed
    Vcore power regulators with the most number of "phases"?

    I've heard about 3-phase, 4-phase, 5-phase, etc. power converters
    for Vcore generation but I don't know what common motherboards
    for X2 / FX-53 / FX-55 Socket 939 really have the best regulator capacity.
    Anyone know? How about for S-939 AGP boards?

    AGP boards -- Asus A8V deluxe vs. MSI Neo2 plat vs. NVidia NF3 Ultra D?


    Every time I see a reference to someone "over-volting" something to
    "increase overclock stability" I sort of grimace and laugh. As an EE I know
    that "too many volts" usually just fries ICs, and doesn't really help them
    work "better/faster" in most ways. I think the MAIN effect of
    "increasing the volts" to demonstrably increase platform stability is really
    just to provide greater CHARGE / ENERGY into the "too-small" power filter
    capacitors so that when there's a large fast-rising load like a high MHz burst
    of I/O or computation that there's a better chance there'll be enough energy
    in the little filter capacitors to sustain at least the minimally necessary
    current / voltage to keep the electronics working correctly until the relatively
    slow (in comparison) voltage regulators can squirt out more energy / current
    to recharge the voltage on the filter capacitors up to the proper level again.

    Thus what's really needed isn't MORE voltage, it's BIGGER / BETTER / MORE
    capacitors and voltage regulators to provide more current / energy more
    frequently to the CPU / chips when those bursts of power are needed. If
    this were done one should be able to run at 100% nominal Vcore / Vdimm / etc.
    and get as much of an overclock as the actual speed of the ICs allows,
    with no risk of overclocking causing transient undervolting.

    More Vcore regulator phases with higher quality and amperage FETs per
    phase should help this.
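A rough back-of-the-envelope sketch of the capacitor and phase argument above, using ideal-component formulas (E = 1/2 C V^2 for stored energy, dV = I * dt / C for the sag while the capacitors alone carry a load step). Every number here is hypothetical and picked only to show the arithmetic; real boards also have ESR, ESL and regulator control-loop behaviour that this ignores.

```python
def stored_energy_j(capacitance_f: float, voltage_v: float) -> float:
    """Energy held in an ideal capacitor bank: E = 1/2 * C * V^2."""
    return 0.5 * capacitance_f * voltage_v ** 2


def droop_v(load_step_a: float, response_time_s: float, capacitance_f: float) -> float:
    """Voltage sag while the capacitors alone carry a load step: dV = I * dt / C."""
    return load_step_a * response_time_s / capacitance_f


def per_phase_current_a(total_current_a: float, phases: int) -> float:
    """Average current each regulator phase carries if the load is shared evenly."""
    return total_current_a / phases


# All values below are hypothetical, chosen only to show the arithmetic.
BULK_CAP_F = 6 * 1500e-6      # six 1500 uF capacitors
LOAD_STEP_A = 40.0            # sudden current demand when the CPU resumes
RESPONSE_TIME_S = 5e-6        # time before the regulator catches up

for vcore in (1.35, 1.45):
    sag = droop_v(LOAD_STEP_A, RESPONSE_TIME_S, BULK_CAP_F)
    energy_mj = stored_energy_j(BULK_CAP_F, vcore) * 1e3
    print(f"Vcore {vcore:.2f} V: stored {energy_mj:.2f} mJ, dips to about {vcore - sag:.3f} V")

for phases in (2, 3, 4):
    print(f"{phases} phases: {per_phase_current_a(LOAD_STEP_A, phases):.1f} A per phase")
```

In this simplified model the sag is the same at either Vcore; raising the voltage only raises the starting point, whereas doubling the capacitance halves the sag and adding phases lowers the current each FET has to pass.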

    Evidently this kind of STPCLK transient causing power regulator
    instability has been true for some time:
    http://unixmafia.port5.com/news/00211001.html

    "Overheating problems on some KT133(A) Motherboards, namely Asus
    Last update: 2002-12-05
    ...The problem
    The BIOS on several via KT133 and KT133A motherboards disable the HLT , STPCLK and STPGNT instructions of the processor (apparantly, some A7M266 boards have it too). These states are responsable for power savings (and heat production) under the APM and ACPI specifications. The boards that have these disabled therefor do not completely implement those specifications, although they do claim so in their propaganda.
    This can have several reasons, but one of the most common is to hide the fact that a particular board has an inferior power-supply. Which seems to be the case with my Asus.
    Asus uses a 2-phase power supply with 4 capacitators onboard, while most boards have a 3-phase supply with 6 capacitators. Especially the STPCLK instruction, which calls the C2 Power Management state of the Athlon processor, puts a heavy burden on the power supply, because switching between the lowest and highest power consumption can occur several times a second. The disadvantage of the C2 state is that it can interfere in realtime applications like video and audio, because it takes the processor a fraction of time to come out of it. Asus hides behind 'choppy audio' as a reason to disable both C1 and C2. If this were valid, why not supply a BIOS option to turn it on or off? "

  9. #9
    c[_]
    Join Date
    Nov 2002
    Location
    Alberta, Canada
    Posts
    18,728
    Quote Originally Posted by synergy
    What Socket 939 motherboards have the most robustly designed
    Vcore power regulators with the most number of "phases"?

    I've heard about 3-phase, 4-phase, 5-phase, etc. power converters
    for Vcore generation but I don't know what common motherboards
    for X2 / FX-53 / FX-55 Socket 939 really have the best regulator capacity.
    Anyone know? How about for S-939 AGP boards?

    AGP boards -- Asus A8V deluxe vs. MSI Neo2 plat vs. NVidia NF3 Ultra D?


    Every time I see a reference to someone "over-volting" something to
    "increase overclock stability" I sort of grimace and laugh. As an EE I know
    that "too many volts" usually just fries ICs, and doesn't really help them
    work "better/faster" in most ways. I think the MAIN effect of
    "increasing the volts" to demonstrably increase platform stability is really
    just to provide greater CHARGE / ENERGY into the "too-small" power filter
    capacitors so that when there's a large fast-rising load like a high MHz burst
    of I/O or computation that there's a better chance there'll be enough energy
    in the little filter capacitors to sustain at least the minimally necessary
    current / voltage to keep the electronics working correctly until the relatively
    slow (in comparison) voltage regulators can squirt out more energy / current
    to recharge the voltage on the filter capacitors up to the proper level again.

    Thus what's really needed isn't MORE voltage, it's BIGGER / BETTER / MORE
    capacitors and voltage regulators to provide more current / energy more
    frequently to the CPU / chips when those bursts of power are needed. If
    this were done one should be able to run at 100% nominal Vcore / Vdimm / etc.
    and get as much of an overclock as the actual speed of the ICs allows,
    with no risk of overclocking causing transient undervolting.

    More Vcore regulator phases with higher quality and amperage FETs per
    phase should help this.

    Evidently this kind of STPCLK transient causing power regulator
    instability has been true for some time:
    http://unixmafia.port5.com/news/00211001.html

    "Overheating problems on some KT133(A) Motherboards, namely Asus
    Last update: 2002-12-05
    ...The problem
    The BIOS on several via KT133 and KT133A motherboards disable the HLT , STPCLK and STPGNT instructions of the processor (apparantly, some A7M266 boards have it too). These states are responsable for power savings (and heat production) under the APM and ACPI specifications. The boards that have these disabled therefor do not completely implement those specifications, although they do claim so in their propaganda.
    This can have several reasons, but one of the most common is to hide the fact that a particular board has an inferior power-supply. Which seems to be the case with my Asus.
    Asus uses a 2-phase power supply with 4 capacitators onboard, while most boards have a 3-phase supply with 6 capacitators. Especially the STPCLK instruction, which calls the C2 Power Management state of the Athlon processor, puts a heavy burden on the power supply, because switching between the lowest and highest power consumption can occur several times a second. The disadvantage of the C2 state is that it can interfere in realtime applications like video and audio, because it takes the processor a fraction of time to come out of it. Asus hides behind 'choppy audio' as a reason to disable both C1 and C2. If this were valid, why not supply a BIOS option to turn it on or off? "
    I suspect this is the way it is happening.

    In theory, adding more voltage until the described effect creates a highly logarithmic increase in need for voltage shows how much "power" your board is capable of handling.

    For example:

    A processor scales well with voltage on one board, but not as well with another.

    This implies that the board is at fault, and if you place an X2 on the weaker board you may exceed the maximum the board can deliver even at default settings (assuming the first tested CPU was a single-core A64).

    This problem isn't exactly new either. Older KT133/KT133A boards only supported up to a certain CPU due to the power draw higher models required. Even KT266/266A had this problem. Some boards that stopped at a certain point did so just because the manufacturer was too lazy to update the BIOS microcode (the board was fully capable of going beyond the "max" suggested CPU), but many (Epox 8kta3+, Asus A7V133, MSI...) couldn't handle it; some couldn't even handle the recommended CPUs.


    What I'm trying to point out is that this is not a new issue at all, and it's not being hyped at all if you remember past experiences.

    EDIT

    I should mention that I very much dislike temperature protection that is dependent on the BIOS.

    If a small IC were used to monitor PWM/Choke/CPU temperatures (accurately) and connected to the power switch (or a secondary one) so it could power off the machine even if the CPU had failed and was locked up, this would be a much better solution.

    Better yet, if a temperature-resistant co-processor were placed on the motherboard or CPU die that would function and maintain power loads (a high-power resistor may be required intermittently) to reduce transients when the "master" processor(s) failed, we wouldn't have this problem at all.


    And why does Asus always skimp on parts? If memory serves correctly they completely removed the vref circuitry from their GeForce 3 boards, feeding 3.3v direct to the I/O of the memory chips (not VDDR, which can be taken as high as 3.8v at times with voltage modifications).
    Last edited by STEvil; 06-21-2005 at 10:48 PM.

    All along the watchtower the watchmen watch the eternal return.

  10. #10
    Xtreme Member
    Join Date
    Apr 2005
    Posts
    187

    I agree with STEvil.

    I think it's "underhyped" in that many motherboards have
    very marginal quality / capacity design for their PCBs, voltage
    regulators, and filters.

    AMD designed the X2 so that it'd be compatible with the Socket 939
    "design envelope" of heatsink and motherboard design so that it should
    be supportable properly on all well designed motherboards and PSUs of
    adequate capacity with only a BIOS change.

    However the X2 *does* consume just a bit more power than ANY other
    AMD Socket 939 CPU that has ever existed, including the FX-53 / FX-55,
    and certainly uses a fair bit more power than a common
    single core Athlon 64.

    So as STEvil said, any motherboard that is "marginal" with respect to
    stability / current / voltage regulation for ANY other AMD processor
    even under overclocking conditions will perform even worse and more
    marginally / unstably given the use of the even more power demanding X2.

    The main issue that limits a PC's stress-test or overclock stability is probably
    EMI (noise causing data corruption) and voltage / current limitations causing
    glitches of the logic and undue stresses on the ICs and the integrity of the
    electrical waveforms.

    If one model of motherboard can consistently overclock a given CPU
    to a higher frequency while using lower (e.g. stock / nominal) voltages than
    another model of motherboard, the one that achieves the best overclock
    at lowest voltages must have superior timing PCB layout and EMI / power
    design.

    This STPCLK issue is probably a decent "acid test" of whether a motherboard's
    Vcore regulator phases and PCB / capacitor layout can keep the CPU
    happy even under frequent and extreme load spikes, such as
    going from STPCLK mode to RUN mode. It wouldn't make a bad
    "benchmark" type stress test, really, if someone actually made software to
    check for corruption / glitches and to stimulate this "on off on off" behavior
    many times a second.
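A very rough sketch of what such an "on off on off" checker might look like in user space. This is an assumption-laden illustration, not a real STPCLK test: a Python sleep does not assert STPCLK#, whether an idle period maps to C2 / stop grant depends entirely on the OS and BIOS, and this single-threaded loop only loads one core. The function names, seed and timing values are all made up for the example.

```python
import hashlib
import time


def burst(seed: bytes, rounds: int) -> str:
    """Deterministic CPU-bound work: hash the seed repeatedly."""
    digest = seed
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def toggle_test(cycles: int = 20, rounds: int = 20000, idle_s: float = 0.01) -> int:
    """Alternate busy bursts with idle sleeps and count wrong results.

    A non-zero return value means a burst produced a different answer
    from the reference run, i.e. some kind of computation glitch.
    """
    expected = burst(b"stpclk", rounds)  # reference result computed once
    mismatches = 0
    for _ in range(cycles):
        if burst(b"stpclk", rounds) != expected:  # "on": full load on one core
            mismatches += 1
        time.sleep(idle_s)  # "off": idle, the OS MAY enter a low-power state
    return mismatches


if __name__ == "__main__":
    print("mismatches:", toggle_test())
```

A real tool would need one worker per core, much shorter on/off periods, and some way to confirm the CPU actually entered the stop grant state, none of which this sketch attempts.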

    Will it fry your CPU / motherboard? Well the CPU could get overvolted
    if the Vcore regulators / traces "ring / spike too high" when the load
    is suddenly shut off (STPCLK happens from full load).

    It could also overvolt / ring too high when the CPU is stopped and all of
    a sudden there's a huge load demand (STPCLK is removed starting running
    again) and the regulator feeds the maximum possible voltage onto the CPU
    to compensate for the detected undervolting due to greatly increased load.

    At the least it's a high stress on the "pass current" and dynamic loading of
    the MOSFETs, CAPs, and could cause these to fail quickly.

    At the worst, it could glitch the CPU in a way that might fry it due to either
    overvolting spikes or undervolting "reverse voltage" situation where VIO is
    set to a normal level while Vcore is very much too low.

    I agree with STEvil that the "over-temperature" protection
    isn't something that should normally be triggered
    (since even overclocked X2s tend to run fairly cool), and that it's not
    the best thing to rely on anyway. However if you DO have a fan
    failure or over clock generation glitch or whatever that DOES cause the CPU
    to run way too hot, you'll be in for a potentially nasty surprise
    (possibly smoke, flames, melted solder, ruined CPU / motherboard) if
    the "last resort" C2 / STPCLK thermal throttling does NOTHING because,
    according to AMD's suggested workaround for the problem, the BIOS
    just DISABLED the STPCLK mode!

    And in a more common / serious situation, I am still unconvinced that:

    (a) Cool N Quiet or other ACPI functions on BIOSs that *DON'T* disable
    STPCLK mode cannot trigger this kind of crash inducing glitch if they're
    trying to run the CPU at 90%, 80%, 75% or similar frequency since
    in those cases one could still be at 2000, 2200 MHz clock and maybe using
    STPCLK throttling to achieve the reduction.

    (b) that other kinds of power saving states other than CnQ won't enter
    C2 or other states that use STPCLK.

    Furthermore, even supposing STPCLK is disabled in BIOS, and
    even if CnQ can't cause this problem, any motherboard that'll fail
    Vcore regulation BECAUSE of the STPCLK issue is *STILL GUARANTEED*
    to be unstable under stress testing of running normal software since
    there is very likely *SOME* combination of software and I/O related
    events that'll cause a similarly large swing from "low CPU activity to high
    CPU activity", producing too large a Vcore current demand transition
    and STILL crashing / corrupting the PC. And THAT is what may cause system
    crashing / flakiness on a day-to-day gaming / computing / overclocking
    basis.

    Personally I want a motherboard that is well enough designed that
    it CAN pass the STPCLK test at 2200+ MHz and not glitch!

  11. #11
    Registered User
    Join Date
    Dec 2004
    Posts
    19
    Quote Originally Posted by synergy
    'Also effects single core A64 "DH-E6" too.' -- It affects
    BH-E4 (only X2's)
    and JH-E6 (X2's and Dual Core Opterons).

    "wonder if this could be related to p4 dualcores cooking mobos due to power transients when going into thermal protection modes (its the motherboards at fault really since power transients are caused by the vcore vregs not being able to cope with the power requirements of the CPU).
    Besides, if going from full load to all stop is that dangerous, then so is going from all stop to full load." --

    I agree, it seems likely.
    Personally I am very disappointed at the penny-pinching,
    quality-corner-cutting philosophy of consumer electronics design including
    motherboards. As a PCB designer I've always been inclined to select
    PCBs with full ground plane & power planes for each major voltage when
    designing boards a lot simpler and lower speed than 2GHz + PC motherboards!
    It really boggles my mind that they'd have such a push to try to get PCBs
    down to 4 layers for Socket 939 motherboards with PCI, AGP, DDR,
    etc. etc. that's all quite high speed and high pin count.
    I very much suspect that it's more of a matter of *luck* than *engineering*
    that they work as well as they do. Even though STPCLK / STPGNT type
    power throttling is an easy to generate extreme test case of power load
    fluctuation from ~ Zero power to full load, I suspect that many software
    situations involving CPU / I/O bursts do about as much variation of loading
    on the time scale of nanoseconds to microseconds. If the EMI design and
    power filtering, regulation, capacitors, PSU, etc. isn't up to the task of
    providing clean transitions from STPCLK to Full Run I suspect it's not REALLY
    stable running common software programs under stress / bursty conditions
    either, and that's probably a major reason for things crashing / flaking /
    burning out on many systems.

    What's needed is just better quality motherboards, 6+ layer PCBs, more
    conservative design margins of capacitors / power regulators, thicker PCB
    power traces, etc.
    I agree with what you have said as well, synergy. However, we can't place all the blame on the mobo manufacturers for cost-cutting. In my opinion, most of the blame should be placed on the consumer.

    Why? Because in my experience the public at large is a bunch of cheap bas$%ds. People are ALWAYS complaining about how much something costs. They want the best quality and best performance but they don't want to pay for it. Need proof? Overclocking. Long-time AMD users are probably at fault too.

    Now you can't blame anyone for wanting to get the best price/performance ratio (who doesn't?). No one wants to pay more than they have to, or more than they feel is fair. But having worked in the high-performance aftermarket auto industry for many years (anyone ever heard of Neuspeed?), then PC and Mac computer hardware, and now having years of experience as a Realtor in California I have dealt with literally thousands of people. Some don't mind paying the premium that quality usually demands, but most will attempt to grind you down.

    Rant over.
    No, I am not Xtreme but I do have a sweet machine!

    Lian Li V1000 Plus II Silver w/Cougar CF-V12HPB Fan x 4 | Core i7 3930K | Thermalright True Spirit 120 w/2x Cougar CF-V12HPB Push-Pull | ASUS Sabertooth X79 | 4 x 8GB Patriot Viper 3 Intel Extreme Masters Limited Edition DDR3 1600 | MSI GTX670 Power Edition | 2x Raptor 300GB RAID 0 | LG BHLS20 Blu-ray | Corsair 850TX | Samsung SyncMaster 204B Silver | 64MB X-Fi PCIe| Logitech G15 Keyboard & G7 Cordless Mouse | Logitech Z-640 5.1 Speakers | APC Smart-UPS 1500 Black | Windows 7 Ultimate SP1
