Thread: coldbug

  1. #1
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147

    coldbug

    what if the coldbug is only caused by the cpu's internal overheating protection malfunctioning at low temperatures? after all it's designed to work from 20 to 80 degrees celsius... maybe 0 to 150 degrees in an extreme case, but that's it... most engineers don't know that we run cpus at much lower temps, or they simply don't care.

    there are several temp probes in a cpu, at least 1 per core plus one in the imc on newer amd cpus afaik. there is one probe per core that can be read externally, AND another probe which is read by the cpu itself! this probe cannot be read externally, at least not directly through a set of pins afaik. its purpose is to protect the cpu from overheating: as soon as it measures 125 degrees inside the cpu, it triggers the thermtrip process, which forces the mainboard to shut down. it also shuts the cpu down and prevents it from doing any more work afaik, but i'm not sure about that.

    how are the temperatures measured? afaik it's resistor based... as we all know, resistance changes the colder a material gets, but even the most ideal materials don't scale linearly the colder they get... so amd had to pick a resistor that works well in the range they need, 40 to 150 degrees i guess... the closer you get to the limits, the less accurate the reading gets. not only that, you have to read the temperature and interpret the values you read... you want to keep things simple and short, so they used a basic algorithm i guess...

    so basically there are two probes in each core: one is a closed circuit read by the cpu itself, and the other one is read by the mainboard. but as far as i know they are virtually the same thing. the mainboard/bios temperature reading has proven to be very inaccurate, as most of you have probably experienced first hand. does anybody remember how mainboards/bioses read temperatures below 0 degrees as +300 degrees or even higher some years ago? how come? well, the algorithms used to measure the temperatures couldn't handle the readings they got, or simply weren't coded to calculate below-0 temperatures.

    this resulted in mainboards not booting or shutting down back then, because they thought the cpu was overheating! it was fixed by ignoring the temperature or by using new algorithms that read below 0 more or less correctly and hence didn't shut down.
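
    One plausible way a sub-zero reading turns into "+300 or higher" is a signed register value being decoded as unsigned. The sketch below is only an illustration of that failure mode: the 9-bit register width, the 1-degree-per-count scaling and the function names are assumptions, not something taken from any real BIOS or sensor chip.

```python
def decode_unsigned(raw: int, bits: int) -> int:
    """Interpret the raw register bits as a plain unsigned number."""
    return raw & ((1 << bits) - 1)


def decode_signed(raw: int, bits: int) -> int:
    """Interpret the same bits as a two's-complement signed number."""
    value = raw & ((1 << bits) - 1)
    if value >= 1 << (bits - 1):  # top bit set means a negative value
        value -= 1 << bits
    return value


BITS = 9                          # hypothetical register width
raw = -20 & ((1 << BITS) - 1)     # what a sensor at -20 degrees would store

print(decode_unsigned(raw, BITS))  # naive decoder: a huge positive "temperature"
print(decode_signed(raw, BITS))    # sign-aware decoder: the real value
```

    With these assumed numbers the naive decoder reports 492 for a chip sitting at -20, which a shutdown policy would treat as overheating.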

    so, the cpu overheat protection mechanism is based on exactly the same idea... and strangely enough we are having almost exactly the same problem now... couldn't it be that it's the same problem causing the coldbug? the overheat protection being triggered somehow?

    AMD NPT Family 0Fh Processor Electrical Data Sheet #31119

    page 33:
    Notes:
    1.Thermal diode sourcing current should always be used in forward bias.
    2.TOffset supports diode thermal monitor using two sourcing currents only. AMD does not support other sourcing current
    implementations.
    3.TOffset is used to normalize the diode thermal monitor reading to the Tcontrol scale. TOffset should be added to the diode
    thermal monitor reading.
    Tcontrol = diode thermal monitor reading +TOffset .
    System thermal policy should ensure that Tcontrol_max is not exceeded. Tcontrol_max is specified in the appropriate
    power and thermal data sheet.

    4.Thermal solutions should not be designed and validated using the thermal diode. Thermal solutions should be designed
    and validated against the case temperature specification per the methodology documented in the appropriate processor
    thermal design guide.
    5.TOffset is unique for each processor and is programmed at the factory. TOffset value is found in the Thermtrip Status
    Register described in the BIOS and Kernel Developer’s Guide for AMD NPT Family 0Fh Processors, order# 32559.

    6.If the diode thermal monitor has an ideality factor different from 1.008, a small correction to this offset is required.
    Contact your diode thermal monitor vendor to determine if additional correction is required.

    then you face another problem: you won't be able to get the same resistance for all cpus... some will have a slightly higher and some a slightly lower resistance... so one will read 40 degrees when it's actually 30, and another one will only read 20... to compensate for that, the resistance is measured at the fab and an offset value is programmed into the cpu to adjust the reading of the thermal diode!
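
    The formula from note 3 of the data sheet, Tcontrol = diode thermal monitor reading + TOffset, can be sketched in a few lines. The two chips and their numbers below are made up to match the 40/30/20 example above; real TOffset values come from the Thermtrip Status Register and are not known here.

```python
def tcontrol(diode_reading_c: int, toffset_c: int) -> int:
    """Normalize a raw diode reading to the Tcontrol scale (data sheet note 3)."""
    return diode_reading_c + toffset_c


# Hypothetical chips, both really sitting at 30 degrees.
chip_a = {"diode_reading_c": 40, "toffset_c": -10}  # diode reads high
chip_b = {"diode_reading_c": 20, "toffset_c": 10}   # diode reads low

print(tcontrol(**chip_a))
print(tcontrol(**chip_b))
```

    Both hypothetical chips end up reporting the same normalized value, which is the whole point of a factory-programmed per-chip offset.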

    how about the internal diode used for the overheat protection though?
    does it have an offset value as well? if we could reprogram this offset value, we could prevent the cpu from shutting down thinking it's running too hot, and could improve the coldbug situation or maybe even get rid of it completely.
    IF the coldbug really is caused by this!

    i'm pretty sure that it's possible to reprogram the offset value to change the temperature at which the internal diode freaks out and shuts down the cpu to prevent a meltdown. i mean it's only logical, right? you mfg millions of cpus; knowing the laws of physics and probability you KNOW there will be so and so many units that work flawlessly... BUT the resistance used for the internal overheat protection mechanism is way off, and those cpus will shut themselves down at way too low or way too high temperatures.

    so what do you do? ignore it and sell them anyways hoping they will never run too hot so they melt down or will run cool enough to not shut down even though the temp is still ok? or do you throw them away?

    i suspect there is an offset value for the internal overheat protection probe as well... if we could find it and reprogram it with a microcode update, we might be able to fix the coldbug. very promising imo

  2. #2
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    hmmmmm what if the same offset value is used for both temp probes?
    is it possible to reprogram that toffset value using the msr walker from crystalcpuid?

    #32559 Rev 3.08 July 2007 BIOS and Kernel Developer’s Guide for AMD NPT Family 0Fh
    Processors


    Chapter 3 Functional Description 39
    3.7.2.2 Hardware Thermal Control
    Hardware Thermal Control (HTC) is a hardware mechanism used to reduce processor power
    consumption. The processor enters HTC active state when the die temperature is greater than the
    HTC temperature limit or when PROCHOT_L is asserted. While it is in HTC active state, processor
    core clock grid (GCLK) is periodically ramped down for a specified period of time, and consequently
    processor performance is reduced. Deterministic power savings are accomplished by unconditional
    GCLK ramping down. After the temperature drops below the HTC temperature limit (plus a specified
    number of degrees of hysteresis), the processor automatically exits the HTC-active state if
    PROCHOT_L is not asserted.
    The processor can enter the HTC-active state only when PWROK and
    RESET_L are both high. The processor can not enter the HTC-active state while in the C3 ACPI
    state. Special bus cycles can be generated when HTC active state is entered or exited. The
    Northbridge clock grid (BCLK) is not ramped down when HTC is active.
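
    The HTC enter/exit behaviour quoted above can be sketched as a small state machine. This is a simplified model under stated assumptions: the limit of 70 and hysteresis of 5 are invented numbers, exit is taken to happen once the temperature falls below limit minus hysteresis (the usual meaning of hysteresis, not confirmed by the quote), and PROCHOT_L, PWROK, RESET_L and the C3 state are ignored.

```python
def htc_states(temps_c, limit_c, hysteresis_c):
    """Return the HTC-active flag after each temperature sample."""
    active = False
    states = []
    for temp in temps_c:
        if not active and temp > limit_c:
            active = True    # die temperature exceeded the HTC limit
        elif active and temp < limit_c - hysteresis_c:
            active = False   # cooled past the hysteresis band
        states.append(active)
    return states


samples = [60, 71, 69, 66, 64, 72]  # hypothetical die temperatures
print(htc_states(samples, limit_c=70, hysteresis_c=5))
```

    In this reading, hysteresis is only a gap between the enter and exit thresholds that stops the throttle from flapping; the quoted text does not say whether a per-chip calibration offset is involved as well.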
    "plus a specified number of degrees of hysteresis" sounds to me like there IS an offset for the hardware overheat protection mechanism

  3. #3
    Xtreme Cruncher
    Join Date
    Mar 2005
    Location
    venezuela caracas
    Posts
    6,460
    you go buy intel and forget about cold bug low ocs and also you get more performance ? and this is coming from someone that loved amd
    Incoming new computer after 5 long years

    YOU want to FIGHT CANCER OR AIDS join us at WCG and help to have a better FUTURE

  4. #4
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    damn! its read only

    TOFFSET (thermal diode temperature offset) is supplied as a read only field in the Thermtrip Status Register.
    maybe theres a way to tell the cpu to STFU when it whines that its overheating?

    hmmmm

    Chapter 3 Functional Description 39
    3.7.2.2 Hardware Thermal Control
    Hardware Thermal Control (HTC) is a hardware mechanism used to reduce processor power consumption. The processor enters HTC active state when the die temperature is greater than the HTC temperature limit or when PROCHOT_L is asserted. While it is in HTC active state, processor core clock grid (GCLK) is periodically ramped down for a specified period of time, and consequently processor performance is reduced. Deterministic power savings are accomplished by unconditional GCLK ramping down. After the temperature drops below the HTC temperature limit (plus a specified number of degrees of hysteresis), the processor automatically exits the HTC-active state if PROCHOT_L is not asserted. The processor can enter the HTC-active state only when PWROK and RESET_L are both high. The processor can not enter the HTC-active state while in the C3 ACPI state. Special bus cycles can be generated when HTC active state is entered or exited. The Northbridge clock grid (BCLK) is not ramped down when HTC is active.

    3.7.2.3 PROCHOT_L
    The PROCHOT_L pin is asserted as an input to force the processor into the HTC-active state or becomes asserted as an output to indicate when the die temperature is greater than the HTC temperature limit.
    As an input to the processor, PROCHOT_L is ignored if THERMTRIP_L, PWROK or RESET_L is low or if HtcEn (Function 3 Offset 64h) is zero. As an output, it is low to reflect the HTC-active state. PROCHOT_L will be recognized when its state has been stable high or low for greater than 10ns, however to get any effective power reduction from PROCHOT_L assertion, the assertion time must allow for several throttling periods to elapse. Each throttling period is approximately 6us. See “BIOS Requirements for PROCHOT_L Throttling” on page 286 for how to enable PROCHOT_L based throttling.

    3.7.2.4 THERMTRIP_L
    The processor provides a hardware-enforced thermal protection mechanism. When the processor’s die temperature exceeds a specified temperature, the processor is designed to stop its internal clocks and assert the THERMTRIP_L output. THERMTRIP_L assertion is only valid when PWROK is asserted and RESET_L is deasserted. THERMTRIP_L assertion indicates the processor die temperature has exceeded normal operating parameters. PWROK must be deasserted in response to a THERMTRIP_L assertion to enable proper processor operation. Once asserted THERMTRIP_L remains asserted until RESET_L is asserted. If the processor’s die temperature still exceeds the thermal trip point when RESET_L is deasserted, THERMTRIP_L will immediately be reasserted and the processor’s internal clocks stop.
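
    The latching behaviour described in 3.7.2.4 can be modelled roughly like this. It is a toy model, not the real logic: the class name and methods are invented, PWROK handling is left out, and the 125-degree trip point is just the figure guessed earlier in this thread.

```python
class ThermtripModel:
    """Toy model of the THERMTRIP_L latch described in section 3.7.2.4."""

    def __init__(self, trip_point_c):
        self.trip_point_c = trip_point_c
        self.thermtrip_asserted = False
        self.clocks_running = True

    def sample(self, die_temp_c):
        # Once tripped, the latch stays set even if the die cools down.
        if die_temp_c > self.trip_point_c:
            self.thermtrip_asserted = True
            self.clocks_running = False

    def reset(self, die_temp_c):
        # RESET_L clears the latch; the temperature is checked again as
        # soon as reset is released, so a still-hot die trips immediately.
        self.thermtrip_asserted = False
        self.clocks_running = True
        self.sample(die_temp_c)


cpu = ThermtripModel(trip_point_c=125)  # hypothetical trip point
cpu.sample(130)
cpu.sample(90)
print(cpu.thermtrip_asserted, cpu.clocks_running)
```

    If the internal sensor misread a very cold die as a very hot one, a model like this would trip again straight after every reset, which is what a no-POST coldbug would look like from the outside.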
    Quote Originally Posted by [XC] leviathan18 View Post
    you go buy intel and forget about cold bug low ocs and also you get more performance ? and this is coming from someone that loved amd
    intel has coldbugs as well, and 45nm coldbugs are getting worse and worse. i've already heard some 45nm quads don't like to run below -60 degrees... and the newer the batch, the worse the coldbug afaik

  5. #5
    Wanna look under my kilt?
    Join Date
    Jun 2005
    Location
    Glasgow-ish U.K.
    Posts
    4,396
    Quote Originally Posted by [XC] leviathan18 View Post
    you go buy intel and forget about cold bug low ocs and also you get more performance ? and this is coming from someone that loved amd
    Not the point. We want understanding as well as great bench scores.

    Trying AMD under LN2 is still extreme
    Quote Originally Posted by T_M View Post
    Not sure i totally follow anything you said, but regardless of that you helped me come up with a very good idea....
    Quote Originally Posted by soundood View Post
    you sigged that?

    why?
    ______

    Sometimes, it's not your time. Sometimes, you have to make it your time. Sometimes, it can ONLY be your time.

  6. #6
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    well, f3/g1 steppings can hit pretty high clocks on air and water. i think if we could get rid of the coldbug, 4.5ghz would be possible with x2's. yeah, intel cpus still clock and perform better, but anybody calling an x2 at 4.5ghz slow is exaggerating quite a lot

  7. #7
    Xtreme Mentor
    Join Date
    May 2007
    Posts
    2,792
    Saaya, by the time I refreshed you had presented, analyzed, evaluated and already concluded all your original post theory.

    We'll see whats possible. Gotta read up more.

  8. #8
    Wanna Pull My Finger?
    Join Date
    Sep 2007
    Location
    Oklahoma
    Posts
    3,648
    I'm wondering if the black editions temp sensor bit is read only or if they unlocked that also...............

    Now Showing:
    Gigabyte x48-DQ6, Q6600,OCZ 1066 Reapers,2 750gb seagate 7200.11 hd, BFG 8800GTS 512,PC P&C 750 Quad psu, 24" Sceptre lcd, Antec 900

    my wife's system now!
    Intel C2D 6400, Zotac Matx mobo, 1gb kingston mem, Nvidia 7050, I Feel really Good now!
    Jon C2D 6600 Zotac mobo 1gb mem............................................... ................. HTPC qx6700@3.0ghz
    Annabelle Amd 3800+@2.4ghz, Biostar mobo, 1gb ocz pc4500 beta's................. Optyx2 opty165@ 2.1 ghz

    'Want a real high?
    Come crunch WCG and you'll feel like your on QuadCaine"



    First loops are like first sex, all hands and thumbs till you figure out what goes where, then it's what ever works best for you.

  9. #9
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by KTE View Post
    Saaya, by the time I refreshed you had presented, analyzed, evaluated and already concluded all your original post theory.

    We'll see whats possible. Gotta read up more.
    hehehe, yeah i should have finished reading the docs before writing a post
    too bad i don't have a k10 setup. i was offered a k10 cpu 3 times and refused, then i was offered a board but only if i had a cpu, so i refused

    Quote Originally Posted by fart_plume View Post
    I'm wondering if the black editions temp sensor bit is read only or if they unlocked that also...............
    i'm afraid not
    too bad actually... i mean, for amd, making that msr writable or disabling the internal overtemp protection should be really easy... and people who buy an fx want to run it with ln2, so... they really should do something like that...

  10. #10
    c[_]
    Join Date
    Nov 2002
    Location
    Alberta, Canada
    Posts
    18,728
    In the case of K8 I don't think the problem is temperature sensor related, as you can increase temperature from the point of no POST to the point where the memory controller stops scaling in clock speed and see a significant difference.

    For instance if you have a no-post at -20c you might be limited to 240mhz at -10c and be able to hit 350mhz at +15c...

    For CPUs where the cold bug limits their clock speed rather than FSB speed, I think you are right on this issue.

    All along the watchtower the watchmen watch the eternal return.

  11. #11
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    well i discussed this with fredyama and he said he tried several fx cpus and while most of them would only run at -20, one of them could run at down to -50 degrees... hmmmm i need some dry ice or ln2 so i can do some research

  12. #12
    c[_]
    Join Date
    Nov 2002
    Location
    Alberta, Canada
    Posts
    18,728
    just check the "I found the golden one" thread in the AMD section for all the proof you need

    All along the watchtower the watchmen watch the eternal return.

  13. #13
    Xtreme Mentor
    Join Date
    May 2007
    Posts
    2,792
    There's many who've been subzero with K8 now past -25℃. I mean, look at this 2005 Prometeia review with K8 chip: http://www.pcstats.com/articleview.c...d=1793&page=12

    But you'll have variance everywhere naturally. It seems some chips are very badly CB'd while others are not so until around -60C (idle). In either case, the major problem is scaling with voltage and why you don't see high frequencies regardless of cooling. The voltage scaling limit is very low on too many K8/K10 chips, heck, you can at many times hit the limit on standard air.

  14. #14
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by STEvil View Post
    just check the "I found the golden one" thread in the AMD section for all the proof you need
    hmmm that thread is huge... mind pointing out a particular site to me, dont really want to read through the past 20 pages that came up after i last read it

    kte, so you think it does have to do with the mfg process after all?

  15. #15
    c[_]
    Join Date
    Nov 2002
    Location
    Alberta, Canada
    Posts
    18,728

    All along the watchtower the watchmen watch the eternal return.

  16. #16
    Brilliant Idiot
    Join Date
    Jan 2005
    Location
    Hell on Earth
    Posts
    11,015
    saaya you appear to really want to figure this out........fwiw i still have this chip and board......although i would suggest a diff board for better memory compatibility....unless sapphire fixed the bios which i highly doubt.....anyway I would be willing to lend them out, however they are not for sale as you can see this chip didn't seem to be bothered in the least......at -31C

    edit it appears my webspace is borked atm......it's an LDBHE x2 3800, best one i had......and manta prototype......3.5g @ -31C DI run.........

    meh check the image out in the I'm Back!!!!!!!!!!! thread in intel the image is in that thread.....
    Last edited by chew*; 01-17-2008 at 11:29 PM.
    heatware chew*
    I've got no strings to hold me down.
    To make me fret, or make me frown.
    I had strings but now I'm free.
    There are no strings on me

  17. #17
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    heh, yeah im on a prototype board as well and the bios SUXXXXXXXX
    is that a 939 cpu or am2? manta was am2 right?

    well am2 is damn cheap now, ill go ahead and buy a couple of am2 cpus and an am2 board i think. just dont have the time this month, but february might work...

    the thing is, if this is mfg tech related, all cpus and gpus will have this issue sooner or later.
    if it's imc related, then intel will have this issue as well soon.
    if it's temp probe related, intel might have this issue as well soon.

    so either way, let's try to find workarounds or ways to improve coldbugs now and not wait until all new hardware is coldbugged and we don't know what to do about it!

    stevil thanks, @work atm, will have a look at this later when im home.

  18. #18
    OCTeamDenmark Founder Nosfer@tu's Avatar
    Join Date
    May 2004
    Location
    Denmark, Copenhagen
    Posts
    2,335
    Just to make sure: am I to understand this as AMD setting the coldbug on purpose? Now THAT WOULD BE STUPID :p
    Former owner of OCTeamDenmark.com
    MSI MOTHERBOARD!!!!!!



  19. #19
    Xtreme Mentor
    Join Date
    May 2007
    Posts
    2,792
    Quote Originally Posted by saaya View Post
    hmmm that thread is huge... mind pointing out a particular site to me, dont really want to read through the past 20 pages that came up after i last read it
    It's a funny thread.

    Basically most chips have a CB, "some" don't have one down to -30C, and it's extremely rare for anything to go beneath that. Now if they'd just bought any off the shelf and said "no CB" then I might have thought that was valid, but cherry picking one out of 1000s for no CB to sub -5C isn't exactly any good at showing CPU SKUs without a CB.

    I mean this, this, this, this and all these in green are mainly close to or subzero (depending on which temp sensor you look at).

    I'm sure guys there like fredyama went through tons of chips to find one without CB to get those MHz/cooling and even still, it was only the QX sorta FX chips or one or two odd Opties and X2. At 65nm G2, 5000+ BE seems to love cold to -20-15C IHS region though.

    Problem with those A64/AM2s was, they were sensitive to heat/voltage and thus could usually scale better under water/DICE than air. That doesn't look like Phenom, since 1.2V gets you the same as 1.7V while temps are kept sub 40C, i.e. design/process/material problems. nFET leakage and performance is the hardest thing to enhance in CPUs; C2/Penryn chips are very improved and mainly the best in this department, and yet K10 is 40% worse than C2 in this leakage/perf alone. This'll decide MHz/volt/heat/power level to a major degree, as well as the current you can feed through the chips (K10 limited to 110A total cores/20A IMC).

    kte, so you think it does have to do with the mfg process after all?
    Can never be sure, but that's what the signs around now show. SOI.

    I mean, my K10 sensors stopped working since I tried for a CB. -14/15C IHS was fine but it read 0C on all cores (tcase) and 14C hottest internal (tjunction).



    Since then all my cores are stuck at 0C, but the main one I need (tjunction) still reads right. However, even at those temps it made no difference to oc, so no point. If CB existed because of thermal protection circuitry monitoring these sensors' values within the cores, it would have shut down my system the minute I dropped temps on it.

  20. #20
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by Nosfer@tu View Post
    Just to make sure: am I to understand this as AMD setting the coldbug on purpose? Now THAT WOULD BE STUPID :p
    no, at least not intentionally.

    kte... intel apparently limits the current that can go through their chips, do you know if amd does the same? cause the lower the temp the higher the current right? maybe it triggers some sort of an overcurrent protection?

    i still dont think its mfg process...
    its design or temp monitoring, im sure...

  21. #21
    Xtreme Mentor
    Join Date
    May 2007
    Posts
    2,792
    Yep, AMD does the same with current limits. 110A for K10 AM2+ and AM3 cores and 20A to the IMC. This is the peak limit.

    Not sure if any OCP trigger is activated which causes CB... if it did, why would it occur on some and not other CPUs?
    And why would it occur at a low current?
    It depends on what the values for VID registers are, different P-States enable a different current to the core and IMC AFAIK.

  22. #22
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by KTE View Post
    Yep, AMD does the same with current limits. 110A for K10 AM2+ and AM3 cores and 20A to the IMC. This is the peak limit.

    Not sure if any OCP trigger is activated which causes CB... if it did, why would it occur on some and not other CPUs?
    And why would it occur at a low current?
    It depends on what the values for VID registers are, different P-States enable a different current to the core and IMC AFAIK.
    every cpu draws a slightly different amount of current...
    you think the current is actively changed when the cpu shifts pstates?
    i always thought the current cant be controlled...

  23. #23
    Brilliant Idiot
    Join Date
    Jan 2005
    Location
    Hell on Earth
    Posts
    11,015
    nah its 939 saaya......and i had 3 chips no coldbug.....all were LDBHE, although i did cherry pick these for that exact code....I wasn't picky once i found them with that code......

    The one i still have was the best of them all....I lost my 3.0 gig 24/7 stable chip on grouper, wife shut the pc off and not the mach 1..........needless to say when she turned it back on the next day there was a fire in the PWM area and the chip screamed ohhhhhhhnoooooooooeeess.......

    As you can see at -31c it was still cranking.......and cold boot temps were colder.......problem was my homemade DI tube just couldn't get any colder....http://www.xtremesystems.org/forums/...51&postcount=4
    Last edited by chew*; 01-18-2008 at 02:07 PM.
    heatware chew*
    I've got no strings to hold me down.
    To make me fret, or make me frown.
    I had strings but now I'm free.
    There are no strings on me

  24. #24
    Xtreme Mentor
    Join Date
    May 2007
    Posts
    2,792
    Quote Originally Posted by saaya View Post
    every cpu draws a slightly different amount of current...
    you think the current is actively changed when the cpu shifts pstates?
    i always thought the current cant be controlled...
    True.
    Locked usually but for the unlocked Phenom BE models. 9500/9600 at 1.25VID draws 18.9A per core, lower is even lower and if you increase VID (P-States) you'll see a massive jump at the same VCore from 105W idling to 330W idling at the same MHz.
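
    As a rough sanity check on those numbers, core power is just voltage times current. The sketch below assumes four cores (9500/9600 are quad-core) and that VCore equals the 1.25V VID, neither of which is measured here; IMC and board losses are ignored.

```python
def core_power_w(vcore_v: float, amps_per_core: float, cores: int) -> float:
    """Total core power in watts from P = V * I, summed over all cores."""
    return vcore_v * amps_per_core * cores


# Figures from the post above; the core count is an assumption.
print(round(core_power_w(1.25, 18.9, 4), 1))
```

    That works out to roughly 94.5W for the cores alone under these assumptions.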

  25. #25
    Xtreme Cruncher
    Join Date
    May 2007
    Location
    In a hell hole called Sac
    Posts
    1,754
    Quote Originally Posted by chew* View Post
    nah its 939 saaya......and i had 3 chips no coldbug.....all were LDBHE, although i did cherry pick these for that exact code....I wasn't picky once i found them with that code......

    The one i still have was the best of them all....I lost my 3.0 gig 24/7 stable chip on grouper, wife shut the pc off and not the mach 1..........needless to say when she turned it back on the next day there was a fire in the PWM area and the chip screamed ohhhhhhhnoooooooooeeess.......

    As you can see at -31c it was still cranking.......and cold boot temps were colder.......problem was my homemade DI tube just couldn't get any colder....http://www.xtremesystems.org/forums/...51&postcount=4
    Sounds like an honest mistake. Hope you didn't yell at her.
    Quote Originally Posted by [XC] Kayin View Post
    Should the RIAA ever target me, I will immediately forfeit US citizenship and move back to reservation, which has no extradition policy and would probably tell Whitey to get bent or we'll scalp you and take your women...
    Free Omastar!

