what if the coldbug is only caused by the cpu's internal overheating protection malfunctioning at low temperatures? after all it's designed to work from 20 to 80 degrees celsius... maybe 0 to 150 degrees in an extreme case, but that's it... most engineers don't know that we use cpus at much lower temps, or they simply don't care.

there are several temp probes in a cpu, at least 1 per core plus one in the imc on newer amd cpus afaik. there is one probe per core that can be read externally, AND another probe which is read by the cpu itself! this probe cannot be read externally, at least not directly through a set of pins afaik. its purpose is to protect the cpu from overheating: as soon as it measures 125 degrees inside the cpu it triggers the THERMTRIP process, which forces the mainboard to shut down. it also shuts the cpu down and prevents it from doing any more work afaik, but i'm not sure about that.

how are the temperatures measured? afaik it's resistor based... as we all know, resistance changes the colder a material gets. but even the most ideal materials don't scale linearly the colder they get... so amd had to pick a resistor that works well in the range they need, 40 to 150 degrees i guess... the closer you get to the limits, the less accurate the reading gets. not only that, you have to read the temperature and interpret the values you read... you want to keep things simple and short, so they used a basic algorithm i guess...

so basically there are two probes in each core: one is a closed circuit read by the cpu itself, and the other one is read by the mainboard. but as far as i know they are virtually the same thing. the mainboard/bios temperature reading has proven to be very inaccurate, as most of you have probably experienced first hand. does anybody remember how mainboards/bioses read temperatures below 0 degrees as +300 degrees or even higher some years ago? how come? well, the algorithms used to measure the temperatures couldn't handle the readings they got, or simply weren't coded to calculate below-0 temperatures.

this resulted in mainboards not booting or shutting down back then, because they thought the cpu was overheating! it was fixed by ignoring the temperature or by using new algorithms that read below 0 more or less correctly and hence didn't shut down.
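
to make the +300 readings a bit more concrete, here is a tiny python sketch of one way it could happen. this is only a guess at the mechanism: the 9-bit register width and the function names are made up for illustration, and the 125 degree limit is just the number mentioned above. real boards may well have failed in a different way.

```python
# hypothetical: the sensor value is signed, but the bios code treats it as unsigned
REGISTER_BITS = 9        # assumed width, picked only for illustration
SHUTDOWN_LIMIT_C = 125   # the overheat limit mentioned earlier in this post

def misread_as_unsigned(temp_c, bits=REGISTER_BITS):
    """Return what code without sign handling would display for temp_c."""
    return temp_c % (1 << bits)

def would_shut_down(displayed_temp_c):
    """True if the displayed value crosses the overheat limit."""
    return displayed_temp_c >= SHUTDOWN_LIMIT_C

for real_temp in (30, 0, -20, -50):
    shown = misread_as_unsigned(real_temp)
    print(real_temp, "->", shown, "shutdown" if would_shut_down(shown) else "ok")
```

in this sketch anything above 0 is displayed correctly, while every reading below 0 wraps around to a large positive number and trips the limit.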

so, the cpu overheat protection mechanism is based on exactly the same idea... and strangely enough we are having almost exactly the same problem now... couldn't it be that it's the same problem causing the coldbug? the overheat protection being triggered somehow?

AMD NPT Family 0Fh Processor Electrical Data Sheet #31119

page 33:
Notes:
1. Thermal diode sourcing current should always be used in forward bias.
2. TOffset supports diode thermal monitor using two sourcing currents only. AMD does not support other sourcing current implementations.
3. TOffset is used to normalize the diode thermal monitor reading to the Tcontrol scale. TOffset should be added to the diode thermal monitor reading.
   Tcontrol = diode thermal monitor reading + TOffset.
   System thermal policy should ensure that Tcontrol_max is not exceeded. Tcontrol_max is specified in the appropriate power and thermal data sheet.
4. Thermal solutions should not be designed and validated using the thermal diode. Thermal solutions should be designed and validated against the case temperature specification per the methodology documented in the appropriate processor thermal design guide.
5. TOffset is unique for each processor and is programmed at the factory. TOffset value is found in the Thermtrip Status Register described in the BIOS and Kernel Developer’s Guide for AMD NPT Family 0Fh Processors, order# 32559.
6. If the diode thermal monitor has an ideality factor different from 1.008, a small correction to this offset is required. Contact your diode thermal monitor vendor to determine if additional correction is required.
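
notes 2 and 6 talk about two sourcing currents and an ideality factor, which is how diode thermal monitors usually work: the voltage difference between the two currents is proportional to absolute temperature. here is a rough python sketch of that textbook diode relation. the formula and the current ratio of 10 are my additions, not from the datasheet; only the 1.008 ideality factor comes from note 6.

```python
import math

K_BOLTZMANN = 1.380649e-23     # J/K
Q_ELECTRON = 1.602176634e-19   # C
KELVIN_OFFSET = 273.15

def delta_v(temp_c, current_ratio, ideality=1.008):
    """Voltage difference (V) between the two sourcing currents at temp_c."""
    kelvin = temp_c + KELVIN_OFFSET
    return ideality * K_BOLTZMANN * kelvin / Q_ELECTRON * math.log(current_ratio)

def diode_temp_c(dv, current_ratio, ideality=1.008):
    """Temperature (deg C) recovered from the measured voltage difference."""
    kelvin = dv * Q_ELECTRON / (ideality * K_BOLTZMANN * math.log(current_ratio))
    return kelvin - KELVIN_OFFSET

RATIO = 10.0                   # assumed ratio of the two sourcing currents
dv = delta_v(30.0, RATIO)      # what a diode at 30 degrees would produce

print(diode_temp_c(dv, RATIO))                # recovers roughly 30
print(diode_temp_c(dv, RATIO, ideality=1.0))  # wrong ideality factor reads a bit high
```

the second print shows why note 6 asks for a correction: if the monitor assumes a different ideality factor than the diode really has, the reading is off by a couple of degrees.
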

then you face another problem: you won't get the same resistance on all cpus... some will have a slightly higher and some a slightly lower resistance... so one will read 40 degrees when it's actually 30, and another one will only read 20... to compensate for that, the resistance is measured at the fab and an offset value is programmed into the cpu to adjust the reading of the thermal diode!
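
in python the correction from note 3 looks like this. the two offset values are made up to match the 40 and 20 degree example above, real TOffset values are programmed per cpu at the factory.

```python
def tcontrol(diode_reading_c, toffset_c):
    """Tcontrol = diode thermal monitor reading + TOffset (note 3 of the datasheet)."""
    return diode_reading_c + toffset_c

# two hypothetical cpus, both actually at 30 degrees
cpu_a = tcontrol(40.0, -10.0)  # diode reads 10 degrees too high
cpu_b = tcontrol(20.0, 10.0)   # diode reads 10 degrees too low

print(cpu_a, cpu_b)  # both corrected readings agree
```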

how about the internal diode used for the overheat protection though?
does it have an offset value as well? if we could reprogram this offset value we could prevent the cpu from shutting down thinking it's running too hot, and could improve the coldbug situation or maybe even get rid of it completely.
IF the coldbug really is caused by this!

i'm pretty sure that it's possible to reprogram the offset value to change the temperature at which the internal diode freaks out and shuts down the cpu to prevent a meltdown. i mean it's only logical, right? you mfg millions of cpus, and knowing the laws of physics and probability you KNOW there will be so and so many units that work flawlessly... BUT where the resistance used for the internal overheat protection mechanism is way off, so those cpus will shut themselves down at way too low or way too high temperatures.

so what do you do? ignore it and sell them anyways hoping they will never run too hot so they melt down or will run cool enough to not shut down even though the temp is still ok? or do you throw them away?

i suspect there is an offset value for the internal overheat protection probe as well... if we could find it and reprogram it with a microcode update we might be able to fix the coldbug, very promising imo