To be clear: they are talking about desktop samples of Bulldozer-architected cores in late 2010. There is no update on the server die with Bulldozer-architected cores. That part could be sampling a lot sooner, since they said 32nm CPUs would sample in early 2010, with production in H2 2010. The server version is called Sandtiger and has 8 Bulldozer cores.
The big boost with Bulldozer was always rumored to be some way for two physical cores to speed up the execution of a single thread, which is the reverse of hyperthreading. The latest speculation is that one core executes the most likely path at every branch, while the other takes the alternate path at the first branch and the likely path thereafter. If the first path turns out to be correct, the second core jumps to the first non-retired branch and begins the same speculation on the less likely path again. If the less likely path is correct, the roles swap: the first core jumps to the first non-retired branch of the second path and takes the less likely path there. The big boost comes from the single cycle it takes to copy the registers on the true path to the other core.
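To make the role swap concrete, here is a toy Python model of the speculated scheme. It is only a sketch under loud assumptions: branches resolve in program order, the trailing core has always started down the alternate direction of the oldest unresolved branch before it resolves (in reality it starts somewhat behind), and each misprediction is charged the one-cycle register copy from the speculation. The 20-cycle `flush_penalty` for a conventional core and the names `Branch` and `run_dual_path` are made up for illustration, not anything AMD has disclosed.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    predicted_taken: bool  # direction the predictor picks (the "likely" path)
    actually_taken: bool   # direction the branch really resolves to


def run_dual_path(branches, copy_cycles=1, flush_penalty=20):
    """Toy model of the speculated two-core, one-thread scheme.

    Hypothetical assumptions: in-order branch resolution, and the trailing
    core always covers the alternate direction of the oldest unresolved
    branch in time. Returns (leading core, role swaps, cycles lost by the
    dual-path pair, cycles a single conventional core would lose).
    """
    leader, trailer = "core0", "core1"
    swaps = 0
    lost_dual = 0
    lost_conventional = 0
    for b in branches:
        # Leader follows the predicted direction; trailer takes the other one.
        trailer_taken = not b.predicted_taken
        if b.actually_taken == b.predicted_taken:
            # Prediction right: the trailer's work is discarded and it restarts
            # at the next non-retired branch, off the critical path.
            continue
        # Prediction wrong: the trailer already holds the true path.
        assert trailer_taken == b.actually_taken
        # Roles swap; registers are copied so the old leader can jump to the
        # next non-retired branch on the new path and take the other side.
        leader, trailer = trailer, leader
        swaps += 1
        lost_dual += copy_cycles
        # A conventional core would flush its whole pipeline instead.
        lost_conventional += flush_penalty
    return leader, swaps, lost_dual, lost_conventional


trace = [
    Branch(True, True),
    Branch(True, False),   # mispredicted
    Branch(False, False),
    Branch(False, True),   # mispredicted
    Branch(True, True),
]
print(run_dual_path(trace))  # ('core0', 2, 2, 40)
```

Two mispredictions mean two role swaps, so the lead ends up back on core0, and the pair loses 2 cycles where a single core would have lost 40 under these made-up numbers.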
With a conditional branch occurring in x86 code on average once every 5 instructions, the speedup can be substantial. Of course, the best boost comes from code where many branches have two nearly equally likely paths. Given the long execution pipelines, that could boost IPC by a factor of two or more on such "nasty" code. The other effect is that the pipelines can be made longer, perhaps 50-100% longer. That would push the clock rate up by 20-40%, and between the two, the new core could be 50-100% quicker on the same code.
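A back-of-the-envelope CPI model shows why hiding the misprediction penalty matters so much on hard-to-predict code. Only the one-branch-per-5-instructions figure comes from the post; the 0.5 base CPI, the 30% misprediction rate and the 20-cycle flush penalty are hypothetical round numbers picked to illustrate the shape of the argument.

```python
def ipc(base_cpi, branch_freq, mispredict_rate, penalty_cycles):
    """Simple CPI model: base CPI plus the cycles lost to mispredicted branches."""
    cpi = base_cpi + branch_freq * mispredict_rate * penalty_cycles
    return 1.0 / cpi


# Hypothetical "nasty" code: 1 branch per 5 instructions (from the post),
# 30% of them mispredicted, base CPI of 0.5 (IPC 2 with no mispredictions).
BRANCH_FREQ = 1 / 5
MISPREDICT = 0.30
BASE_CPI = 0.5

conventional = ipc(BASE_CPI, BRANCH_FREQ, MISPREDICT, penalty_cycles=20)
dual_path = ipc(BASE_CPI, BRANCH_FREQ, MISPREDICT, penalty_cycles=1)
print(f"conventional IPC {conventional:.2f}, dual-path IPC {dual_path:.2f}, "
      f"gain {dual_path / conventional:.1f}x")
# conventional IPC 0.59, dual-path IPC 1.79, gain 3.0x
```

Making the pipeline longer raises `penalty_cycles` for a conventional core, which is exactly the cost the second core is supposed to hide; any clock-rate gain from the longer pipeline then multiplies on top of the IPC ratio.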
Or use the same clock rate, but far less power. Currently 2 speed bins span between 10 and 15% of clock, so 20-40% is 4 to 6 bins. Dropping 2 bins halves the needed power per core. Thus even with twice the cores needed, the overall power required for a given performance level could be 1/2 to 1/4 on clock alone. Add the IPC improvement, and for the same performance, power could be reduced by an order of magnitude.
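The clock-alone power arithmetic can be checked with a few lines of Python. This just encodes the post's rule of thumb that every 2 speed bins dropped halves per-core power (an assumption, not a measured voltage/frequency curve), and it counts two cores because the scheme spends two physical cores on one thread. The function name `relative_power` is made up for illustration.

```python
def relative_power(cores, bins_dropped, bins_per_halving=2):
    """Total power relative to one core at full clock.

    Assumes the rule of thumb that dropping `bins_per_halving` speed bins
    halves per-core power.
    """
    per_core = 0.5 ** (bins_dropped / bins_per_halving)
    return cores * per_core


# Twice the cores, each clocked 4 or 6 bins lower (the 20-40% clock headroom).
for bins in (4, 6):
    print(bins, relative_power(cores=2, bins_dropped=bins))
# 4 0.5
# 6 0.25
```

That reproduces the 1/2 to 1/4 figure; the order-of-magnitude claim additionally leans on the IPC gain, which this sketch does not model.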
So you could get either 1.5 to 2 times the single-thread performance at twice the power, or 1x performance at 1/10th the power. In one case, the AMD flagship (120W TDP) leapfrogs way beyond Core i7; in the other, you get flagship performance in an ultrathin mobile part (12W TDP); or some combination in between. Any way you look at it, that could turn the current situation on its ear. For Intel, that would be worse than the P4/RDRAM debacle.