More details on Bulldozer's multi-threading and single thread execution
by Dresdenboy @ 2009-07-07 - 11:36:01 am
Unfortunately I had neither enough time nor enough details (some things had to be guessed) to create the promised architecture diagram. However, the missing details can now be found in newly published patent applications. I think that will help me get back to the task. But for now I switch to another topic: will Bulldozer have SMT or not?
AMD's John Fruehe recently said in an AMDZone forum thread that AMD will not do SMT in the next years. That could be understood to mean that the architecture revealed here will not be able to execute more than one thread per core. However, this is not a given, because no such statement has been made. So far, John has only said that AMD would not implement SMT. In my eyes it was a smart move to mention SMT - just to be able to deny it. However, this is still speculation.
Instead, we saw the term "cluster-based multi-threading" (also known as clustered multi-threading, CMT) years ago in an AMD presentation. If you look at Chuck Moore's slide below, you see that SMT is the least attractive multi-threading variant to AMD. So far they have been working in the CMP part of this diagram, and it seems logical to move from there to the much greener CMT area - even more so since they explicitly state an 80% throughput gain for a 50% area investment. They already had this view four years ago, with the first patents covering the new architecture being filed just two years later. If Bulldozer was meant to be ready for 2009 or 2010, these time frames seem OK to me. And even the four-year gap from the patent filing dates to 2011 fits well with what we know from older architectures.
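The area-versus-throughput argument can be checked with a little arithmetic. The sketch below assumes a single conventional core as the normalized baseline and an idealized CMP step that doubles both area and throughput. Only the 50% area and 80% throughput figures come from the slide; the function name, the baseline and the ideal CMP scaling are assumptions made for illustration.

```python
# Baseline: one conventional core, normalized (assumption for illustration).
BASE_AREA = 1.0
BASE_THROUGHPUT = 1.0


def throughput_per_area(extra_area, extra_throughput):
    """Throughput per unit of die area after adding resources for a second thread."""
    area = BASE_AREA + extra_area
    throughput = BASE_THROUGHPUT + extra_throughput
    return throughput / area


# CMT figures as quoted from the slide: +50% area, +80% throughput.
cmt = throughput_per_area(0.5, 0.8)
# Idealized CMP (assumption): a second full core, +100% area, +100% throughput.
cmp_ideal = throughput_per_area(1.0, 1.0)

print(f"CMT: {cmt:.2f}, ideal CMP: {cmp_ideal:.2f}")
```

Under these assumptions CMT comes out at about 1.2 units of throughput per unit of area against 1.0 for a second full core, which is roughly what makes the CMT region of the slide look attractive.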
So we find the new arch again in:
20090164758 - System and method for performing locked operations
20090172359 - Processing pipeline having parallel dispatch and method thereof
20090172362 - Processing pipeline having stage-specific thread selection and method thereof
20090172370 - Eager execution in a processing pipeline having multiple integer execution units
Most of these patent applications now give much more detail on how the threads are executed and the like. Most of it fits well with what Hans de Vries already described in his detailed post on aceshardware.
These patent applications describe ways to execute a single thread on both clusters. This could be done by having one thread run ahead to issue early memory prefetches, or by executing both paths of a branch in parallel and scrapping the wrong path after branch resolution. A different variant is the parallel execution of the same code to gain reliability by comparing the results afterwards.
Some of the mentioned patent applications also state that the 4-way decoders could decode more than 4 instructions per cycle if there are both a microcoded and a fastpath instruction (of different threads) in one decoding path.
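To make two of these single-thread modes more concrete, here is a minimal software analogy, not a model of the actual hardware. Two worker threads stand in for the two integer clusters, and the function names `eager_execute` and `redundant_execute` are hypothetical. The sketch assumes that both branch paths are free of side effects, so the losing result can simply be dropped.

```python
from concurrent.futures import ThreadPoolExecutor


def eager_execute(resolve_branch, taken_path, not_taken_path):
    """Run both branch paths speculatively and keep only the correct one."""
    # Two workers stand in for the two integer clusters (an assumption).
    with ThreadPoolExecutor(max_workers=2) as clusters:
        taken = clusters.submit(taken_path)
        not_taken = clusters.submit(not_taken_path)
        # The branch is resolved while both paths are already in flight.
        branch_taken = resolve_branch()
        winner = taken if branch_taken else not_taken
        # The result of the wrong path is never read, i.e. it is scrapped.
        return winner.result()


def redundant_execute(work):
    """Run the same code on both clusters and compare the results."""
    with ThreadPoolExecutor(max_workers=2) as clusters:
        first = clusters.submit(work)
        second = clusters.submit(work)
        a, b = first.result(), second.result()
    if a != b:
        raise RuntimeError("cluster results differ")
    return a


if __name__ == "__main__":
    print(eager_execute(lambda: True, lambda: "taken", lambda: "not taken"))
    print(redundant_execute(lambda: 6 * 7))
```

In this analogy the eager variant spends the second cluster on work that is thrown away half of the time in exchange for not waiting on the branch, while the redundant variant spends it on a second copy of the same work to catch mismatches.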
Another interesting and related topic is the way future general-purpose and graphics processing units could be combined. This is covered in the following patent applications:
20090164726 - Programmable address processor for graphics applications
20090160863 - Unified processor architecture for graphics and general processing workload