Quote: Originally Posted by kl0012
Who can explain to me what is so exciting about this home-made scheme? (Let's put aside possible errors in the scheme.) What level of performance is expected, and what is that based on? And what is the point of predicting future hardware based on some patents? Intel alone gets 60-70 patents a month. Imagine what an Intel CPU would look like with all of Intel's patents implemented at once.
BTW, Intel's patent about clustered multithreading:
http://www.patents.com/Multithreaded...7478198/en-US/
Two things you should do are read Dresdenboy's blog and look at that diagram carefully. He has sorted through AMD's patents to find the ones that point to an actual upcoming uarch. He is also mindful of how long it takes for a filed patent to show up in a part, which is around 4 years. He's been analyzing patents from 2007 to now, so we'll possibly be seeing a lot of the ones he highlights in his blog between 2011 and 2013.

The really exciting part is that AMD has been coming out with patents concerning per-core multithreading since the 1990s. It is finally looking like we will see it in the next processor revision, in a more efficient form than Intel's SMT. AMD said they set this goal in 2005 (when they had cash money), so it looks like they actually did use their time at the top to go forward with a radically new uarch. A development time frame of 6 years (2005 to 2011) fits that theory. Concerning the CPU diagram he made, this chip could be 4 threads per core!

A few other interesting tidbits are these:
1) From Dresdenboy's blog:
There is also a new inventor name: Nhon Quach. Among other companies, he previously worked for Intel on Itanium's RAS and system architecture. During that time he also worked on a reliable architecture with two cores which don't share resources (see his patent no. 6,615,366, with a typo in the abstract BTW). So now he is at AMD doing similar stuff, which fits nicely with the clustered architecture.
2) Hans De Vries, who analyzed some patents at aceshardware, came up with these details about how CMT may work with this core configuration:
Bulldozer's clustered multiprocessor architecture

I've always interpreted AMD's clustered multiprocessing, which they
claimed as adding 80% performance with 50% extra transistors, as
something like the following:

A 2-way superscalar processor can reach 80%-100% of the performance
of a 3-way for lots of applications. Only a subset of programs really
benefits from going to a 3-way. A still smaller subset benefits from going
to a 4-way superscalar.

Now, if you still want to have the benefits of a 4-way core but also
want to have the much higher efficiency of the 2-way cores then you
can do as follows:

Design a 4-way processor which has a pipeline which can be split
up into two independent 2-way pipes. In this case both threads have
their own set of resources without interfering with each other.

Part of the pipeline would not be split. Wide instruction decoding would
be alternating for both threads.

The split would be beneficial however for the integer units and the
read/write access units to the L1 data cache. The total 4-way core
could have more read/write ports which should certainly improve
IPC for a substantial subset.

The 128 bit SSE/FP units could be modified partly in connection
with the read/write ports. There was some improvement but not
that much when AMD almost doubled the SSE2/FP hardware going
from 64 bit units in K8 to 128 bit units in the K10.

There is lots of efficiency to be gained by using two K8 like SSE/FP
which can operate independently in 2-way mode and which can operate
together as a single 128 bit unit in 4-way mode. Other similar tricks
can be beneficial as well.

Part of the higher IPC of Itanium is due to its multiple read/write
ports to cache and its 64-bit FP units which can work independently
instead of in a "dumb" 2x64 way mode. The two independent FP units
of the Itanium can be fed directly from cache due to all these read
ports (and they can write directly to cache as well)

Something like this is what you would gain in the 4-way mode while
the 2-way modes bring the efficiency in throughput computing.


Regards, Hans
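
To make Hans's 2-way vs. 4-way argument a bit more concrete, here is a rough toy model in Python. It is only a sketch under my own assumptions: each thread is reduced to a single hypothetical "available ILP" number, and IPC is simply the smaller of that number and the issue width of the pipe the thread runs on. The function names and the ILP values are made up for illustration and do not come from AMD, the patents, or Hans's post.

```python
WIDE_WIDTH = 4   # issue width of the unsplit core, per the description above
SPLIT_WIDTH = 2  # issue width of each half once the pipeline is split


def ipc(thread_ilp, issue_width):
    """IPC of one thread in this toy model: capped by its own ILP or by the pipe width."""
    return min(thread_ilp, issue_width)


def wide_mode(thread_ilp):
    """One thread owns the whole 4-way pipeline."""
    return ipc(thread_ilp, WIDE_WIDTH)


def split_mode(ilp_a, ilp_b):
    """Two threads run side by side, each on its own independent 2-way pipe."""
    return ipc(ilp_a, SPLIT_WIDTH) + ipc(ilp_b, SPLIT_WIDTH)


# Hypothetical workloads: "typical" code with ILP 2, and rare code with ILP 4.
typical, parallel = 2, 4

print("one typical thread, wide mode:  ", wide_mode(typical))
print("two typical threads, split mode:", split_mode(typical, typical))
print("one parallel thread, wide mode: ", wide_mode(parallel))
```

In this model a typical thread gains nothing from the extra width, so splitting the core doubles total throughput for two such threads, while the rare high-ILP thread still gets all four slots in wide mode. Real pipelines have shared decode, cache ports, and stalls that this ignores.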
2-way to 4-way processing = Clustered Multi-threading, possibly. I think Sun's Niagara from 2005 (4 years ago!) uses a similar scheme, as it runs 4 threads per core. The Bulldozer uarch could simply be an x86 version of Sun's tech, just like K7 was very similar to DEC's Alpha.
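
For what it's worth, the "80% performance with 50% extra transistors" claim quoted above is easy to put into numbers. This is just back-of-the-envelope arithmetic in Python; the comparison against a full second core with perfect scaling is my own idealized assumption, not a figure from AMD.

```python
def throughput_per_transistor(perf_gain, transistor_cost):
    """Relative throughput per transistor versus the baseline single core (1.0)."""
    return (1 + perf_gain) / (1 + transistor_cost)


# AMD's claimed figures: +80% performance for +50% transistors.
cmt = throughput_per_transistor(0.80, 0.50)

# Assumed idealized reference: a full second core, +100% performance for +100% transistors.
second_core = throughput_per_transistor(1.00, 1.00)

print(f"CMT:         {cmt:.2f}x throughput per transistor")
print(f"Second core: {second_core:.2f}x throughput per transistor")
```

So if the claim holds, the clustered design would deliver about 1.2 times the throughput per transistor of the baseline core, while even a perfectly scaling second core only breaks even at 1.0.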