Bulldozer's clustered multiprocessor architecture
I've always interpreted AMD's clustered multiprocessing, which they
claim adds 80% performance for 50% extra transistors, as
something like the following:
A 2-way superscalar processor can reach 80%-100% of the performance
of a 3-way for lots of applications. Only a subset of programs really
benefits from going to a 3-way. A still smaller subset benefits from going
to a 4-way superscalar.
Now, if you still want to have the benefits of a 4-way core but also
want the much higher efficiency of the 2-way cores, then you
can do as follows:
Design a 4-way processor whose pipeline can be split
up into two independent 2-way pipes. In this case both threads have
their own set of resources without interfering with each other.
Part of the pipeline would not be split. Wide instruction decoding would
alternate between the two threads.
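To make the alternating decode concrete, here is a minimal sketch of how I picture it. The decoder width of 4, the strict per-cycle round-robin, and the names alternate_decode and DECODE_WIDTH are all my own assumptions for illustration, not anything AMD has described.

```python
from collections import deque

DECODE_WIDTH = 4  # assumed width of the shared decoder, in instructions per cycle


def alternate_decode(streams, cycles, width=DECODE_WIDTH):
    """Give the whole shared decoder to one thread per cycle, round-robin.

    streams: one list of instructions per thread (hypothetical input).
    Returns one list per thread with the instructions decoded for it.
    """
    pending = [deque(s) for s in streams]
    decoded = [[] for _ in streams]
    for cycle in range(cycles):
        tid = cycle % len(streams)  # thread that owns the decoder this cycle
        for _ in range(width):
            if not pending[tid]:
                break  # this thread has nothing left to decode
            decoded[tid].append(pending[tid].popleft())
    return decoded


thread_a = [f"a{i}" for i in range(8)]
thread_b = [f"b{i}" for i in range(8)]
out = alternate_decode([thread_a, thread_b], cycles=2)
```

Under these assumptions each thread gets four instructions every second cycle, so on average two per cycle, which is what a 2-way back end per thread would need.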
The split would be beneficial, however, for the integer units and the
read/write access units to the L1 data cache. The total 4-way core
could have more read/write ports, which should certainly improve
IPC for a substantial subset of programs.
The 128-bit SSE/FP units could be modified partly in connection
with the read/write ports. There was some improvement, but not
that much, when AMD almost doubled the SSE2/FP hardware going
from 64-bit units in the K8 to 128-bit units in the K10.
There is a lot of efficiency to be gained by using two K8-like SSE/FP
units which can operate independently in 2-way mode and which can operate
together as a single 128-bit unit in 4-way mode. Other similar tricks
can be beneficial as well.
Part of the higher IPC of Itanium is due to its multiple read/write
ports to cache and its 64-bit FP units, which can work independently
instead of in a "dumb" 2x64 way mode. The two independent FP units
of the Itanium can be fed directly from cache due to all these read
ports (and they can write directly to cache as well).
Something like this is what you would gain in the 4-way mode while
the 2-way modes bring the efficiency in throughput computing.
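As a quick sanity check on the efficiency argument, the claimed figures can be turned into throughput per transistor. This is only arithmetic on the 80% and 50% numbers quoted at the top, taking a single core as the baseline of 1.0; the function name perf_per_transistor is mine.

```python
def perf_per_transistor(perf_gain, transistor_gain):
    """Relative throughput per transistor versus a baseline of 1.0.

    perf_gain and transistor_gain are fractions, e.g. 0.80 for +80%.
    """
    return (1.0 + perf_gain) / (1.0 + transistor_gain)


# Claimed figures: +80% performance for +50% transistors, so 1.8 / 1.5
ratio = perf_per_transistor(0.80, 0.50)
print(f"{ratio:.2f}")
```

If the claim holds, that is roughly 20% more throughput per transistor than the baseline, which is where the 2-way modes would pay off.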
Regards, Hans