Quote: Originally posted by ohnoitseddy
Highly dubious indeed. Considering that a module is slightly larger than a SB core, that would mean AMD is able to extract twice the performance from the same die area (roughly speaking). :/ I think not.
AMD actually claims they get almost two cores' worth of performance from a monolithic, dual-thread-capable module (as they call it; another term is "optimized dual core"). This is achieved via clever sharing of resources. The thing is that with a shared front end, all the benefits of SMT (filling in the pipeline bubbles) are still present with AMD's approach. Only this time you don't share the same execution units: each thread gets a whole execution core to itself. AMD describes this as "smoothing of inefficient/bursty usage". AMD also invested heavily in both instruction and data prefetch, which are now an order of magnitude better than in the 10h family. So you are left with fully featured cores that have all the advantages of SMT and perform nearly at the level of independent cores in a hypothetical non-shared module (as if they were built a la Athlon/C2D, with no shared parts except maybe the L2 cache).

What I'm trying to say is that with Bulldozer we can't use the same comparison methods we used in the past, where we compare the die area of a single core of, say, Athlon and Nehalem and derive the die area investments both firms made in logic and cache. Since Bulldozer is organized the way it is, we have no clue how much die area the now-shared parts would occupy in a hypothetical non-shared design. There are some numbers being thrown around (from 15% all the way up to a 30% bigger module), but only AMD knows the exact figures. Now, since sharing has some benefits, as previously mentioned, as well as some possible performance downsides, AMD invested in areas that maximize the benefits and minimize the bad effects of sharing. How well they did this will make or break the module approach and the whole Bulldozer idea.

I don't doubt that Zambezi can be 2x faster than the 980X in some cases. In optimized FMA4/AVX applications the difference can be even higher than that. But in legacy code I'm skeptical. The fact is they don't need to be 2x faster, they just need to be faster. If they are, this means Zambezi's cores are much closer to the level of Nehalem, no matter how they achieved it (whether it's just IPC, just clock, or a combination of the two). Next year will bring an improved Bulldozer in the form of the 10-core Komodo part for desktop. That's 25% more cores, with possible minor IPC tweaks and 25% more TDP headroom for the cores to boost clocks in poorly threaded workloads.
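
To put a quick number on the core-count claim, here is a rough sketch in Python. The 8-core count for Zambezi is my assumption (it is what the "25% more cores" figure implies for a 10-core part), and the function name is just for illustration. It only checks the arithmetic and says nothing about real-world scaling, which depends on clocks, IPC and how well the workload threads.

```python
def percent_increase(old, new):
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0

# Assumption: Zambezi has 8 cores (4 modules), implied by the 25% figure.
zambezi_cores = 8
# From the post: Komodo is the 10-core desktop part.
komodo_cores = 10

core_gain = percent_increase(zambezi_cores, komodo_cores)
print(f"Komodo has {core_gain:.0f}% more cores than Zambezi")
```

With unchanged clocks and IPC, that gain would be the ceiling for perfectly threaded workloads; anything above it would have to come from the IPC tweaks or extra boost headroom.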