Well, BD is very similar to Prescott in many ways: it is very, very sensitive to code quality. Because each thread effectively gets only 2-wide decode, the front end is very limited, and the out-of-order engine has to wait through more decode cycles before it has enough instructions in flight to find parallelism.

This is the major difference from Hyper-Threading, where each thread can get up to 5-wide decode. It makes BD heavily dependent on code scheduling, and that is fairly hard to overcome without adding more hardware on each side of the BD decoder ... I am sure AMD has been trying to overcome this for the last 2 years .... The problem is that it will cost even more transistors and die space ==> $$$

And that is without counting the register files of the FP unit, which also has to take dispatches from each side of the int pipelines. Plenty of opportunities for locking issues there too.

BD will stay very sensitive to code quality as long as the front end is not 4-wide for each thread; this is the bottom line.
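
To make the argument a bit more concrete, here is a toy model I would use to illustrate it. It is only a sketch under big assumptions, not a simulation of BD or any real chip: instructions are decoded in program order, `width` per cycle, every instruction has 1-cycle latency, and the back end has unlimited execution units. The function names (`finish_cycle`, `two_chains`) are made up for this example. It runs the same work (two independent dependency chains) in a well-scheduled order and in a badly scheduled order, at decode width 2 and 4.

```python
def finish_cycle(deps, width):
    """Cycle in which the last instruction executes.

    deps[i] is the index of the earlier instruction that i depends on,
    or None. List order is program order. Toy assumptions: in-order
    decode of `width` instructions per cycle, 1-cycle latency,
    unlimited execution units.
    """
    done = []
    for i, dep in enumerate(deps):
        decode_cycle = i // width + 1      # front end: `width` per cycle
        ready = decode_cycle + 1           # can execute the cycle after decode
        if dep is not None:
            ready = max(ready, done[dep] + 1)  # wait for the producer
        done.append(ready)
    return max(done)


def two_chains(k, interleaved):
    """Two independent dependency chains of k instructions each."""
    if interleaved:
        # a0 b0 a1 b1 ...: each instruction depends on the one 2 slots back
        return [i - 2 if i >= 2 else None for i in range(2 * k)]
    # a0 a1 ... then b0 b1 ...: chain B sits behind all of chain A
    return [None if i in (0, k) else i - 1 for i in range(2 * k)]


if __name__ == "__main__":
    k = 8
    for width in (2, 4):
        good = finish_cycle(two_chains(k, interleaved=True), width)
        bad = finish_cycle(two_chains(k, interleaved=False), width)
        print(f"width {width}: scheduled {good} cycles, unscheduled {bad} cycles")
```

With k = 8 this toy gives 9 vs 13 cycles at width 2 and 9 vs 11 cycles at width 4. The well-scheduled order costs the same either way, but the badly scheduled order hurts more with the narrow front end, because the second chain reaches the out-of-order engine later.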

Hope it is ok to share my point of view and personal analysis of the performance issues.

It is ok to disagree ;-)