Quote Originally Posted by drfedja
The old slides are probably based on CPU simulation, but in a real-world scenario the BD architecture behaves very differently. Maybe some of the advanced features are disabled because of a serious bug, and there is a performance penalty. I can't wait to see the BD errata list. I can't believe the cache architecture is the simple reason for the low performance, because they can simulate how much performance penalty comes from the smaller 16K L1D and the write-through cache policy. The hit rate of a 16KB 4-way cache is only 1 or 2% lower than that of a 64KB 2-way cache. A 2MB L2 also has a much higher hit rate than a 0.5MB L2. Maybe BD is better optimized for large data working sets.
The slide is from December 2010. Bulldozer taped out in Q2 2010. The data is most likely NOT based on simulations but on real hardware. The clocks on the initial steppings could be low, but they can extrapolate the performance of higher-clocked models easily. The point is that by Dec 2010 the design was well past the finish point and they were sampling to partners. The only problems they might have hit by that time are low clocks, low yields, or leaky parts with power draw issues (or any combination of those). If anything was wrong (a design bug) they would have known it and factored it into all the performance expectations, and hence that Scorpius slide would reflect it too. That is why I think there is no performance problem but something else is going on.
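On the hit-rate comparison drfedja makes, you can get a rough feel for "16KB 4-way vs 64KB 2-way" with a toy set-associative cache model. The C sketch below is purely illustrative: LRU replacement and a made-up mostly-sequential address stream, nothing like AMD's internal simulators. With this particular stream the 48KB working set spills out of the 16KB cache, so expect a bigger gap than the 1-2% quoted above; the percentages depend entirely on what you feed it.

[CODE]
/* Toy set-associative cache model with LRU replacement.
 * Compares hit rates of a 16KB 4-way vs a 64KB 2-way L1D on a
 * synthetic address stream. Purely illustrative; real numbers
 * depend entirely on the workload trace you feed in. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define LINE 64                      /* cache line size in bytes */

typedef struct {
    uint64_t *tags;                  /* ways per set, MRU first; 0 = empty */
    unsigned sets, ways;
    uint64_t hits, accesses;
} cache_t;

static cache_t cache_new(unsigned size_bytes, unsigned ways)
{
    cache_t c = {0};
    c.ways = ways;
    c.sets = size_bytes / (LINE * ways);
    c.tags = calloc((size_t)c.sets * ways, sizeof(uint64_t));
    return c;
}

static void cache_access(cache_t *c, uint64_t addr)
{
    uint64_t line = addr / LINE;
    unsigned set  = line % c->sets;
    uint64_t tag  = line / c->sets + 1;        /* +1 so 0 can mean "empty" */
    uint64_t *w   = &c->tags[(size_t)set * c->ways];
    unsigned i;

    c->accesses++;
    for (i = 0; i < c->ways; i++) {
        if (w[i] == tag) {                     /* hit: promote to MRU */
            for (; i > 0; i--) w[i] = w[i - 1];
            w[0] = tag;
            c->hits++;
            return;
        }
    }
    for (i = c->ways - 1; i > 0; i--)          /* miss: evict the LRU way */
        w[i] = w[i - 1];
    w[0] = tag;
}

int main(void)
{
    cache_t a = cache_new(16 * 1024, 4);       /* Bulldozer-style L1D */
    cache_t b = cache_new(64 * 1024, 2);       /* K10/Thuban-style L1D */
    uint64_t addr = 0;
    int i;

    srand(1);
    for (i = 0; i < 10 * 1000 * 1000; i++) {
        /* mostly-sequential 8-byte strides over a 48KB buffer,
         * with an occasional random jump */
        if (rand() % 64 == 0)
            addr = (uint64_t)(rand() % (48 * 1024));
        else
            addr = (addr + 8) % (48 * 1024);
        cache_access(&a, addr);
        cache_access(&b, addr);
    }
    printf("16KB 4-way hit rate: %.2f%%\n", 100.0 * a.hits / a.accesses);
    printf("64KB 2-way hit rate: %.2f%%\n", 100.0 * b.hits / b.accesses);
    free(a.tags);
    free(b.tags);
    return 0;
}
[/CODE]

Nothing in there models the write-through policy or the shared L2, which is where more of the penalty could hide; it only shows how cheap it is to sanity-check the raw hit-rate part of the claim.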

@ Oliverda

While the scaling is that of an 8-core chip, the single-core performance is abysmal. Leaks point to a single-thread score around 24% lower than one Thuban core at the same clock. Either both FMACs are not utilized in single-threaded tests or something else is the problem. Two FMACs should not be slower in legacy code than one Thuban core, especially since we know the FMACs are more flexible and can execute any of the FP instructions.
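Once retail chips are out, anyone can poke at the single-thread FP side with something as crude as the sketch below. It is only a rough probe I'm imagining (the file name fp_probe.c is a placeholder), not the benchmark behind the leaks: two independent multiply chains plus two independent add chains, so a core with two FP pipes (two FMACs on BD, or the separate FADD and FMUL on Thuban) has work it can overlap. Build with something like gcc -O2 fp_probe.c -o fp_probe (add -lrt on older glibc) and run it pinned to one core at matched clocks.

[CODE]
/* Minimal single-thread FP throughput probe (a sketch, not the leaked
 * benchmark). Two independent multiply chains and two independent add
 * chains give a core with two FP pipes something to overlap. */
#include <stdio.h>
#include <time.h>

#define ITERS 200000000LL

int main(void)
{
    volatile double in = 1.0000001;    /* volatile: blocks constant folding */
    double k = in;
    double m0 = k, m1 = k, a0 = 0.0, a1 = 0.0;
    struct timespec t0, t1;
    long long i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++) {
        m0 *= k;                       /* multiply chain 1 */
        m1 *= k;                       /* multiply chain 2 */
        a0 += k;                       /* add chain 1      */
        a1 += k;                       /* add chain 2      */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* 4 FP ops per iteration; checksum keeps the loop from being elided */
    printf("checksum %g, %.2f GFLOP/s single-thread\n",
           m0 + m1 + a0 + a1, 4.0 * ITERS / secs / 1e9);
    return 0;
}
[/CODE]

It is scalar and latency-bound, so it will not hit peak throughput on anything, but at equal clocks it gives a quick first read on whether one BD core really trails one Thuban core on plain mul/add work.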