Originally posted by Drwho?:
I propose you take a simple linear algorithm and run it on one thread, then count the number of instructions retired and divide by the number of clock ticks ... You'll be surprised ;-)
(Make sure your code is purely compute, with 1 to 2 instruction dependencies ...)

PowerPoints are one thing, but measuring and checking yourself is much better ... Otherwise, at 4.2 GHz, how could you explain the poor performance of BD on SuperPI? Low IPC ... Then ask yourself, if you measure the IPC for each thread, why it never goes above 2 on a single thread ... Please experiment before trying to correct me. I did my homework ;-)


Then, for your Intel diagram, you forgot to count code fusion ... SandyB is 4 wide + fusion ... That gives you up to 5!

We saw a lot of PowerPoint slides, but the measurements don't match what is shown in the PPT. Sorry, you assume the marketing slides are correct; this is where the gap is. I looked everywhere and could not find anything that clearly says it will decode more than 2 per thread, and match it with ASM code doing more than 2 IPC. Did you try?


Hehe ...

Francois

Sorry, I'm a bit lost here. Why are you focusing on one thread when the front end of Bulldozer is responsible for two threads, just like Sandy Bridge?

I know it falls behind clock for clock, but don't you think that has more to do with other bottlenecks? Including, for integer code, the much-debated ALU resources on a single thread? What about the longer pipeline? Are you taking into account that there may still be a deficiency in branch prediction next to Intel? Pipeline bubbles (floating point) that get filled by a 2nd thread?

What would be more interesting, I think, is comparing code that's exclusively floating point with a single thread, then two threads, both on the one module. This would remove the integer clusters from the equation completely. (Don't know if this is practical ... programming knowledge is my deficiency, so help me out here!)
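
For what it's worth, the arithmetic behind both experiments (per-thread IPC, and one thread vs. two threads on the same module) could be sketched roughly like this in Python. The function names `ipc` and `module_scaling` are made up for illustration, and every counter value is a placeholder, not a measurement. On Linux the raw counts could come from something like `perf stat -e instructions,cycles`. The sketch also assumes both threads are measured over the same cycle window.

```python
from typing import Sequence


def ipc(instructions_retired: int, cycles: int) -> float:
    """Instructions per clock: retired instructions divided by elapsed core cycles."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions_retired / cycles


def module_scaling(ipc_one_thread: float, ipc_two_threads: Sequence[float]) -> float:
    """Combined IPC of two threads sharing a module, relative to one thread alone.

    A result near 2.0 would suggest the second thread costs almost nothing;
    a result near 1.0 would suggest a shared resource was already saturated
    by a single thread.
    """
    if ipc_one_thread <= 0:
        raise ValueError("single-thread IPC must be positive")
    return sum(ipc_two_threads) / ipc_one_thread


if __name__ == "__main__":
    # Placeholder counter readings for an FP-only loop (NOT real measurements).
    solo = ipc(instructions_retired=1_500_000_000, cycles=1_000_000_000)
    pair = [
        ipc(instructions_retired=1_000_000_000, cycles=1_000_000_000),
        ipc(instructions_retired=1_000_000_000, cycles=1_000_000_000),
    ]
    print(f"single-thread IPC: {solo:.2f}")
    print(f"two-thread scaling on one module: {module_scaling(solo, pair):.2f}x")
```

Comparing the scaling figure for FP-only code against the same figure for integer-only code would be one way to separate the shared front end and FPU from the per-core integer clusters.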