No, the front end is vertically multithreaded, i.e. in one clock cycle all 4 decoders serve one thread, and in the next cycle they serve the other. If the second thread has nothing to decode, the other thread can obviously keep the front end for longer than one clock cycle ;-)
bulldozer4b.jpg
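To make the alternation concrete, here is a toy sketch in Python. The function name, the constant, and the arbitration rule (strict alternation, with an idle thread's turn handed to the other thread) are assumptions for illustration only, not AMD's documented arbiter.

```python
DECODERS = 4  # decode width that is handed to exactly one thread per cycle


def schedule_front_end(has_work_per_cycle):
    """Return which thread (0, 1 or None) owns all decoders in each cycle.

    has_work_per_cycle: list of (thread0_has_work, thread1_has_work) tuples.
    Hypothetical rule: threads alternate, but a thread keeps the front end
    when the other thread has nothing to decode.
    """
    owners = []
    preferred = 0  # thread whose turn it is in the current cycle
    for has_work in has_work_per_cycle:
        other = 1 - preferred
        if has_work[preferred]:
            owner = preferred
        elif has_work[other]:
            owner = other  # idle thread's turn goes to the busy thread
        else:
            owner = None  # nobody decodes in this cycle
        owners.append(owner)
        if owner is not None:
            preferred = 1 - owner  # the other thread gets the next turn
    return owners


if __name__ == "__main__":
    # Both threads busy: the 4 decoders ping-pong between them.
    print(schedule_front_end([(True, True)] * 4))
    # Only thread 0 busy: it keeps all 4 decoders every cycle.
    print(schedule_front_end([(True, False)] * 3))
```

In this model a thread never sees fewer than 4 decoders in a cycle it owns; what varies is how often it gets a turn.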
How do you count to 5 now? Do you include macro-op fusion, too? There are only 3 fast-path decoders plus 1 complex decoder in Intel's design. Officially they count 4:
"This is the major difference with Hyper-Threading, where each thread can get up to 5-wide decoding."
32_m.png
Anyway, macro-op fusion is now used in Bulldozer, too, so you have to count 5 for AMD as well (though in fewer cases: AMD's fusion is at Conroe's level, Nehalem gained more fusion capabilities, and I am not sure about Sandy Bridge).
As said above, each thread gets 4 decoders, or 5 if you count fusion. How does Intel run Hyper-Threading on the 3+1 decoders? Does each thread get 4 decoders, so 8 in total? That would be news to me and to Intel. If Intel does it differently from AMD, then they have to run both threads simultaneously. However, that would mean "only" 2 decoders for each thread, and that's exactly the baaaad case you wrote about above in your incorrect statement about AMD's decoder in the beginning.
"BD will stay very sensitive to code quality as long as the front end is not 4 wide on each side of the threads, this is the bottom line."
Discussion is always fine; however, in the above case, I think you are rather wrong.
"Hope it is ok to share my point of view and personal analysis of the performance issues."
It is ok to disagree ;-)
cheers
Opteron



