Originally posted by Mechromancer:
Dresdenboy, concerning the diagram below, specifically the decode stage: you previously stated that this uarch looks able to handle up to 8 threads if the threads are separate and made up of 4 individual microcode and 4 individual fastpath instructions. Will software engineers need to target this architecture specifically to get the most efficiency out of that kind of setup? Will this turn out to be another largely unsupported AMD feature like SSE4a, or will it make the majority of multi-threaded software run crazy fast?
I didn't say 8 threads (just 2); what I wrote about was the micro-ops. The point is that one thread could have one (probably even more) microcoded (complex) operations decoded while the other thread is decoded by the fast path decoders. So one thread with simple ops could be decoded for the appropriate int cluster, while the second thread could have some AVX instructions (probably microcoded, see the KGC blog entry) decoded for the FPU.
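To make the idea concrete, here is a toy sketch in Python of how two threads might share the decode stage in one cycle. It is only an illustration of the speculation above, not a model of the real hardware. The names `Op`, `route` and `decode_cycle`, the per-thread int cluster numbering, and the choice to mark `vaddps` as microcoded are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    thread: int       # 0 or 1 (two threads, as discussed above)
    name: str         # mnemonic, used only for display
    microcoded: bool  # assumed flag: complex op that needs the microcode path
    fp: bool          # assumed flag: FPU/AVX work rather than integer work

def route(op):
    """Return (decoder path, execution target) for one op in this toy model."""
    decoder = "microcode" if op.microcoded else "fastpath"
    # Assumption: each thread has its own int cluster; FP work goes to the FPU.
    target = "FPU" if op.fp else "int cluster %d" % op.thread
    return decoder, target

def decode_cycle(ops):
    """Route every op presented to the decode stage in a single cycle."""
    return [(op.thread, op.name) + route(op) for op in ops]

# Thread 0 issues simple integer ops; thread 1 issues a (hypothetically
# microcoded) AVX op in the same cycle.
cycle = [
    Op(0, "add", microcoded=False, fp=False),
    Op(0, "mov", microcoded=False, fp=False),
    Op(1, "vaddps", microcoded=True, fp=True),
]

for thread, name, decoder, target in decode_cycle(cycle):
    print("thread %d: %-6s -> %-9s -> %s" % (thread, name, decoder, target))
```

In this sketch nothing about the code is tuned for the design; the ops of each thread simply land on whichever decoder path fits them.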