
Originally Posted by Mechromancer
Dresdenboy, concerning the diagram below, specifically the decode stage: you previously stated that it looks like this uarch will be able to do up to 8 threads if the threads are separate and made up of 4 individual microcode and 4 individual fastpath instructions. Will software engineers need to target this architecture specifically to get the most efficiency out of that kind of setup? Will this turn out to be another largely unsupported AMD implementation, like SSE4a, or will it make the majority of multi-threaded software run crazy fast?