Who said that?
At most, they were the first to implement it successfully in a commercial, high-volume product.
There is no misunderstanding. They are active at the same time. Even though the threads are decoded and retired in alternating cycles, that says nothing about parallel execution in the EUs. Since we are, after all, talking about shared resources, it is logical that some arbitration exists at different levels.

Back to my posting: I linked to an Intel slide in an article, which didn't allow direct linking. Fixed that now.
My point is: there is some misunderstanding about when and where those two threads in an SMT core are active. I wanted to show that during each cycle, both threads can execute on the available execution units. In other units, like the decode or retirement units of the Netburst architecture (see the paper you linked), the threads are decoded/retired in alternating cycles. The sketch below illustrates the distinction.
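To make that concrete, here is a minimal Python sketch of such a pipeline, assuming a hypothetical two-thread core: the front end decodes for only one thread per cycle (round-robin, as in Netburst), while the issue stage can draw ready µops from both threads in the same cycle. The instruction streams, decode width, and EU count are all invented for illustration.

[CODE]
# Toy SMT pipeline: round-robin decode, simultaneous issue (illustration only).
from collections import deque

THREADS = 2
DECODE_WIDTH = 3                 # hypothetical: uops decoded per cycle
EU_WIDTH = 2                     # hypothetical: uops issued to EUs per cycle

# Made-up instruction streams, one per thread.
streams = [deque(f"T{t}-op{i}" for i in range(4)) for t in range(THREADS)]
queues = [deque() for _ in range(THREADS)]

cycle = 0
while any(streams) or any(queues):
    # Front end: only ONE thread is decoded in this cycle (alternating).
    t = cycle % THREADS
    for _ in range(DECODE_WIDTH):
        if streams[t]:
            queues[t].append(streams[t].popleft())

    # Back end: up to EU_WIDTH uops issue this cycle, drawn from BOTH threads.
    issued = []
    qi = cycle % THREADS         # simple arbitration for the shared EUs
    while len(issued) < EU_WIDTH and any(queues):
        if queues[qi]:
            issued.append(queues[qi].popleft())
        qi = (qi + 1) % THREADS
    print(f"cycle {cycle}: decoded thread {t}, issued {issued}")
    cycle += 1
[/CODE]

Run it and you'll see cycles where µops from both threads issue together, even though decode only ever serves one thread per cycle. That is the whole point: alternation in one shared stage does not preclude simultaneity in another.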
Says who? You're basically narrowing the definition of SMT so you can build an argument on a corner case that would run against the new definition. How do you want to have a discussion when you're changing the definition of things so they fit your stance? After all, SMT actually is about simultaneously issuing instructions from multiple threads to a shared set of EUs to make better use of them.
Talk about a logical fallacy the size of Everest.
Both cache hit rates and branch-prediction accuracy are in the mid-90% range for modern CPUs. Yet they struggle to get above 1.5 IPC. The reason is simple: the ILP just isn't there.
Most code is written in a sequential manner, and data dependencies are the norm; see the sketch below.
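To put a number on it: on an ideally wide machine, the best-case IPC of a code fragment is its instruction count divided by the length of its longest dependency chain. A small Python sketch with an invented six-instruction fragment (the opcodes in the comments are only for flavor):

[CODE]
# Best-case IPC from a data-dependency graph (invented example).
from functools import lru_cache

# Each instruction lists the instructions whose results it consumes.
deps = {
    "i0": [],          # load  r1, [a]
    "i1": ["i0"],      # add   r2, r1, 1    (waits on the load)
    "i2": ["i1"],      # mul   r3, r2, r2   (chain continues)
    "i3": ["i2"],      # store [b], r3
    "i4": [],          # independent load elsewhere in the code
    "i5": ["i4"],      # add using that load
}

@lru_cache(maxsize=None)
def depth(instr):
    """Length of the longest dependency chain ending at instr."""
    return 1 + max((depth(d) for d in deps[instr]), default=0)

critical_path = max(depth(i) for i in deps)
print(f"{len(deps)} instructions, critical path {critical_path} cycles")
print(f"best-case IPC with infinite EUs: {len(deps) / critical_path:.2f}")
# -> 1.50: even with perfect caches, perfect prediction, and unlimited
#    execution units, this fragment can never exceed 1.5 IPC.
[/CODE]

The ceiling comes from the dependency chain, not from the number of EUs, which is exactly why more execution units alone don't help single-thread IPC.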
Another reason for more EUs than for the average use is if you have a checkpointing and replaying architecture. Sometimes speculation goes wrong and it's good to be able to replay instructions quickly.

But why don't you try using the free execution resources with the thread itself? If, for example, the thread stalls due to a cache miss, it could speculatively continue to run the instructions that load data, as a kind of prefetch. After the cache miss is served, the thread could execute normally, with the data it needs already in place. See the sketch below.
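What you describe is essentially run-ahead execution. A minimal Python sketch of the idea, assuming a made-up program format and a flat 200-cycle miss penalty: on a miss, checkpoint the PC, run ahead to prefetch the addresses of future loads, then roll back and replay once the miss is served.

[CODE]
# Toy run-ahead execution (all structures and latencies are invented):
# on a cache miss, checkpoint, run ahead to discover future load
# addresses and prefetch them, then roll back and replay.

MISS_PENALTY = 200                 # hypothetical DRAM latency in cycles
cache = set()                      # addresses currently cached

def execute(program):
    cycles, pc = 0, 0
    while pc < len(program):
        op, addr = program[pc]
        if op == "load" and addr not in cache:
            # Miss: instead of stalling idle, run ahead and prefetch.
            for future_op, future_addr in program[pc + 1:]:
                if future_op == "load":
                    cache.add(future_addr)   # overlap these misses
            cache.add(addr)
            cycles += MISS_PENALTY           # pay the penalty once
            continue                         # replay: pc is unchanged
        cycles += 1                          # hit or ALU op: one cycle
        pc += 1
    return cycles

prog = [("load", 0x100), ("add", None), ("load", 0x200), ("load", 0x300)]
print("with run-ahead:", execute(prog), "cycles")
# -> 204 cycles: one miss penalty covers all three loads. Serialized,
#    the same fragment would stall roughly 3 * 200 cycles.
[/CODE]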
I am sure future CPUs will include run-ahead and advanced replay microarchitectures. But it is possible that OoO execution will hinder such approaches. The first CPU with scout threads (run-ahead execution), Sun's Rock, was a major failure performance-wise.