Quote Originally Posted by savantu View Post
Who said that?
At most, they were the first to successfully implement it in a commercial, large-volume product.
That's correct. But you originally wrote:
Quote Originally Posted by savantu
Maybe instead of amateur sources and interpretations, we should look into real technical articles, written by the people who invented these technologies and published at conferences and in tech journals.
and then posted a paper written by Intel.

Quote Originally Posted by savantu View Post
There is no misunderstanding. They are active at the same time. Even with the threads being decoded and retired in alternating cycles, that doesn't say anything about the parallel execution. Since we're talking, after all, about a shared resource, it is logical that some arbitration exists at different levels.

Says who? You're basically narrowing the definition of SMT so you can build an argument on a corner case that goes against the new definition. How do you want to have a discussion when you're changing the definition of things so it fits your stance?
Talk about a logical fallacy the size of Everest.
I wrote: "Since SMT actually is about simultaneously issuing instructions of multiple threads to a set of EUs to make better use of them, [...]"

And it's not only me who points to the simultaneous execution of instructions of multiple threads on a set of EUs:
Simultaneous multi-threading is the ability of a single physical processor to simultaneously dispatch instructions from more than one hardware thread context at the same time. Simultaneous multi-threading is designed to produce performance benefits in commercial environments and for workloads that have a high Cycles Per Instruction (CPI) count.
http://publib.boulder.ibm.com/infoce...2/iphb2smt.htm

Maybe you use a different definition of SMT? Could you please post it, so that I can see if there is a difference?
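
To make clear what I mean by "simultaneous", here is a toy model of an SMT issue stage, entirely my own sketch (no real CPU works exactly like this): in a single cycle, ready uops from both threads go to the shared issue ports, and arbitration only breaks ties per port.

Code:
#include <stdio.h>

#define NUM_PORTS   4
#define NUM_THREADS 2

/* ready[t][p] != 0 means thread t has a uop ready for issue port p.
   The values are made up, just to show the principle. */
static int ready[NUM_THREADS][NUM_PORTS] = {
    {1, 0, 1, 0},   /* thread 0 */
    {0, 1, 1, 1},   /* thread 1 */
};

int main(void)
{
    /* One issue cycle: every port can accept one uop from EITHER thread. */
    for (int p = 0; p < NUM_PORTS; p++) {
        for (int t = 0; t < NUM_THREADS; t++) {
            int thr = (p + t) % NUM_THREADS;  /* rotating priority per port */
            if (ready[thr][p]) {
                printf("port %d: uop from thread %d\n", p, thr);
                break;                        /* one uop per port per cycle */
            }
        }
    }
    return 0;   /* the output shows both threads issuing in the SAME cycle */
}

The rotating priority is exactly the kind of shared-resource arbitration you mentioned, and it doesn't stop the two threads from issuing simultaneously.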

Quote Originally Posted by savantu View Post
Both cache hit rates and branch predictor accuracy are in the mid-90% range for modern CPUs. Yet they struggle to get above 1.5 IPC. The reason is simple: ILP is truly missing.
Most code is written in a sequential manner and data dependencies are the norm.
But we left the times of accumulator architectures behind us. There are 16 GPRs and 16 XMM/YMM registers. You can do a lot of calculations with these that are independent of each other over a couple of instructions, and this is exactly what OoO execution can exploit.
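
A small example of what I mean, my own sketch in C: two accumulators form two independent dependency chains, which an OoO core can overlap instead of serializing on a single accumulator.

Code:
/* Two independent dependency chains: s0 and s1 never depend on each
   other inside the loop, so their multiply-adds can execute in
   parallel on an OoO core. */
double dot2(const double *a, const double *b, int n)
{
    double s0 = 0.0, s1 = 0.0;          /* two independent accumulators */
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += a[i]     * b[i];          /* chain 0 */
        s1 += a[i + 1] * b[i + 1];      /* chain 1, independent of chain 0 */
    }
    return s0 + s1;                     /* the chains only meet at the end */
}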

Low branch and cache miss rates don't mean that they have only a small effect on IPC. For branch prediction:
Let's assume:
14 cycles of branch misprediction penalty
avg. IPC of 3 (with 4-wide decode/issue etc.) as long as there is no misprediction, i.e. CPI = 1/IPC ≈ 0.33
BP hit rate = 95%
percentage of branch insts: 20%

CPI = 0.33 + 0.2*0.05*14 = 0.47, or an avg. IPC of 2.1 (base CPI plus branch fraction * misprediction rate * penalty)

Now let's add the cache:
L1D$ hit rate: 95%
miss latency: 9 cycles (L2 hit)
percentage of mem accesses: 30%

CPI = 0.33 + 0.3*0.05*9 = 0.465, or an avg. IPC of 2.15

Now both effects together:
CPI = 0.33 + 0.2*0.05*14 + 0.3*0.05*9 = 0.605, or an avg. IPC of 1.65

I still left out L3, memory, and other effects, but this should be enough to see the impact of a few percent of misses.
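
If anyone wants to play with the numbers, here is the same toy model as a few lines of C, using exactly the assumptions above and nothing else:

Code:
#include <stdio.h>

/* Toy CPI model: CPI = CPI_base + fraction * miss rate * penalty,
   summed over the stall sources considered above. */
int main(void)
{
    const double cpi_base  = 0.33;              /* avg. IPC of 3, rounded */
    const double br_stall  = 0.2 * 0.05 * 14.0; /* branch mispredictions  */
    const double mem_stall = 0.3 * 0.05 * 9.0;  /* L1D misses (L2 hits)   */

    printf("BP only : CPI %.3f, IPC %.2f\n",
           cpi_base + br_stall, 1.0 / (cpi_base + br_stall));
    printf("L1D only: CPI %.3f, IPC %.2f\n",
           cpi_base + mem_stall, 1.0 / (cpi_base + mem_stall));
    printf("both    : CPI %.3f, IPC %.2f\n",
           cpi_base + br_stall + mem_stall,
           1.0 / (cpi_base + br_stall + mem_stall));
    return 0;
}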

Another reason to have more EUs than the average case needs is a checkpointing and replaying architecture: sometimes speculation goes wrong, and then it's good to be able to replay instructions quickly.

Quote Originally Posted by savantu View Post
I am sure future CPUs will include run-ahead and advanced replay uarchs. But it is possible that OoO execution will hinder such approaches. The first CPU with scout threads (run-ahead execution), Sun Rock, was a major failure performance-wise.
I don't think that OoO execution would be a hindrance. It is the actual execution of the code, after all. But it stalls sometimes, and that is exactly where more advanced techniques can step in. And even if not, lower-priority handling of runahead execution (e.g. executing future memory ops as prefetches) would work.
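
To show what I mean, here is a very rough, self-contained toy simulation of runahead (my own simplification, nothing to do with Rock or any real uarch): on a long-latency miss the state is checkpointed, execution continues only to touch future addresses, and everything is rolled back when the miss returns.

Code:
#include <stdio.h>

struct cpu { int pc; int regs[4]; };

int main(void)
{
    struct cpu c = { 0, {0, 0, 0, 0} };
    struct cpu checkpoint;
    int runahead = 0;

    for (int cycle = 0; cycle < 8; cycle++) {
        int miss      = (cycle == 2);  /* pretend a load misses here     */
        int miss_back = (cycle == 6);  /* ...and its data arrives here   */

        if (miss && !runahead) {
            checkpoint = c;            /* snapshot architectural state   */
            runahead = 1;
            printf("cycle %d: miss -> enter runahead\n", cycle);
        }

        c.regs[c.pc % 4] += c.pc;      /* stand-in for "execute an insn" */
        c.pc++;
        if (runahead)                  /* future mem ops only serve as   */
            printf("cycle %d: prefetch hint for pc %d\n", cycle, c.pc);

        if (runahead && miss_back) {
            c = checkpoint;            /* discard all runahead results   */
            runahead = 0;
            printf("cycle %d: miss returned -> rollback to pc %d\n",
                   cycle, c.pc);
        }
    }
    return 0;
}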

Sun Rock added more to the design than just scout threads, and there is more than one way to implement runahead execution. It's like saying: "Car A will fail because it uses 4 wheels and a motor like car B, which failed." There are other findings:
An efficient runahead execution processor employing these techniques executes only 6.2% more instructions than a conventional out-of-order execution processor but achieves 22.1% higher Instructions Per Cycle (IPC) performance.
From http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf