Originally Posted by savantu
That's correct. But you originally wrote [...] and posted a paper made by Intel.
I wrote: "Since SMT actually is about simultaneously issueing instructions of multiple threads to a set of EUs to make better use of them, [...]"
And it's not only me, who points to the simultaneous execution of instructions of multiple threads on a set of EUs:
From http://publib.boulder.ibm.com/infoce...2/iphb2smt.htm:
"Simultaneous multi-threading is the ability of a single physical processor to simultaneously dispatch instructions from more than one hardware thread context at the same time. Simultaneous multi-threading is designed to produce performance benefits in commercial environments and for workloads that have a high Cycles Per Instruction (CPI) count."
Maybe you use a different definition of SMT? Could you please post it, so that I can see if there is a difference?
But we left the times of accumulator architectures behind us. There are 16 GPRs and 16 XMM/YMM registers. You can do a lot of calculations using these, in chains that are independent of each other over a couple of instructions. This is what can be exploited by OoO execution.
Low branch or cache miss rates don't mean that they have only a small effect on IPC. For branch prediction:
Let's assume:
14 cycles of branch misprediction penalty
avg. IPC of 3 (with 4-wide decode/issue etc.) as long as there is no misprediction - CPI = 1/IPC = 0.33
BP hit rate = 95%
percentage of branch insts: 20%
CPI = 0.33 + 0.2*0.05*14 = 0.47 or an avg IPC of 2.1
Now let's add the cache:
L1D$ hit rate: 95%
miss latency: 9 cycles (L2 hit)
percentage of mem accesses: 30%
CPI = 0.33 + 0.3*0.05*9 = 0.465 or an avg IPC of 2.15
Now both effects together:
CPI = 0.33 + 0.2*0.05*14 + 0.3*0.05*9 = 0.605 or an avg IPC of 1.65
I still left out L3, mem and other effects. But this should be enough to see the effect of a few percent of miss rates.
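The arithmetic above can be sketched as a tiny Python model (the helper name cpi_with_penalties is mine, not from any library):

```python
# Analytical CPI model for the numbers above: base CPI plus each stall
# source weighted by its instruction fraction and its miss rate.

def cpi_with_penalties(base_cpi, penalties):
    """penalties: list of (inst_fraction, miss_rate, penalty_cycles)."""
    return base_cpi + sum(f * m * p for f, m, p in penalties)

BASE   = 0.33                 # avg IPC of 3 without stalls
BRANCH = (0.20, 0.05, 14)     # 20% branches, 5% mispredicted, 14-cycle penalty
CACHE  = (0.30, 0.05, 9)      # 30% mem ops, 5% L1D misses, 9-cycle L2 latency

for name, pens in [("branches only", [BRANCH]),
                   ("cache only",    [CACHE]),
                   ("both",          [BRANCH, CACHE])]:
    cpi = cpi_with_penalties(BASE, pens)
    print(f"{name}: CPI = {cpi:.3f}, IPC = {1 / cpi:.2f}")
```

Running it reproduces the three cases: CPI 0.47, 0.465 and 0.605, i.e. IPC drops from 3 to 1.65 once both penalty sources act together.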
Another reason for having more EUs than average utilization would suggest is a checkpointing and replaying architecture. Sometimes speculation goes wrong, and it's good to be able to replay instructions quickly.
I don't think that OoO execution would be a hindrance. It's the actual execution of code, after all. But it stalls sometimes, which allows for more advanced techniques. And even if not, a lower-priority handling of runahead execution (e.g. executing future mem ops as prefetches) would work.
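The "future mem ops as prefetches" idea can be illustrated with a toy simulation. This is only a sketch of the principle under heavy simplifying assumptions (a made-up trace format, a cache modeled as a set, and a RunaheadCore class I invented for illustration):

```python
# Toy model of runahead execution: on a cache miss the core keeps
# scanning ahead in the instruction stream and turns future loads
# into prefetches instead of sitting idle.

class RunaheadCore:
    def __init__(self):
        self.cache = set()        # addresses currently cached or in flight
        self.prefetched = []      # loads issued during runahead mode

    def run(self, trace):
        """trace: list of ('load', addr) or ('op',) tuples."""
        for i, inst in enumerate(trace):
            if inst[0] == 'load' and inst[1] not in self.cache:
                self.cache.add(inst[1])   # miss request is now in flight
                # Runahead: while waiting, prefetch future loads.
                for future in trace[i + 1:]:
                    if future[0] == 'load' and future[1] not in self.cache:
                        self.prefetched.append(future[1])
                        self.cache.add(future[1])
        return self.prefetched

trace = [('op',), ('load', 0x100), ('op',), ('load', 0x200), ('load', 0x100)]
print(RunaheadCore().run(trace))
```

Here the miss on 0x100 triggers runahead, which warms up 0x200 before the core actually reaches it, so the later loads hit.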
Sun Rock added more to the design than just scout threads, and there is not just one way to implement runahead execution. It's like saying: "Car A will be a fail because it uses 4 wheels and a motor like car B, which failed." There are other statements:
From http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf:
"An efficient runahead execution processor employing these techniques executes only 6.2% more instructions than a conventional out-of-order execution processor but achieves 22.1% higher Instructions Per Cycle (IPC) performance."