Another thing we should keep in mind is that those percentage numbers vary greatly depending on what you are calculating against.
50% extra core space could be 12% extra die space. There are lots of things other than cores on a die.
That's correct. But you originally wrote:
and posted a paper made by Intel.
Quote:
Originally Posted by savantu
I wrote: "Since SMT actually is about simultaneously issuing instructions of multiple threads to a set of EUs to make better use of them, [...]"
And it's not only me who points to the simultaneous execution of instructions of multiple threads on a set of EUs:
http://publib.boulder.ibm.com/infoce...2/iphb2smt.htm
Quote:
Simultaneous multi-threading is the ability of a single physical processor to simultaneously dispatch instructions from more than one hardware thread context at the same time. Simultaneous multi-threading is designed to produce performance benefits in commercial environments and for workloads that have a high Cycles Per Instruction (CPI) count.
Maybe you use a different definition of SMT? Could you please post it, so that I can see if there is a difference?
But we left the times of accumulator architectures behind us. There are 16 GPRs and 16 XMM/YMM registers. You can do a lot of calculations using these, which are independent of each other over a couple of instructions. This is what can be exploited by OoO execution.
The low branch or cache miss rates don't mean that they have only a small effect on IPC. For branch prediction, let's assume:
14 cycles of branch misprediction penalty
avg. IPC of 3 (with 4 decode/issue etc.) as long as there is no misprediction - CPI = 1/IPC = 0.33
BP hit rate = 95%
percentage of branch insts: 20%
CPI = 0.33 + 0.2*0.05*14 = 0.47 or an avg IPC of 2.1
Now let's add the cache:
L1D$ hit rate: 95%
miss latency: 9 cycles (L2 hit)
percentage of mem accesses: 30%
CPI = 0.33 + 0.3*0.05*9 = 0.465 or an avg IPC of 2.15
Now both effects together:
CPI = 0.33 + 0.2*0.05*14 + 0.3*0.05*9 = 0.605 or an avg IPC of 1.65
I still left out L3, mem and other effects. But this should be enough to see the effect of a few percent of miss rates.
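To make the arithmetic easy to play with, here is a minimal Python sketch of the same back-of-the-envelope model. The penalties, miss rates and instruction mix are the assumed numbers from above, not measurements; the small differences to the figures above come from rounding 1/3 to 0.33.
Code:
# Back-of-the-envelope CPI model: base CPI plus stall contributions.
# All numbers are the assumptions from the post, not measured values.

def effective_cpi(base_cpi, stalls):
    """stalls: list of (fraction_of_insts, miss_rate, penalty_cycles)."""
    return base_cpi + sum(f * m * p for f, m, p in stalls)

base_cpi = 1 / 3.0                # avg. IPC of 3 as long as nothing stalls

branch = (0.20, 0.05, 14)         # 20% branches, 95% BP hit rate, 14-cycle penalty
l1d    = (0.30, 0.05, 9)          # 30% mem accesses, 95% L1D hit rate, 9-cycle L2 latency

for name, stalls in [("branch only", [branch]),
                     ("L1D only",    [l1d]),
                     ("both",        [branch, l1d])]:
    cpi = effective_cpi(base_cpi, stalls)
    print(f"{name:12s} CPI = {cpi:.3f}  IPC = {1 / cpi:.2f}")

# branch only  CPI = 0.473  IPC = 2.11
# L1D only     CPI = 0.468  IPC = 2.14
# both         CPI = 0.608  IPC = 1.64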
Another reason for having more EUs than average use requires is a checkpointing and replaying architecture. Sometimes speculation goes wrong and it's good to be able to replay instructions quickly.
I don't think that OoO execution would be a hindrance. It's the actual execution of code, after all. But it stalls sometimes, leaving room for more advanced techniques. And even if not, lower-priority handling of runahead execution (e.g. executing future mem ops as prefetches) would work.
Sun Rock added more to the design than just scout threads and there is not just one way to implement runahead execution. It's like saying: "Car A will be a fail because it uses 4 wheels and a motor like car B, which failed." ;) There are other statements:
From http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
Quote:
An efficient runahead execution processor employing these techniques executes only 6.2% more instructions than a conventional out-of-order execution processor but achieves 22.1% higher Instructions Per Cycle (IPC) performance.
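To illustrate the basic idea (this is only a toy, not the mechanism evaluated in the paper): while the oldest load misses, a runahead core keeps executing far enough ahead to turn later misses into prefetches, so their latencies overlap instead of serializing. The trace, the latency and the perfect-prefetch assumption below are all made up for the example.
Code:
# Toy comparison of a blocking core vs. runahead execution (illustrative only).
# trace: True = L1D miss (memory latency), False = L1D hit (1 cycle).
MISS_LATENCY = 100
trace = [True, False, False, True, False, True, False, False]

def blocking_cycles(trace):
    # Each miss stalls the core for the full memory latency.
    return sum(MISS_LATENCY if miss else 1 for miss in trace)

def runahead_cycles(trace):
    # While the first miss is outstanding, the core "runs ahead" and issues
    # later misses as prefetches; on replay they are assumed to hit.
    cycles, prefetched = 0, set()
    for i, miss in enumerate(trace):
        if miss and i not in prefetched:
            cycles += MISS_LATENCY                      # blocking miss
            prefetched.update(j for j in range(i + 1, len(trace)) if trace[j])
        else:
            cycles += 1                                 # hit, or already prefetched
    return cycles

print("blocking:", blocking_cycles(trace), "cycles")    # 305
print("runahead:", runahead_cycles(trace), "cycles")    # 107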
The two weren't connected; I should have been more explicit.
Now we're in perfect agreement. My comment was directed at the emphasis on the limitation of the decode stage in the P4.
Quote:
I wrote: "Since SMT actually is about simultaneously issuing instructions of multiple threads to a set of EUs to make better use of them, [...]"
And it's not only me who points to the simultaneous execution of instructions of multiple threads on a set of EUs:
http://publib.boulder.ibm.com/infoce...2/iphb2smt.htm
Maybe you use a different definition of SMT? Could you please post it, so that I can see if there is a difference?
The definition that I would use is:
Computer Architecture: A Quantitative Approach, 3rd Edition, by John L. Hennessy and David A. Patterson.
Quote:
Simultaneous multithreading (SMT) is a variation on multithreading that uses the resources of a multiple-issue, dynamically-scheduled processor to exploit TLP at the same time it exploits ILP. The key insight that motivates SMT is that modern multiple-issue processors often have more functional unit parallelism available than a single thread can effectively use. Furthermore, with register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to the dependences among them; the resolution of the dependences can be handled by the dynamic scheduling capability.
Quote:
But we left the times of accumulator architectures behind us. There are 16 GPRs and 16 XMM/YMM registers. You can do a lot of calculations using these, which are independent of each other over a couple of instructions. This is what can be exploited by OoO execution.
The low branch or cache miss rates don't mean that they have only a small effect on IPC. For branch prediction, let's assume:
14 cycles of branch misprediction penalty
avg. IPC of 3 (with 4 decode/issue etc.) as long as there is no misprediction - CPI = 1/IPC = 0.33
BP hit rate = 95%
percentage of branch insts: 20%
CPI = 0.33 + 0.2*0.05*14 = 0.47 or an avg IPC of 2.1
Now let's add the cache:
L1D$ hit rate: 95%
miss latency: 9 cycles (L2 hit)
percentage of mem accesses: 30%
CPI = 0.33 + 0.3*0.05*9 = 0.465 or an avg IPC of 2.15
Now both effects together:
CPI = 0.33 + 0.2*0.05*14 + 0.3*0.05*9 = 0.605 or an avg IPC of 1.65
I still left out L3, mem and other effects. But this should be enough to see the effect of a few percent of miss rates.
Another reason for having more EUs than average use requires is a checkpointing and replaying architecture. Sometimes speculation goes wrong and it's good to be able to replay instructions quickly.
By your examples, IPC is 1.5-2 at best out of a theoretical possible 4. SMT could really help in such a scenario.
I know that Rock added far more; it was a quick example.
Quote:
I don't think that OoO execution would be a hindrance. It's the actual execution of code, after all. But it stalls sometimes, leaving room for more advanced techniques. And even if not, lower-priority handling of runahead execution (e.g. executing future mem ops as prefetches) would work.
Sun Rock added more to the design than just scout threads and there is not just one way to implement runahead execution. It's like saying: "Car A will be a fail because it uses 4 wheels and a motor like car B, which failed." ;) There are other statements:
From http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
I think we can agree that SMT offers more than a 1% IPC gain per 1% increase in core size,
and I think we can also agree that some people are very negative about CMT, even though it's not even out yet.
I would like to point out that Hyper-Threading does not really run two threads in parallel; it actually means that the OS is free to schedule two threads and run either one when the situation is right for whichever thread.
Eg:
T1 and T2 running on CPU
T1 and T2 scheduled to run by OS on a HTT enabled cpu
T1 has first run
T1 uses R1, R2 AND R3
R1, R2 and R3 locked by T1 since T1 is doing work
T1 requests time on the CPU, leading to a timeout
T2 gets CPU time and the resources it needs, like R1, R2 OR R3
This is how HTT runs, simple as that.
I'm pretty sure Intel is already working on their bulldozer.
Except they can't scale that hypothetical monster past 2P if they do it your way (just one of the many cons if they do this). This way, we have 16 *improved* cores, with new power gating features (new turbo too), AVX support, and a core design that is fully modular, scalable and die-area savvy. This design will serve as a basis for the future server Fusion part too, since the big shared FPU can be replaced by the future hybrid GPU part, as AMD stated themselves.
Erm, no, it isn't.
At any point in time, T1 may be running in some execution units while AT THE SAME TIME, T2 is running in the others.
See virtually any of Intel's papers on Nehalem Hyperthreading...
http://software.intel.com/en-us/arti...ng-technology/
Note the little diagram of the 4-wide Nehalem core.
And this:
The execution pipeline of processors based on Intel® Core™ microarchitecture is four instructions wide, meaning that it can execute up to four micro-operations per clock cycle. As shown in Figure 3, however, the software thread being executed often does not have four instructions eligible for simultaneous execution. Common reasons for fewer than four instructions per clock being retired include dependencies between the output of one instruction and the input of another, as well as latency waiting for data to be fetched from cache or memory.
Intel HT Technology improves performance through increased instruction level parallelism by having two threads with independent instruction streams, eliminating data dependencies between threads and increasing utilization of the available execution units. This effect typically increases the number of instructions executed in a given amount of time within a core, as shown in Figure 3. The impact of this greater efficiency is experienced by users as higher throughput (since more work gets completed per clock cycle) and higher performance per watt (since fewer idle execution units consume power without contributing to performance). In addition, when one thread has a cache miss, branch mispredict, or any other pipeline stall, the other thread continues processing instructions at nearly the same rate as a single thread running on the core. Intel HT Technology augments other advanced architectural features, higher clock speeds, and additional cores with a capability that is relatively inexpensive in terms of space on the silicon and production cost.
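As a rough sketch of why filling idle issue slots helps (the issue width matches the 4-wide core described above, but the per-cycle readiness distribution is a made-up assumption, not a Nehalem measurement): each cycle the core takes up to four ready micro-ops, and a second thread's independent stream can fill the slots the first thread leaves empty.
Code:
import random
random.seed(1)

WIDTH = 4            # 4-wide issue, as in the article
CYCLES = 100_000     # simulated cycles

def ready_uops():
    # Made-up distribution of how many micro-ops one thread has ready in a
    # cycle (dependences, cache misses etc. usually keep it well below 4).
    return random.choices([0, 1, 2, 3, 4], weights=[15, 30, 30, 15, 10])[0]

def run(num_threads):
    issued = 0
    for _ in range(CYCLES):
        slots = WIDTH
        for _ in range(num_threads):
            take = min(ready_uops(), slots)   # fill remaining slots from this thread
            issued += take
            slots -= take
    return issued / CYCLES

single, smt = run(1), run(2)
print(f"1 thread : {single:.2f} uops/cycle")
print(f"2 threads: {smt:.2f} uops/cycle  (+{100 * (smt / single - 1):.0f}%)")

This toy ignores the downsides of sharing (cache and queue contention between the two threads), which is why real SMT gains are usually much smaller than what it prints.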
Why are highways built in such a way that 80% of the time, half of their capacity is unused? You could halve the number of lanes in my local interstate highway, and for 140 hours a week, it would be perfectly fine. For the other ~30 hours though, you'll be very happy to have those 3rd and 4th lanes available for use! Engineers design to a maximum throughput number, but in the majority of circumstances there just won't be enough traffic to use 100% capacity.
If algorithms were perfectly parallel, memory accesses were perfectly predictable, and compilers were perfect at scheduling every integer and floating point pipeline to have one computation per cycle, then it would be possible to build a CPU with perfect utilization. But we don't live in such a world, and never will. For the record, x86-64 processors tend to hover around an IPC of 0.8-1.5 on optimized benchmark code.
IPC chart for selected SPEC 2006 components: http://www.marss86.org/index.php/File:Ipc_chart.png
Quote:
The above chart shows the accuracy of the MARSS simulator by comparing IPCs (Instructions-Commits Per Cycle) obtained from simulation against the IPCs realized in executing the same benchmark programs on two real implementations, an AMD Athlon X2 and an Intel Core-2 Duo for SPEC 2006 benchmarks. These IPC values are for only user space execution and does not contain any kernel simulation because the kernel execution paths are different on the MARSS VM and on the test machines used to gather above statistics. All the benchmarks are run from start to completion for the simulated runs and the runs on the real machines. We have used Linux Kernel's Performance Counters to get the IPCs realized on the actual hardware.
After thinking about it for a bit I realize that this IPC and ILP discussion might be taking the wrong direction in relation to BD. I'll elaborate more later but first a question: what's the average IPC of Nehalem?
^^ i think i might do some experiments on that. nothing fancy though.
I think you mean blocks, and no, that's not what I am saying. To reach 5 IPC you must predict the control flow instructions so that you can start working on the next block; otherwise your pipeline will be filled with NOPs. Control flow must be predicted correctly for ILP to reach 5 IPC. Increasing ILP within a block is a task for dynamic execution.
Can we get a die shot at least? Llano is the only 32nm die we've seen,
no Ontario or Bulldozer yet.
A bit higher than Core, and it depends on the app. David Kanter tested it in physics calculations and the best result was 2 IPC.
http://www.realworldtech.com/page.cf...0510142143&p=3
According to the author, once above 1 it is respectable. :) Chumbucket483 talks nonsense with his 5 IPC. Not even Itanium reaches that, and it's based on static compiling and the EPIC (Explicitly Parallel Instruction Computing) architecture. EPIC was designed to solve the ILP problem by doing optimizations at compile time and inserting hints into the code so the CPU knows what to do next.
Nope! You seem to be assuming there is only one execution unit per core. When a microarchitect says execution unit, he means an ALU or FPU. Each core of any out-of-order x86 CPU contains multiple integer execution units and multiple floating-point execution units. For example, the first out-of-order x86 chip, the Pentium Pro, started with two ALUs and one FPU. While each individual execution unit can only be executing one instruction at a time, there are multiple per core, and a Nehalem core can easily schedule two instructions from two different threads on two of its execution units in the same cycle.
The single-core Athlon XP had nine execution units.
Huynh, Jack (2003). "The AMD Athlon XP Processor with 512KB L2 Cache": http://courses.ece.uiuc.edu/ece512/Papers/Athlon.pdf
One Nehalem core has a set of nine execution units, much the same as Core 2.
Quote:
At the heart of QuantiSpeed architecture is a fully pipelined, nine-issue, superscalar processor core. The AMD Athlon XP processor provides a wider execution bandwidth of nine execution pipes when compared with competitive x86 processors with up to six execution pipes. The nine execution engines are comprised of three address calculation units, three integer units, and three floating-point units.
1x ALU Shift
1x ALU LEA
1x ALU Shift Branch
1x SSE Shuffle ALU
1x SSE Mul
1x SSE Shuffle ALU
1x 128-bit FMUL FDIV
1x 128-bit FADD
1x 128-bit FP Shuffle
Kanter, David (2008) "Inside Nehalem: Intel's Future Processor and System": http://www.realworldtech.com/page.cf...0208182719&p=6
Quote:
As with Core 2, the register alias table (RAT) points each architectural register into either the Re-Order Buffer (ROB) or the Retirement Register File (RRF) and holds the most recent speculative state (whereas the RRF holds the most recent non-speculative and committed state). The RAT can rename up to 4 uops each cycle, giving each one a destination register in the ROB. The renamed instructions then read their source operands and issue into the unified Reservation Station (RS), which is used by all instruction types.
Any instructions in the RS which have all their operands ready are dispatched to the execution units, which are largely unchanged from the Core 2 and unaffected by SMT, except for an increase in utilization.
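To make the RAT/ROB description concrete, here is a minimal renaming sketch (a simplification with hypothetical structures; real Nehalem renames micro-ops against many more resources and constraints):
Code:
# Minimal register-renaming sketch: the RAT maps each architectural register
# to the newest ROB entry that will produce it; renaming up to 4 uops per
# cycle removes WAR/WAW hazards among them.

RENAME_WIDTH = 4

rat = {}     # architectural reg -> ROB entry holding the newest (speculative) value
rob = []     # renamed uops in program order

def rename(uops):
    """uops: list of (dest_reg, [src_regs]); renames at most RENAME_WIDTH of them."""
    renamed = []
    for dest, srcs in uops[:RENAME_WIDTH]:
        # Sources read the newest mapping (a ROB entry) or the committed RRF value.
        src_tags = [rat.get(s, f"RRF:{s}") for s in srcs]
        rob_entry = f"ROB{len(rob)}"
        rob.append((rob_entry, dest, src_tags))
        rat[dest] = rob_entry            # later readers of 'dest' now see this entry
        renamed.append((rob_entry, dest, src_tags))
    return renamed

# Two writes to rax no longer conflict after renaming (the WAW hazard is gone):
for entry in rename([("rax", ["rbx"]),
                     ("rcx", ["rax"]),    # reads the ROB entry produced by uop 0
                     ("rax", ["rdx"])]):  # gets its own ROB entry
    print(entry)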