I'm curious why you think an engineering fact of life is too limiting. How do you think the CPU world would be different if this absolute claim were true, versus if it were false? Which world do you think we live in?
Making sure the hardware has no bottlenecks, and making sure all the hardware is in use at all times, are two incompatible requirements. Engineers have to strike a balance between them. From an engineer's point of view, neither extreme can happen.
Last edited by intangir; 09-21-2010 at 12:06 PM. Reason: corrected use of absolution ;)
Absolution in statements was the problem: saying always and never means 100% and 0%, while saying most, often, or sometimes can mean any percentage. Some things can be built with future-proofing in mind, assuming that it's not required currently but will be eventually. Making those assumptions can be profitable, and risky, but still realistic. And you're right that sometimes never is needed, like bridge capacity should never be met.
In this case, I meant never as in bridge-capacity-being-exceeded never. Yeah, sure, a meteorite could fall on the bridge so it's not technically never, but in practice you will never use all execution resources on a CPU 100% of the time in any real workload. If you even came close, you would be executing an artificially-constructed power virus that would cause the chip to exceed its TDP, and so be operating the CPU out of spec.
I mean, there's specialized hardware in each core for FDIV/FSQRT operations. What proportion of the time do you think that is in use, and would that ever happen at the same time that all the other FPUs and integer units are in use, and could that occur for a significant amount of time while saturating the load/store units and activating every cache line at the same time? Should engineers even account for that case? No!
I think part of the confusion is that you are talking about average IPC and he is talking about maximum IPC.
If code could never have an ILP beyond 2, then there would be no point in having more than 2 EUs in a non-SMT processor. Instead it should be clear that there are times when all the EUs are utilized and other times when few or none are in use.
So what do you do with the inefficiency of having idle EU after you have extracted as much ILP as you can? Intel fills them with instructions from other threads. An elegant solution, IMO, but not the only possible one. For one thing, you are still going to be severely limited by parts of the code where there is no parallelism to extract (explicit or implicit) as per Amdahl's law.
After looking at a bunch of AMD slides it seems like they think we are hitting a wall in ILP and their ability to extract it. They have explicit parallelism covered with lots of cores. But what if they aren't even going to try that hard to extract implicit parallelism? Through fine grain clock and power control what if you could power down unused EU? You could use the same controls and the remaining power budget to clock higher the EU that are being utilized. That would avoid some of the pitfalls of Amdahl's law by speeding up the serial sections but having the EU available for times when there is parallelism.
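The Amdahl's law reasoning above can be made concrete with a short sketch. The numbers here are invented for illustration; the `serial_boost` parameter is a hypothetical model of the idea of powering down idle EUs and spending the freed power budget on clocking the serial sections higher.

```python
# Illustrative Amdahl's law sketch (made-up numbers): speedup from n cores is
# capped by the serial fraction, and boosting the clock on serial sections
# (the power-shifting idea above) raises that cap.

def amdahl_speedup(parallel_fraction, n_cores, serial_boost=1.0):
    """Speedup over a 1-core baseline. serial_boost > 1 models running
    the serial sections at a higher clock."""
    serial = (1.0 - parallel_fraction) / serial_boost
    parallel = parallel_fraction / n_cores
    return 1.0 / (serial + parallel)

# 90% parallel code on 8 cores:
base = amdahl_speedup(0.9, 8)                        # ~4.7x, nowhere near 8x
# Same code with the serial sections clocked 1.5x higher:
boosted = amdahl_speedup(0.9, 8, serial_boost=1.5)   # ~5.6x
```

Even a modest boost to the serial sections buys more than adding cores would at this point, which is the appeal of the idea.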
.....me and my big mouth..."sigh".
And what is maximum IPC? How do you set a limit on that? The literature has discussed cases on simulated CPUs where, in theory at least, you could hit over 100 IPC.
I'm talking real life, where going over 1 IPC is good and going over 2 is exceptional.
It doesn't work like that. Few instructions are single-cycle. Most need more cycles and are complex enough that they are broken into uops, which can be processed either sequentially or in parallel depending on data dependencies, with the final result being reconstructed at the end. So you have more EUs busy with a single instruction. IPC is 1 or even less, yet there is activity in the core.
Secondly, what are the chances of using 9-10 execution units at the same time? We're talking about integer units, x87 FP units, FP vector units? Virtually none with a single thread. Only by employing something like SMT could you up the utilization.
SMT brings thread-level parallelism on top of ILP. That is the most elegant solution you can have for increasing core utilization. If even 2 threads aren't enough to fill all your units, then go for 4 or 8.
POWER7 has SMT with 4 threads per core; Niagara has 8 per core (though they use a different form of MT).
Your idea saves power but makes a mess of the clock domains in the chip. Basically, every time you change a clock domain or power up units, you insert latency. Such an approach is OK if power is your main concern.
But if you're doing high-performance MPUs, you want to increase utilization of what you already have, not try to adapt power usage to what's being run (I mean at the EU level; at core and chip level this is already being done with SpeedStep- and Turbo-like technologies).
By average, I don't think he meant in a specialized programming language running on a theoretical processor. He probably meant that in average programs today the max you will find is ~5. Though we can't be sure unless he wants to come back and clarify.
Your 1 or 2 IPC is, again, an average, not the maximum or minimum. During execution of a program you might have some parts with >2 IPC and some with 0.
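The average-versus-peak distinction can be shown with some toy numbers (invented for illustration): a run-wide average IPC near 1 is perfectly consistent with short bursts well above 2.

```python
# Toy illustration (made-up numbers): a program's IPC varies by phase, so the
# run-wide average can sit at 1.0 even though one phase bursts to 3.0.

phases = [
    # (instructions retired, cycles taken)
    (400, 1000),   # pointer-chasing phase: IPC 0.4
    (3000, 1000),  # unrolled compute burst: IPC 3.0
    (600, 2000),   # cache-miss-bound phase: IPC 0.3
]

total_insns = sum(i for i, c in phases)
total_cycles = sum(c for i, c in phases)
average_ipc = total_insns / total_cycles    # 4000 / 4000 = 1.0
peak_ipc = max(i / c for i, c in phases)    # 3.0 in the burst
```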
Right, I program in assembly, and I understand that most instructions take more than one cycle to complete and are broken into μops. You are just bolstering my point, though. There is only so much ILP to extract from any given section of code, whether you are talking about the x86 instructions or the resulting μops. At some point you are limited by the speed of serial execution.
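The dependency limit described above can be sketched with a tiny dataflow scheduler. The ops and their dependencies are invented; the point is that even with unlimited EUs, the critical path through the dependency chains caps achievable IPC.

```python
# Sketch of why dependencies cap ILP even on a machine with unlimited
# execution units: schedule each op as early as its inputs allow and see
# how many cycles the chains force. All ops take 1 cycle.

def min_cycles(ops):
    """ops: list of (name, [dependency names]), listed in dependency order.
    Returns (critical path length in cycles, max achievable IPC)."""
    finish = {}
    for name, deps in ops:
        # An op can finish one cycle after its latest input is ready.
        finish[name] = 1 + max((finish[d] for d in deps), default=0)
    depth = max(finish.values())
    return depth, len(ops) / depth

# Serial chain: each op needs the previous result -> ILP is 1.
chain = [("a", []), ("b", ["a"]), ("c", ["b"]), ("d", ["c"])]
# Independent ops: all four can issue together -> ILP is 4.
wide = [("a", []), ("b", []), ("c", []), ("d", [])]
```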
SMT is nice, but throwing extra threads at a core is an exercise in diminishing returns. It doesn't come remotely close to running those 4 or 8 threads on independent cores. And it doesn't do anything for the single-threaded case.
If you have a long pipeline, you should already have enough latency to get the units powered back up by the time the instruction needs them.
The customer wants the processor that will execute their program(s) the fastest with acceptable power consumption and heat; the path taken doesn't matter. If all else is equal, of course you'd want higher core utilization. But all else isn't equal. The customer isn't going to care if a processor has high per-core utilization if it gets outperformed by a processor with lower per-core utilization but more cores and/or higher frequency in a similar power envelope.
So how long does it take then?
Both.
Well look at the table here, for example:
http://www.anandtech.com/show/2919/4
Say we have a 2GHz part: then the fastest change (getting the clock distribution stopped) is 1 microsecond, or 2000 cycles. Getting the power gate shut off is 60 microseconds, or 120,000 cycles. Now, this is for a larger element, but still...
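The cycle counts above, spelled out: at 2 GHz a cycle is 0.5 ns, so the wakeup latencies translate to cycles as follows.

```python
# Translating the latencies from the linked table into cycles at 2 GHz.
# Integer math in nanoseconds keeps the arithmetic exact.

freq_hz = 2_000_000_000  # 2 GHz part

def cycles(nanoseconds):
    """Cycles elapsed in the given wall time at freq_hz."""
    return nanoseconds * freq_hz // 1_000_000_000

clock_stop_cycles = cycles(1_000)    # 1 us clock-dist stop  -> 2000 cycles
power_gate_cycles = cycles(60_000)   # 60 us power-gate off  -> 120000 cycles
```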
I don't think the notion is feasible.
Changing the C-state of a core isn't exactly analogous to shutting down a currently unused subunit. How much of flushing pipelines, halting threads, turning off core PLL, and flushing caches would you have to do to turn off a subunit?
I was definitely thinking of something more dynamic than changing a C-state. That's why I was wondering if we could get finer-grained power and clock control than just the whole-core level. How fast is this Intel cache sizing?
Sorry for digging out this quote. I just found a paper by J. and A. González, now working at Intel Labs Barcelona, with some max IPC data for an Alpha-like machine with infinite resources (instruction window, rename registers, perfect predictions, etc.). They got IPC numbers from the 10s to the 1000s over the different SPEC95 sub-benchmarks using real compiled code traces. I think the difficult point to grasp is that, e.g., different loop iterations might work on independent data, and with perfect branch/loop prediction you could execute future loop iterations in parallel with the current one. Ideally the whole loop could be executed in parallel.
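The loop-iteration point can be sketched with invented numbers: on an idealized machine (infinite units, perfect prediction, 1-cycle ops), max IPC is total ops divided by the critical path, so independent iterations let IPC grow with iteration count, while a loop-carried dependence serializes everything.

```python
# Sketch (not the paper's model): ideal-machine max IPC for a loop,
# contrasting independent iterations with a loop-carried dependence.

def ideal_ipc(iterations, ops_per_iter, loop_carried):
    """Max IPC on an idealized machine with 1-cycle ops.
    loop_carried: each iteration must wait for the previous one to finish."""
    total_ops = iterations * ops_per_iter
    if loop_carried:
        critical_path = iterations * ops_per_iter  # fully serialized
    else:
        critical_path = ops_per_iter               # all iterations overlap
    return total_ops / critical_path

independent = ideal_ipc(1000, 4, loop_carried=False)  # whole loop in parallel
dependent = ideal_ipc(1000, 4, loop_carried=True)     # pinned at 1
```

This is where IPC numbers in the 1000s come from: they measure dataflow limits, not anything a real machine could sustain.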
Here is the paper:
"Limits of instruction level parallelism with data speculation"
http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
Just an FYI, absolution makes zero sense in the context you are using it.
ab·so·lu·tion /ˌæbsəˈluʃən/ [ab-suh-loo-shuhn]
–noun
1. act of absolving; a freeing from blame or guilt; release from consequences, obligations, or penalties.
2. state of being absolved.
3. Roman Catholic Theology .
a. a remission of sin or of the punishment for sin, made by a priest in the sacrament of penance on the ground of authority received from Christ.
b. the formula declaring such remission.
4. Protestant Theology . a declaration or assurance of divine forgiveness to penitent believers, made after confession of sins.
http://dictionary.reference.com/brow...ution?o=100074
Re: The use of absolution
Please see: http://www.xtremesystems.org/forums/...&postcount=169
^Did you really just reply to something from 2 months ago with just a definition, because the words absolute and absolution were used wrongly?
If you read on, the very next post goes over the use of the word.
Oh dear. EPIC failed epically. ILP is a wasteful area to research anyway, and on top of that, data-parallel is where computing is headed. VLIW created new problems on top of the old ones it kind of fixed. Most of Itanium's execution time is spent waiting on memory, the caches to be more specific. The biggest issue perhaps is that compile time does not provide the information that runtime will. Very few microarchitectural optimizations are possible without destroying software compatibility. Sure, static instruction grouping will reduce hardware complexity, but that's not a huge issue until cores are much larger than today's. Essentially the whole idea behind Itanium and its fruition is a massive clusterf***.
Sorry for the late reply. It's way more than I would know without inside information, and this is hard to explain without background knowledge.
Power gating is done at a very high level because it creates a lot of timing issues and requires careful transistor sizing.
Clock gating can be much more granular; a wild guess would be ~100K latches for Nehalem. Latches basically sequence the clock and make sure there are no race conditions. Then you have logic in between the latches; in general, the less logic, the less delay, which means higher clock speeds. You can add an AND gate at the latch, with the clock and a state signal from the microarchitecture as inputs, to turn the clock off between those latches. This will add a few picoseconds of delay on a modern process. How much delay the latches add in total is extremely difficult to know. Adding more clock gating will increase delay and create signal noise, which can harm sensitive circuits like SRAM and analog circuits.
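The AND-gate idea above can be modeled in a few lines. This is a toy behavioral sketch with an invented signal pattern, not real circuit design: the latch only sees clock edges while the enable is high, and since dynamic power is burned on toggles, the edges that never arrive are power saved.

```python
# Toy model of AND-gate clock gating: gated_clk = clk AND enable.
# Counts rising edges reaching the latch; fewer edges = less dynamic power.

def gated_toggles(clock, enable):
    """Count rising edges of (clock AND enable), sampled per half-cycle."""
    gated = [c & e for c, e in zip(clock, enable)]
    return sum(1 for prev, cur in zip(gated, gated[1:])
               if prev == 0 and cur == 1)

clk = [0, 1] * 8                          # 8 clock pulses
en = [1] * 4 + [0] * 8 + [1] * 4          # unit idle for the middle 4 pulses
busy_edges = gated_toggles(clk, [1] * 16) # ungated: latch sees all 8 edges
gated_edges = gated_toggles(clk, en)      # gated: only 4 edges arrive
```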
You continued to misuse it here:
http://www.xtremesystems.org/forums/...&postcount=153
The other person I quoted below my first reply to you also used it incorrectly.
Also, the very next post after your post that I quoted (http://www.xtremesystems.org/forums/...&postcount=149) makes no mention of the misuse of the word, so I'm not sure where you are getting that.
No offense, but if the English isn't clear and no one points it out, the mistake will continue to be made, as we can see with multiple posts in this thread now using "absolution" rather than absolute (by more than one person). Had I not intervened, you would have gone on never realizing the difference. You're welcome.
I've always been bad with some words; I will make plenty more mistakes in the future.
And while I didn't correct my misuse, I did define the perspective:
Hopefully reading what I defined, rather than picking a word and ignoring the definition, would have prevented any confusion.