I'm curious why you think an engineering fact of life is too limiting. How do you think the CPU world would be different if this absolute claim were true, versus if it were false? Which world do you think we live in?
Making sure the hardware has no bottlenecks, and making sure all the hardware is in use at all times, are two incompatible requirements. Engineers have to strike a balance between them. From an engineer's point of view, neither extreme can happen.
Last edited by intangir; 09-21-2010 at 12:06 PM. Reason: corrected use of absolution ;)
Absolution in statements was the problem: saying always and never means 100% and 0%, while saying most, often, or sometimes can mean any percentage. Some things can be built with future-proofing in mind, assuming that it's not required currently but will be eventually. Making those assumptions can be profitable, and risky, but still realistic. And you're right that sometimes never is needed, like bridge capacity should never be met.
In this case, I meant never as in bridge-capacity-being-exceeded never. Yeah, sure, a meteorite could fall on the bridge so it's not technically never, but in practice you will never use all execution resources on a CPU 100% of the time in any real workload. If you even came close, you would be executing an artificially-constructed power virus that would cause the chip to exceed its TDP, and so be operating the CPU out of spec.
I mean, there's specialized hardware in each core for FDIV/FSQRT operations. What proportion of the time do you think that is in use, and would that ever happen at the same time that all the other FPUs and integer units are in use, and could that occur for a significant amount of time while saturating the load/store units and activating every cache line at the same time? Should engineers even account for that case? No!
I think part of the confusion is that you are talking about average IPC and he is talking about maximum IPC.
If code could never have an ILP beyond 2, then there would be no point in having more than 2 EUs in a non-SMT processor. Instead it should be clear that there are times when all the EUs are utilized and other times when few or none are in use.
So what do you do with the inefficiency of having idle EU after you have extracted as much ILP as you can? Intel fills them with instructions from other threads. An elegant solution, IMO, but not the only possible one. For one thing, you are still going to be severely limited by parts of the code where there is no parallelism to extract (explicit or implicit) as per Amdahl's law.
After looking at a bunch of AMD slides it seems like they think we are hitting a wall in ILP and their ability to extract it. They have explicit parallelism covered with lots of cores. But what if they aren't even going to try that hard to extract implicit parallelism? Through fine grain clock and power control what if you could power down unused EU? You could use the same controls and the remaining power budget to clock higher the EU that are being utilized. That would avoid some of the pitfalls of Amdahl's law by speeding up the serial sections but having the EU available for times when there is parallelism.
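The Amdahl's law reasoning above can be made concrete with a short sketch. The numbers here are invented for illustration; the `serial_boost` parameter is a hypothetical model of the idea of powering down idle EUs and spending the freed power budget on clocking the serial sections higher.

```python
# Illustrative Amdahl's law sketch (made-up numbers): speedup from n cores is
# capped by the serial fraction, and boosting the clock on serial sections
# (the power-shifting idea above) raises that cap.

def amdahl_speedup(parallel_fraction, n_cores, serial_boost=1.0):
    """Speedup over a 1-core baseline. serial_boost > 1 models running
    the serial sections at a higher clock."""
    serial = (1.0 - parallel_fraction) / serial_boost
    parallel = parallel_fraction / n_cores
    return 1.0 / (serial + parallel)

# 90% parallel code on 8 cores:
base = amdahl_speedup(0.9, 8)                        # ~4.7x, nowhere near 8x
# Same code with the serial sections clocked 1.5x higher:
boosted = amdahl_speedup(0.9, 8, serial_boost=1.5)   # ~5.6x
```

Even a modest boost to the serial sections buys more than adding cores would at this point, which is the appeal of the idea.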
.....me and my big mouth..."sigh".
And what is maximum IPC? How do you set a limit on that? The literature has discussed cases on simulated CPUs where, in theory at least, you could hit over 100 IPC.
I'm talking real life, where going over 1 IPC is good and going over 2 is exceptional.
It doesn't work like that. Few instructions are single-cycle. Most need more cycles and are complex enough that they are broken into uops, which can be processed either sequentially or in parallel depending on data dependencies, with the final result being reconstructed at the end. So you have more EUs busy with a single instruction. IPC is 1 or even less, yet there is activity in the core.
Secondly, what are the chances of using 9-10 execution units at the same time? We're talking about integer units, x87 FP units, FP vector units? Virtually none with a single thread. Only by employing something like SMT could you up the utilization.
SMT brings thread-level parallelism on top of ILP. That is the most elegant solution you can have for increasing core utilization. If even 2 threads aren't enough to fill all your units, then go for 4 or 8.
POWER7 has SMT with 4 threads per core; Niagara has 8 per core (though they use a different form of MT).
Your idea saves power but makes a mess of the clock domains in the chip. Basically, every time you change a clock domain or power up units, you insert latency. Such an approach is OK if power is your main concern.
But if you're doing high-performance MPUs, you want to increase utilization of what you already have, not try to adapt power usage to what's being run (I mean at the EU level; at core and chip level this is already being done with SpeedStep- and Turbo-like technologies).
By average, I don't think he meant in a specialized programming language running on a theoretical processor. He probably meant that in average programs today the max you will find is ~5. Though we can't be sure unless he wants to come back and clarify.
Your 1 or 2 IPC is, again, an average, not the maximum or minimum. During execution of a program you might have some parts with >2 IPC and some with 0.
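The average-versus-peak distinction can be shown with some toy numbers (invented for illustration): a run-wide average IPC near 1 is perfectly consistent with short bursts well above 2.

```python
# Toy illustration (made-up numbers): a program's IPC varies by phase, so the
# run-wide average can sit at 1.0 even though one phase bursts to 3.0.

phases = [
    # (instructions retired, cycles taken)
    (400, 1000),   # pointer-chasing phase: IPC 0.4
    (3000, 1000),  # unrolled compute burst: IPC 3.0
    (600, 2000),   # cache-miss-bound phase: IPC 0.3
]

total_insns = sum(i for i, c in phases)
total_cycles = sum(c for i, c in phases)
average_ipc = total_insns / total_cycles    # 4000 / 4000 = 1.0
peak_ipc = max(i / c for i, c in phases)    # 3.0 in the burst
```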
Right, I program in assembly, and I understand that most instructions take more than one cycle to complete and are broken into μops. You are just bolstering my point, though. There is only so much ILP to extract from any given section of code, whether you are talking about the x86 instructions or the resulting μops. At some point you are limited by the speed of serial execution.
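The dependency limit described above can be sketched with a tiny dataflow scheduler. The ops and their dependencies are invented; the point is that even with unlimited EUs, the critical path through the dependency chains caps achievable IPC.

```python
# Sketch of why dependencies cap ILP even on a machine with unlimited
# execution units: schedule each op as early as its inputs allow and see
# how many cycles the chains force. All ops take 1 cycle.

def min_cycles(ops):
    """ops: list of (name, [dependency names]), listed in dependency order.
    Returns (critical path length in cycles, max achievable IPC)."""
    finish = {}
    for name, deps in ops:
        # An op can finish one cycle after its latest input is ready.
        finish[name] = 1 + max((finish[d] for d in deps), default=0)
    depth = max(finish.values())
    return depth, len(ops) / depth

# Serial chain: each op needs the previous result -> ILP is 1.
chain = [("a", []), ("b", ["a"]), ("c", ["b"]), ("d", ["c"])]
# Independent ops: all four can issue together -> ILP is 4.
wide = [("a", []), ("b", []), ("c", []), ("d", [])]
```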
SMT is nice, but throwing extra threads at a core is an exercise in diminishing returns. It doesn't come remotely close to running those 4 or 8 threads on independent cores. And it doesn't do anything for the single-threaded case.
If you have a long pipeline, you should already have enough latency to get the units powered back up by the time the instruction needs them.
The customer wants the processor that will execute their program(s) the fastest with acceptable power consumption and heat; the path taken doesn't matter. If all else is equal, of course you'd want higher core utilization. But all else isn't equal. The customer isn't going to care if a processor has high per-core utilization if it gets outperformed by a processor with lower per-core utilization but more cores and/or higher frequency in a similar power envelope.
So how long does it take then?
Both.
Well look at the table here, for example:
http://www.anandtech.com/show/2919/4
Say we have a 2GHz part: then the fastest change (getting the clock distribution stopped) is 1 microsecond, or 2000 cycles. Getting the power gate shut off is 60 microseconds, or 120,000 cycles. Now, this is for a larger element, but still...
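The cycle counts above, spelled out: at 2 GHz a cycle is 0.5 ns, so the wakeup latencies translate to cycles as follows.

```python
# Translating the latencies from the linked table into cycles at 2 GHz.
# Integer math in nanoseconds keeps the arithmetic exact.

freq_hz = 2_000_000_000  # 2 GHz part

def cycles(nanoseconds):
    """Cycles elapsed in the given wall time at freq_hz."""
    return nanoseconds * freq_hz // 1_000_000_000

clock_stop_cycles = cycles(1_000)    # 1 us clock-dist stop  -> 2000 cycles
power_gate_cycles = cycles(60_000)   # 60 us power-gate off  -> 120000 cycles
```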
I don't think the notion is feasible.
Changing the C-state of a core isn't exactly analogous to shutting down a currently unused subunit. How much of flushing pipelines, halting threads, turning off core PLL, and flushing caches would you have to do to turn off a subunit?
I was definitely thinking of something more dynamic than changing a C-state. That's why I was wondering if we could get finer-grained power and clock control than just the whole-core level. How fast is this Intel cache sizing?
Sorry for digging out this quote. I just found a paper by J. and A. González, now working at Intel Labs Barcelona, with some max IPC data for an Alpha-like machine with infinite resources (instruction window, rename registers, perfect predictions, etc.). They got IPC numbers from the 10s to the 1000s over the different SPEC95 sub-benchmarks using real compiled code traces. I think the difficult point to grasp is that, e.g., different loop iterations might work on independent data, and with perfect branch/loop prediction you could execute future loop iterations in parallel with the current one. Ideally the whole loop could be executed in parallel.
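The loop-iteration point can be sketched with invented numbers: on an idealized machine (infinite units, perfect prediction, 1-cycle ops), max IPC is total ops divided by the critical path, so independent iterations let IPC grow with iteration count, while a loop-carried dependence serializes everything.

```python
# Sketch (not the paper's model): ideal-machine max IPC for a loop,
# contrasting independent iterations with a loop-carried dependence.

def ideal_ipc(iterations, ops_per_iter, loop_carried):
    """Max IPC on an idealized machine with 1-cycle ops.
    loop_carried: each iteration must wait for the previous one to finish."""
    total_ops = iterations * ops_per_iter
    if loop_carried:
        critical_path = iterations * ops_per_iter  # fully serialized
    else:
        critical_path = ops_per_iter               # all iterations overlap
    return total_ops / critical_path

independent = ideal_ipc(1000, 4, loop_carried=False)  # whole loop in parallel
dependent = ideal_ipc(1000, 4, loop_carried=True)     # pinned at 1
```

This is where IPC numbers in the 1000s come from: they measure dataflow limits, not anything a real machine could sustain.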
Here is the paper:
"Limits of instruction level parallelism with data speculation"
http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
Just an FYI, absolution makes zero sense in the context you are using it.
ab·so·lu·tion /ˌæbsəˈluʃən/ [ab-suh-loo-shuhn]
–noun
1. act of absolving; a freeing from blame or guilt; release from consequences, obligations, or penalties.
2. state of being absolved.
3. Roman Catholic Theology .
a. a remission of sin or of the punishment for sin, made by a priest in the sacrament of penance on the ground of authority received from Christ.
b. the formula declaring such remission.
4. Protestant Theology . a declaration or assurance of divine forgiveness to penitent believers, made after confession of sins.
http://dictionary.reference.com/brow...ution?o=100074
Re: The use of absolution
Please see: http://www.xtremesystems.org/forums/...&postcount=169
^Did you really just reply to something from 2 months ago with just a definition, because the words absolute and absolution were used wrongly?
If you read on, the very next post goes over the use of the word.
Oh dear. EPIC failed epically. ILP is a wasteful area to research anyway, and on top of that, data-parallel is where computing is headed. VLIW created new problems on top of the old ones it kind of fixed. Most of Itanium's execution time is spent waiting on memory, the caches to be more specific. The biggest issue perhaps is that compile time does not provide the information that runtime will. Very few microarchitectural optimizations are possible without destroying software compatibility. Sure, static instruction grouping will reduce hardware complexity, but that's not a huge issue until cores are much larger than today's. Essentially the whole idea behind Itanium and its fruition is a massive clusterf***.
Sorry for the late reply. It's way more than I would know without inside information, and this is hard to explain without background knowledge.
Power gating is done at a very high level because it creates a lot of timing issues and requires careful transistor sizing.
Clock gating can be much more granular; a wild guess would be ~100K latches for Nehalem. Latches basically sequence the clock and make sure there are no race conditions. Then you have logic in between the latches; in general, the less logic, the less delay, which means higher clock speeds. You can add an AND gate at the latch, with the clock and a state signal from the microarchitecture as inputs, to turn the clock off between those latches. This will add a few picoseconds of delay on a modern process. How much delay the latches add in total is extremely difficult to know. Adding more clock gating will increase delay and create signal noise, which can harm sensitive circuits like SRAM and analog circuits.
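The AND-gate idea above can be modeled in a few lines. This is a toy behavioral sketch with an invented signal pattern, not real circuit design: the latch only sees clock edges while the enable is high, and since dynamic power is burned on toggles, the edges that never arrive are power saved.

```python
# Toy model of AND-gate clock gating: gated_clk = clk AND enable.
# Counts rising edges reaching the latch; fewer edges = less dynamic power.

def gated_toggles(clock, enable):
    """Count rising edges of (clock AND enable), sampled per half-cycle."""
    gated = [c & e for c, e in zip(clock, enable)]
    return sum(1 for prev, cur in zip(gated, gated[1:])
               if prev == 0 and cur == 1)

clk = [0, 1] * 8                          # 8 clock pulses
en = [1] * 4 + [0] * 8 + [1] * 4          # unit idle for the middle 4 pulses
busy_edges = gated_toggles(clk, [1] * 16) # ungated: latch sees all 8 edges
gated_edges = gated_toggles(clk, en)      # gated: only 4 edges arrive
```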
You continued to misuse it here:
http://www.xtremesystems.org/forums/...&postcount=153
The other person I quoted below my first reply to you also used it incorrectly.
Also, the very next post after your post that I quoted (http://www.xtremesystems.org/forums/...&postcount=149) makes no mention of the misuse of the word, so I'm not sure where you are getting that.
No offense, but if the English isn't clear and no one points it out, the mistake will continue to be made, as we can see with multiple posts in this thread now using "absolution" rather than absolute (by more than one person). Had I not intervened, you would have gone on never realizing the difference. You're welcome.
I've always been bad with some words; I will make plenty more mistakes in the future.
And while I didn't correct my misuse, I did define the perspective:
Hopefully reading what I defined, rather than picking a word and ignoring the definition, would have prevented any confusion.