
Thread: AMD Tapes Out First "Bulldozer" Microprocessors.

  1. #126
    Xtreme Enthusiast
    Join Date
    Oct 2008
    Posts
    678
    Quote Originally Posted by savantu View Post


    Where did they mention 12.5%?
    Another thing to keep in mind is that those percentage numbers vary greatly depending on what you are calculating against.

    50% extra core space could be 12% extra die space. There are lots of things on a die other than cores.

  2. #127
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by savantu View Post
    Who said that?
    At most, they were the first to successfully implement it in a commercial large volume product.
    That's correct. But you originally wrote:
    Quote Originally Posted by savantu
    Maybe instead of amateur sources and interpretations, we should look into real technical articles, written by the people who invented these technologies and published at conferences and in tech journals.
    and posted a paper made by Intel.

    Quote Originally Posted by savantu View Post
    There is no misunderstanding. They are active at the same time. Even with the threads being decoded and retired in alternating cycles, it doesn't say anything about the parallel execution. Since we're talking, after all, about a shared resource, it is logical that some arbitration exists at different levels.

    Says who? You're basically narrowing the definition of SMT so you can build an argument on a corner case which would go against the new definition. How do you want to have a discussion when you're changing the definition of things so it fits your stance?
    Talk about a logical fallacy the size of Everest.
    I wrote: "Since SMT actually is about simultaneously issuing instructions of multiple threads to a set of EUs to make better use of them, [...]"

    And it's not only me who points to the simultaneous execution of instructions from multiple threads on a set of EUs:
    Simultaneous multi-threading is the ability of a single physical processor to simultaneously dispatch instructions from more than one hardware thread context at the same time. Simultaneous multi-threading is designed to produce performance benefits in commercial environments and for workloads that have a high Cycles Per Instruction (CPI) count.
    http://publib.boulder.ibm.com/infoce...2/iphb2smt.htm

    Maybe you use a different definition of SMT? Could you please post it, so that I can see if there is a difference?

    Quote Originally Posted by savantu View Post
    Both cache hit rates and branch predictors are in the mid-90% range for modern CPUs. Yet they struggle to get above 1.5 IPC. The reason is simple: ILP is truly missing.
    Most code is written in a sequential manner and data dependencies are the norm.
    But we left the times of accumulator architectures behind us. There are 16 GPRs and 16 XMM/YMM registers. You can do a lot of calculations using these that are independent of each other over a span of a few instructions. This is what OoO execution can exploit.

    The low branch or cache miss rates don't mean that they only have a small effect on IPC. For branch prediction:
    Let's assume:
    14 cycles of branch misprediction penalty
    avg. IPC of 3 (with 4 decode/issue etc.) as long as there is no misprediction - CPI = 1/IPC = 0.33
    BP hit rate = 95%
    percentage of branch insts: 20%

    CPI = 0.33 + 0.2*0.05*14 = 0.47 or an avg IPC of 2.1

    Now let's add the cache:
    L1D$ hit rate: 95%
    miss latency: 9 cycles (L2 hit)
    percentage of mem accesses: 30%

    CPI = 0.33 + 0.3*0.05*9 = 0.465 or an avg IPC of 2.15

    Now both effects together:
    CPI = 0.33 + 0.2*0.05*14 + 0.3*0.05*9 = 0.605 or an avg IPC of 1.65

    I still left out L3, mem and other effects. But this should be enough to see the effect of a few percent of miss rates.
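    If you want to play with these assumptions, the same back-of-the-envelope model fits in a few lines of Python (the rates and penalties below are just the assumed numbers from above, not measurements):

        # Simple stall-accounting model: CPI = base CPI + per-instruction stall cycles
        def effective_ipc(base_cpi=0.33,
                          branch_frac=0.20, bp_miss=0.05, bp_penalty=14,
                          mem_frac=0.30, l1_miss=0.05, l1_penalty=9):
            cpi = base_cpi
            cpi += branch_frac * bp_miss * bp_penalty   # branch misprediction stalls
            cpi += mem_frac * l1_miss * l1_penalty      # L1D misses that hit in L2
            return 1.0 / cpi

        print(round(effective_ipc(), 2))                # ~1.65, matching the combined case above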

    Another reason for more EUs than the average use requires is if you have a checkpointing and replaying architecture. Sometimes speculation goes wrong and it's good to be able to replay instructions quickly.

    Quote Originally Posted by savantu View Post
    I am sure future CPUs will include run-ahead and advanced replay uarchs. But it is possible that OoO execution will hinder such approaches. The first CPU with scout threads (run-ahead execution), Sun Rock, was a major failure performance-wise.
    I don't think that OoO execution would be a hindrance. It's the actual execution of code, after all. But it stalls sometimes, which leaves room for more advanced techniques. And even if not, lower-priority handling of runahead execution (e.g. executing future mem ops as prefetches) would work.

    Sun Rock added more to the design than just scout threads, and there is not just one way to implement runahead execution. It's like saying: "Car A will be a failure because it uses 4 wheels and a motor like car B, which failed." There are other statements:
    An efficient runahead execution processor employing these techniques executes only 6.2% more instructions than a conventional out-of-order execution processor but achieves 22.1% higher Instructions Per Cycle (IPC) performance.
    From http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  3. #128
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Dresdenboy View Post
    That's correct. But you originally wrote:
    and posted a paper made by Intel.
    The two weren't connected; I should have been more explicit.

    Quote Originally Posted by Dresdenboy View Post
    I wrote: "Since SMT actually is about simultaneously issuing instructions of multiple threads to a set of EUs to make better use of them, [...]"

    And it's not only me who points to the simultaneous execution of instructions from multiple threads on a set of EUs:

    http://publib.boulder.ibm.com/infoce...2/iphb2smt.htm

    Maybe you use a different definition of SMT? Could you please post it, so that I can see if there is a difference?
    Now we're in perfect agreement. My comment was directed at the emphasis on the limitation of the decode stage in the P4.

    The definition that I would use is :

    Simultaneous multithreading (SMT) is a variation on multithreading that uses the resources of a multiple-issue, dynamically-scheduled processor to exploit TLP at the same time it exploits ILP. The key insight that motivates SMT is that modern multiple-issue processors often have more functional unit parallelism available than a single thread can effectively use. Furthermore, with register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to the dependences among them; the resolution of the dependences can be handled by the dynamic scheduling capability.
    Computer Architecture: A Quantitative Approach, 3rd Edition, by John L. Hennessy and David A. Patterson.


    Quote Originally Posted by Dresdenboy View Post
    But we left the times of accumulator architectures behind us. There are 16 GPRs and 16 XMM/YMM registers. You can do a lot of calculations using these that are independent of each other over a span of a few instructions. This is what OoO execution can exploit.

    The low branch or cache miss rates don't mean that they only have a small effect on IPC. For branch prediction:
    Let's assume:
    14 cycles of branch misprediction penalty
    avg. IPC of 3 (with 4 decode/issue etc.) as long as there is no misprediction - CPI = 1/IPC = 0.33
    BP hit rate = 95%
    percentage of branch insts: 20%

    CPI = 0.33 + 0.2*0.05*14 = 0.47 or an avg IPC of 2.1

    Now let's add the cache:
    L1D$ hit rate: 95%
    miss latency: 9 cycles (L2 hit)
    percentage of mem accesses: 30%

    CPI = 0.33 + 0.3*0.05*9 = 0.465 or an avg IPC of 2.15

    Now both effects together:
    CPI = 0.33 + 0.2*0.05*14 + 0.3*0.05*9 = 0.605 or an avg IPC of 1.65

    I still left out L3, mem and other effects. But this should be enough to see the effect of a few percent of miss rates.

    Another reason for more EUs than the average use requires is if you have a checkpointing and replaying architecture. Sometimes speculation goes wrong and it's good to be able to replay instructions quickly.

    By your examples, IPC is 1.5-2 at best out of a theoretical maximum of 4. SMT could really help in such a scenario.

    Quote Originally Posted by Dresdenboy View Post
    I don't think that OoO execution would be a hindrance. It's the actual execution of code, after all. But it stalls sometimes, which leaves room for more advanced techniques. And even if not, lower-priority handling of runahead execution (e.g. executing future mem ops as prefetches) would work.

    Sun Rock added more to the design than just scout threads, and there is not just one way to implement runahead execution. It's like saying: "Car A will be a failure because it uses 4 wheels and a motor like car B, which failed." There are other statements:

    From http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
    I know that Rock added far more, it was a quick example.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  4. #129
    I am Xtreme
    Join Date
    Dec 2007
    Posts
    7,750
    I think we can agree that SMT offers more than a 1% IPC gain per 1% increase in core size,
    and I think we can also agree that some people are very negative about CMT, even though it's not even out yet.

  5. #130
    Xtreme Mentor
    Join Date
    Jul 2008
    Location
    Shimla , India
    Posts
    2,631
    I would like to point out that Hyper-Threading does not really run two threads in parallel; it actually means that the OS is free to schedule two threads and run either one when the situation is right for whichever thread.

    E.g.:

    T1 and T2 running on CPU
    T1 and T2 scheduled to run by the OS on an HTT-enabled CPU
    T1 has the first run
    T1 uses R1, R2 AND R3
    R1, R2 and R3 locked by T1 since T1 is doing work
    T1 requests time on the CPU, leading to a timeout
    T2 gets CPU time and the resources it needs, like R1, R2 OR R3

    This is how HTT runs, simple as that.
    Coming Soon

  6. #131
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by savantu View Post
    Both cache hit rates and branch predictors are in the mid-90% range for modern CPUs. Yet they struggle to get above 1.5 IPC. The reason is simple: ILP is truly missing.
    Most code is written in a sequential manner and data dependencies are the norm.
    This is simply not true. ILP is limited to about 5 IPC on average. The majority of ILP is limited by branch/jump prediction.

    Just because instructions are sequential doesn't mean RAW dependencies exist.

  7. #132
    Banned
    Join Date
    Sep 2009
    Posts
    97
    I'm pretty sure Intel is already working on their bulldozer.

  8. #133
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    Quote Originally Posted by LesGrossman View Post
    I'm pretty sure Intel is already working on their bulldozer.
    If you mean their next big *new* design, then yes. It's called Haswell.

  9. #134
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Quote Originally Posted by informal View Post
    If you mean their next big *new* design, then yes. It's called Haswell.
    Sandy Bridge (Socket R) will be more than *new* enough for BD.

  10. #135
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    Quote Originally Posted by terrace215 View Post
    Sandy Bridge (Socket R) will be more than *new* enough for BD.
    We'll have to wait and see what kind of numbers both pull. With 16 BD cores, AMD is pretty much covered in the server segment. Also, I have no doubt SB will be fast.

  11. #136
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Chumbucket843 View Post
    This is simply not true. ILP is limited to about 5 IPC on average. The majority of ILP is limited by branch/jump prediction.

    Just because instructions are sequential doesn't mean RAW dependencies exist.
    Are you claiming current CPUs get 5 IPC on average?

    Somebody should tell the CPU designers that all the unbelievable lengths they went to in extracting ILP were in vain.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  12. #137
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Quote Originally Posted by informal View Post
    With 16 BD cores, AMD is pretty much covered in the server segment.
    The number of cores doesn't tell you much of anything, in itself.

    If it did, AMD would simply cut clocks more, make a "2x Magny Cours" part, and be "covered in the server segment".

  13. #138
    Xtreme Enthusiast
    Join Date
    Oct 2007
    Location
    Hong Kong
    Posts
    526
    Quote Originally Posted by terrace215 View Post
    Sandy Bridge (Socket R) will be more than *new* enough for BD.
    Adding memory channels is more than new?

  14. #139
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    Quote Originally Posted by terrace215 View Post
    The number of cores doesn't tell you much of anything, in itself.

    If it did, AMD would simply cut clocks more, make a "2x Magny Cours" part, and be "covered in the server segment".
    Except they can't scale that hypothetical monster past 2P if they do it your way (just one of the many cons if they do this). This way, we have 16 *improved* cores, with new power-gating features (a new turbo too), AVX support, and a core design that is fully modular, scalable, and die-area efficient. This design will also serve as the basis for a future server Fusion part, since the big shared FPU can be replaced by a future hybrid GPU part, as AMD themselves have stated.

  15. #140
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Quote Originally Posted by ajaidev View Post
    I would like to point out that Hyper-Threading does not really run two threads in parallel; it actually means that the OS is free to schedule two threads and run either one when the situation is right for whichever thread.

    E.g.:

    T1 and T2 running on CPU
    T1 and T2 scheduled to run by the OS on an HTT-enabled CPU
    T1 has the first run
    T1 uses R1, R2 AND R3
    R1, R2 and R3 locked by T1 since T1 is doing work
    T1 requests time on the CPU, leading to a timeout
    T2 gets CPU time and the resources it needs, like R1, R2 OR R3

    This is how HTT runs, simple as that.
    Erm, no, it isn't.

    At any point in time, T1 may be running in some execution units while AT THE SAME TIME, T2 is running in the others.

    See virtually any of Intel's papers on Nehalem Hyperthreading...

    http://software.intel.com/en-us/arti...ng-technology/

    Note the little diagram of the 4-wide Nehalem core.

    And this:

    The execution pipeline of processors based on Intel® Core™ microarchitecture is four instructions wide, meaning that it can execute up to four micro-operations per clock cycle. As shown in Figure 3, however, the software thread being executed often does not have four instructions eligible for simultaneous execution. Common reasons for fewer than four instructions per clock being retired include dependencies between the output of one instruction and the input of another, as well as latency waiting for data to be fetched from cache or memory.

    Intel HT Technology improves performance through increased instruction level parallelism by having two threads with independent instruction streams, eliminating data dependencies between threads and increasing utilization of the available execution units. This effect typically increases the number of instructions executed in a given amount of time within a core, as shown in Figure 3. The impact of this greater efficiency is experienced by users as higher throughput (since more work gets completed per clock cycle) and higher performance per watt (since fewer idle execution units consume power without contributing to performance). In addition, when one thread has a cache miss, branch mispredict, or any other pipeline stall, the other thread continues processing instructions at nearly the same rate as a single thread running on the core. Intel HT Technology augments other advanced architectural features, higher clock speeds, and additional cores with a capability that is relatively inexpensive in terms of space on the silicon and production cost.
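    To make that concrete, here is a deliberately oversimplified toy model (my own illustration with assumed numbers, not anything from Intel's article): a 4-wide core and threads that can each only find about 2 independent uops per cycle. A single thread leaves half the issue slots empty; a second thread's independent stream fills them:

        # Toy issue-slot model (illustrative assumptions, not real Nehalem scheduling)
        ISSUE_WIDTH = 4        # uops the core can issue per cycle
        PER_THREAD_ILP = 2     # independent uops a single thread can typically supply

        def uops_per_cycle(active_threads):
            # total throughput is capped by the core's issue width
            return min(ISSUE_WIDTH, active_threads * PER_THREAD_ILP)

        print(uops_per_cycle(1))   # 2 -> half the slots sit idle
        print(uops_per_cycle(2))   # 4 -> SMT fills the otherwise idle slots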

  16. #141
    Registered User
    Join Date
    Jul 2010
    Posts
    11
    Quote Originally Posted by Manicdan View Post
    Why would CPUs be built in such a way that they are never used to maximum capacity?
    Why are highways built in such a way that 80% of the time, half of their capacity is unused? You could halve the number of lanes in my local interstate highway, and for 140 hours a week, it would be perfectly fine. For the other ~30 hours though, you'll be very happy to have those 3rd and 4th lanes available for use! Engineers design to a maximum throughput number, but in the majority of circumstances there just won't be enough traffic to use 100% capacity.

    If algorithms were perfectly parallel, memory accesses were perfectly predictable, and compilers were perfect at scheduling every integer and floating point pipeline to have one computation per cycle, then it would be possible to build a CPU with perfect utilization. But we don't live in such a world, and never will. For the record, x86-64 processors tend to hover around an IPC of 0.8-1.5 on optimized benchmark code.

    IPC chart for selected SPEC 2006 components: http://www.marss86.org/index.php/File:Ipc_chart.png
    The above chart shows the accuracy of the MARSS simulator by comparing IPCs (instructions committed per cycle) obtained from simulation against the IPCs realized in executing the same benchmark programs on two real implementations, an AMD Athlon X2 and an Intel Core 2 Duo, for SPEC 2006 benchmarks. These IPC values are for user-space execution only and do not include any kernel simulation, because the kernel execution paths are different on the MARSS VM and on the test machines used to gather the above statistics. All the benchmarks are run from start to completion for the simulated runs and the runs on the real machines. We have used the Linux kernel's performance counters to get the IPCs realized on the actual hardware.
    Last edited by intangir; 07-20-2010 at 10:33 AM.

  17. #142
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by savantu View Post
    Are you claiming current CPUs get 5 IPC on average?

    Somebody should tell the CPU designers that all the unbelievable lengths they went to in extracting ILP were in vain.
    Nope. I am claiming that ILP is limited to around 5 IPC for most applications, assuming realistic branch prediction.

  18. #143
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Chumbucket843 View Post
    Nope. I am claiming that ILP is limited to around 5 IPC for most applications, assuming realistic branch prediction.
    Let me see if I get it right: you want to say that between two branches there are, on average, 5 instructions. If so, I agree, and this is consistent with what I've read in the literature.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  19. #144
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by qcmadness View Post
    Adding memory channels is more than new?
    ?? Is SB getting more memory channels? :P

    Anyway, there are enough innovations in it to make it a tock. And if some are confirmed, like the L3 latency, it will rock (not in the Sun Rock sense).
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  20. #145
    Xtreme Addict
    Join Date
    Jul 2007
    Posts
    1,488
    After thinking about it for a bit I realize that this IPC and ILP discussion might be taking the wrong direction in relation to BD. I'll elaborate more later but first a question: what's the average IPC of Nehalem?

  21. #146
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    ^^ I think I might do some experiments on that. Nothing fancy though.

    Quote Originally Posted by savantu View Post
    Let me see if I get it right: you want to say that between two branches there are, on average, 5 instructions. If so, I agree, and this is consistent with what I've read in the literature.
    I think you mean basic blocks, and no, that's not what I am saying. To reach 5 IPC you must predict the control flow instructions so that you can start working on the next block; otherwise your pipeline will be filled with NOPs. Control flow must be predicted correctly for ILP to reach 5 IPC. Increasing ILP within a block is a task for dynamic execution.
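    To put rough numbers on that (toy arithmetic only, reusing the 14-cycle redirect penalty and the ~5 instructions per block assumed earlier in the thread, on a 4-wide machine):

        # Toy model: blocks of ~5 instructions ending in a branch, 4-wide issue,
        # 14-cycle redirect penalty on every mispredicted branch (assumed numbers)
        def ipc(block_len=5, width=4, penalty=14, predict_rate=1.0):
            cycles_per_block = block_len / width + (1 - predict_rate) * penalty
            return block_len / cycles_per_block

        for rate in (1.0, 0.95, 0.90):
            print(rate, round(ipc(predict_rate=rate), 2))   # 4.0, 2.56, 1.89

    Even a few percent of mispredictions keeps you well away from the 4-5 IPC ceiling, which is the point about control flow above.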

  22. #147
    Xtreme Mentor
    Join Date
    May 2008
    Location
    cleveland ohio
    Posts
    2,879
    Can we get a die shot at least? Llano is the only 32nm die we've seen,
    no Ontario or Bulldozer yet.
    HAVE NO FEAR!
    "AMD fallen angel"
    Quote Originally Posted by Gamekiller View Post
    You didn't get the memo? 1 hour 'Fugger time' is equal to 12 hours of regular time.

  23. #148
    I am Xtreme
    Join Date
    Dec 2007
    Posts
    7,750
    Quote Originally Posted by intangir View Post
    Why are highways built in such a way that 80% of the time, half of their capacity is unused? You could halve the number of lanes in my local interstate highway, and for 140 hours a week, it would be perfectly fine. For the other ~30 hours though, you'll be very happy to have those 3rd and 4th lanes available for use! Engineers design to a maximum throughput number, but in the majority of circumstances there just won't be enough traffic to use 100% capacity.

    If algorithms were perfectly parallel, memory accesses were perfectly predictable, and compilers were perfect at scheduling every integer and floating point pipeline to have one computation per cycle, then it would be possible to build a CPU with perfect utilization. But we don't live in such a world, and never will. For the record, x86-64 processors tend to hover around an IPC of 0.8-1.5 on optimized benchmark code.

    IPC chart for selected SPEC 2006 components: http://www.marss86.org/index.php/File:Ipc_chart.png
    You missed the point. I understand "some of the time" or "not very often", but saying "never" is an absolute that was way too limiting. No one builds something for "never", but they do build things for expected bottlenecks.

  24. #149
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Solus Corvus View Post
    After thinking about it for a bit I realize that this IPC and ILP discussion might be taking the wrong direction in relation to BD. I'll elaborate more later but first a question: what's the average IPC of Nehalem?
    A bit higher than Core, and it depends on the app. David Kanter tested it on physics calculations and the best result was 2 IPC.

    http://www.realworldtech.com/page.cf...0510142143&p=3

    According to the author, anything above 1 is respectable. Chumbucket843 talks nonsense with his 5 IPC. Not even Itanium reaches that, and it's based on static compiling and the EPIC (Explicitly Parallel Instruction Computing) architecture. EPIC was designed to solve the ILP problem by doing optimizations at compile time and inserting hints into the code so the CPU knows what to do next.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  25. #150
    Registered User
    Join Date
    Jul 2010
    Posts
    11
    Quote Originally Posted by Particle View Post
    You don't appear to understand what is really going on yourself. There is only one execution unit. You can't have two threads with instructions that compete for the same resources executing on the same clock cycle in the same execution unit. That's the end of the story. HT is, as we've been claiming all along, just a way to maximize the utilization of the core's resources by scheduling work where there would normally be none being done (misses and whatnot). It does not magically let you execute two threads at the same time the way two real cores do.
    Nope! You seem to be assuming there is only one execution unit per core. When a microarchitect says execution unit, he means an ALU or FPU. Each core of any out-of-order x86 CPU contains multiple integer execution units and multiple floating-point execution units. For example, the first out-of-order x86 chip, the Pentium Pro, started with two ALUs and one FPU. While each individual execution unit can only be executing one instruction at a time, there are multiple per core, and a Nehalem core can easily schedule two instructions from two different threads on two of its execution units in the same cycle.

    The single-core Athlon XP had nine execution units.

    Huynh, Jack (2003). "The AMD Athlon XP Processor with 512KB L2 Cache": http://courses.ece.uiuc.edu/ece512/Papers/Athlon.pdf

    At the heart of QuantiSpeed architecture is a fully pipelined,
    nine-issue, superscalar processor core. The AMD Athlon XP processor
    provides a wider execution bandwidth of nine execution pipes when
    compared with competitive x86 processors with up to six execution
    pipes. The nine execution engines are comprised of three address
    calculation units, three integer units, and three floating-point units.
    One Nehalem core has a set of nine execution units, much the same as Core 2.

    1x ALU Shift
    1x ALU LEA
    1x ALU Shift Branch
    1x SSE Shuffle ALU
    1x SSE Mul
    1x SSE Shuffle ALU
    1x 128-bit FMUL FDIV
    1x 128-bit FADD
    1x 128-bit FP Shuffle

    Kanter, David (2008) "Inside Nehalem: Intel's Future Processor and System": http://www.realworldtech.com/page.cf...0208182719&p=6
    As with Core 2, the register alias table (RAT) points each architectural register into either the Re-Order Buffer (ROB) or the Retirement Register File (RRF) and holds the most recent speculative state (whereas the RRF holds the most recent non-speculative and committed state). The RAT can rename up to 4 uops each cycle, giving each one a destination register in the ROB. The renamed instructions then read their source operands and issue into the unified Reservation Station (RS), which is used by all instruction types.

    Any instructions in the RS which have all their operands ready are dispatched to the execution units, which are largely unchanged from the Core 2 and unaffected by SMT, except for an increase in utilization.
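    A crude way to picture that dispatch step (just a sketch with made-up entries, not Nehalem's actual logic): the RS is one shared pool, and anything whose operands are ready can go to a free unit in the same cycle, regardless of which thread it came from:

        # Minimal reservation-station sketch: entries are (op, thread, operands_ready)
        rs_entries = [("add", "T0", True), ("mul", "T1", True), ("load", "T0", False)]
        free_units = 3

        # dispatch any ready entry to a free execution unit, thread-agnostic
        dispatched = [e for e in rs_entries if e[2]][:free_units]
        print(dispatched)   # ops from both T0 and T1 dispatch in the same cycle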
