AMD's Bobcat and Bulldozer

**kl0012** · 08-25-2010, 09:12 AM

Originally Posted by Hornet331

well x86 quite sucks at ipc.. 1.5 is a good value.

That depends less on the ISA and more on the actual hardware implementation, and even more on software code quality. There are many effective techniques for optimizing code for OoO architectures (such as loop unrolling, etc.).

Originally Posted by informal

I know it's hard to believe, but Hornet is correct. The average IPC in the SPEC2006 test suite is around 1. This is on a Core 2 class chip, which is 4(+1) wide.

It seems they counted only arithmetic instructions (IPC < 1 does not make sense). Also, ALU utilization may vary greatly between different parts of the code.

**~~terrace215~~** · 08-25-2010, 09:15 AM

Originally Posted by informal

You are basing this on Paul DeMone's comment?

You might also consider AMD's very own slide showing the relative improvement of client / server / hpc.

Client is only 1/2 of server, and 1/3rd of hpc. They've been quite open about BD's emphasis. Only when people start asking about single-threaded performance, etc, do they get defensive and start claiming that's going to be just wonderful, too.

**Manicdan** · 08-25-2010, 09:17 AM

i cant wait for AMD to do sub 10s superpi 1M runs on air

o wait, i dont give a crap....

**informal** · 08-25-2010, 09:17 AM

Originally Posted by kl0012

It seems they counted only arithmetic instructions (IPC < 1 does not make sense). Also, ALU utilization may vary greatly between different parts of the code.

They counted address and math instructions:

Figure 2.2(a) and Figure 2.2(b) represent the instruction profile of CPU2006 and CPU2000 respectively. It is evident from the figure that a very high percentage of instructions retired consist of loads and stores. CPU2006 benchmarks like h264ref, hmmer, bwaves, leslie3d and GemsFDTD have comparatively high percentage of loads while astar, bzip2, gcc, gobmk, libquantum, mcf, omnetpp, perlbench, sjeng, xalancbmk and gamess have high percentage of branch instructions. On the contrary CPU2000 benchmarks like gap, parser, vortex, applu, equake, fma3d, mgrid and swim have comparatively high percentage of loads while almost all integer programs have high percentage of branch instructions.

You can see that it was never higher than 1.8x across the whole range of SPEC suite applications. The point is that loads and stores make up a big part of the instruction mix.

Originally Posted by terrace215

You might also consider AMD's very own slide showing the relative improvement of client / server / hpc.

Client is only 1/2 of server, and 1/3rd of hpc. They've been quite open about BD's emphasis. Only when people start asking about single-threaded performance, etc, do they get defensive and start claiming that's going to be just wonderful, too.

What slides? The ones from 2007 that talked about the BD version that got delayed in order to be reworked and improved?

**JF-AMD** · 08-25-2010, 09:46 AM

Originally Posted by terrace215

They are optimizing for server application throughput, at the expense of client low-threaded performance.

This is simply not true. No matter how many times you say it.

**[XC] Synthetickiller** · 08-25-2010, 09:53 AM

12 pages of discussion and I understand maybe half of it.

So my AM3 board will not take Bulldozer. Glad I only invested in a low end board anyways.

I hope this'll be like Athlon & Pentium III or the dual-core races all over again. I need an upgrade to my mini-itx already.

**superrugal** · 08-25-2010, 09:55 AM

Originally Posted by terrace215

They are optimizing for server application throughput, at the expense of client low-threaded performance. That might make sense for them, considering the initial target market is virtually all server/hpc. In client, they have Llano in the middle, and Ontario down low... so maybe they decided they couldn't be all things to all segments with BD.

If I only read the bold part, I would think you are describing Nehalem.

**accord99** · 08-25-2010, 09:58 AM

Originally Posted by superrugal

If I only read the bold part, I would think you are describing Nehalem.

But Nehalem-based CPUs have the highest throughput and highest client performance. Magny Cours on the other hand...

**~~terrace215~~** · 08-25-2010, 10:25 AM

Originally Posted by JF-AMD

This is simply not true. No matter how many times you say it.

Let's say "at the relative expense of", then.

Unless you are contradicting your own company's slide? From the 2007 tech analyst day.

Client perf/W up X, server up 2X, HPC up 3 TO 4X

Clearly, design choices were made to favor server throughput improvements over client improvements, or else the little bar graph lines wouldn't be this way.

I guess we'll be able to see a full desktop evaluation in about... oh, 15 months, unless you guys release some perf data early.

**Mats** · 08-25-2010, 10:32 AM

Originally Posted by terrace215

Unless you are contradicting your own company's slide? From the 2007 tech analyst day.

Nothing wrong with that, since it's for the 45 nm BD that never showed up.

Originally Posted by informal

What slides? The ones from 2007 that talked about the BD version that got delayed in order to be reworked and improved?

**Manicdan** · 08-25-2010, 11:00 AM

Originally Posted by terrace215

Let's say "at the relative expense of", then.

Unless you are contradicting your own company's slide? From the 2007 tech analyst day.

Client perf/W up X, server up 2X, HPC up 3 TO 4X

Clearly, design choices were made to favor server throughput improvements over client improvements

i think the only math you know is:

single threaded perf per watt went up X
multi threaded perf per watt went up Y
0<X<Y

**~~terrace215~~** · 08-25-2010, 11:05 AM

Originally Posted by Mats

Nothing wrong with that, since it's for the 45 nm BD that never showed up.

Somehow, I doubt the whole design philosophy changed since then.

**qcmadness** · 08-25-2010, 11:08 AM

Originally Posted by terrace215

Somehow, I doubt the whole design philosophy changed since then.

You always doubt AMD's methodology, sales, technical details, release dates and... etc

**~~terrace215~~** · 08-25-2010, 11:08 AM

Originally Posted by Manicdan

i think the only math you know is:

single threaded perf per watt went up X
multi threaded perf per watt went up Y
0<X<Y

The old "those arrows weren't meant to imply anything specific" thing again?

I think they are qualitatively accurate...we'll have to wait, and wait, and wait... to see for certain..

**informal** · 08-25-2010, 11:10 AM

Yeah, 3 years from now when Bulldozer launches, right?

**Chumbucket843** · 08-25-2010, 11:22 AM

Originally Posted by Hornet331

well x86 quite sucks at ipc.. 1.5 is a good value.

IPC isn't the same across different architectures. for example a single SSE instruction (mulps) can do 4 multiplies on 32-bit floating point numbers. fmul can do only one. yes, sse is explicitly data parallel but that is part of the weakness of ipc measurements.
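To put rough numbers on that point, here is a minimal sketch. It assumes a hypothetical workload of 1024 float multiplies and, purely for illustration, that every instruction retires in one cycle; the function name and the figures are made up, not measurements from any real CPU.

```python
# Hypothetical workload: 1024 independent 32-bit float multiplies.
# Assumption (illustrative only): each instruction retires in one cycle.
ELEMENTS = 1024

def ipc_and_throughput(lanes, cycles_per_instr=1.0):
    """Return (IPC, multiplies per cycle) when each instruction does `lanes` multiplies."""
    instructions = ELEMENTS // lanes
    cycles = instructions * cycles_per_instr
    ipc = instructions / cycles
    mults_per_cycle = ELEMENTS / cycles
    return ipc, mults_per_cycle

scalar = ipc_and_throughput(lanes=1)  # fmul-style: one multiply per instruction
packed = ipc_and_throughput(lanes=4)  # mulps-style: four multiplies per instruction
print("scalar (IPC, mult/cycle):", scalar)
print("packed (IPC, mult/cycle):", packed)
```

Under these assumptions both streams report the same IPC, while the packed one does four times the work per cycle, which is the blind spot in IPC comparisons described above.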

a better example would be a sine function. you can use the taylor series to get a good estimate. modern x86 CPUs take ~40-100 cycles to execute the fsin instruction.

taylor series approximation:
x - (x^3)/3! + (x^5)/5! - (x^7)/7!

2 subtractions
30 multiplies
3 divide
1 add

36 arithmetic operations on a RISC processor are equal to 1 (very slow) instruction in x86. this is a select case. normally risc uses 30% more code space.

this algorithm has room for improvement actually. we can store the value of x to a power and save many redundant multiplications with a lookup table. i.e. compute x^3 then multiply by x^2 or add the exponents. eventually algebra will give you a nice shortcut.
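One well-known form of that algebraic shortcut is Horner's rule applied in x^2. The sketch below is an illustration rather than what any particular math library does; `sin_taylor_horner` is a made-up name, and the 1/n! coefficients are assumed to be precomputed constants.

```python
import math

def sin_taylor_horner(x):
    """4-term Taylor approximation of sin(x), evaluated with Horner's rule in x^2."""
    x2 = x * x  # 1 multiply
    # 1/3!, 1/5!, 1/7! are constants, assumed precomputed (no runtime divides).
    c3, c5, c7 = 1.0 / 6.0, 1.0 / 120.0, 1.0 / 5040.0
    # x - x^3/3! + x^5/5! - x^7/7! == x * (1 - x2*(c3 - x2*(c5 - x2*c7)))
    return x * (1.0 - x2 * (c3 - x2 * (c5 - x2 * c7)))  # 4 multiplies, 3 subtractions

print(sin_taylor_horner(0.5), math.sin(0.5))
```

Written this way the whole polynomial needs 5 multiplies and 3 subtractions, with no divides at run time.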

**Dresdenboy** · 08-25-2010, 11:25 AM

Originally Posted by terrace215

Somehow, I doubt the whole design philosophy changed since then.

The key people changed, esp. Chuck Moore. So why shouldn't the design philosophy change with them? Chuck even talked about improved design philosophies in some of his older presentations.

**madcho** · 08-25-2010, 11:33 AM

Originally Posted by Chumbucket843

IPC isn't the same across different architectures. for example a single SSE instruction (mulps) can do 4 multiplies on 32-bit floating point numbers. fmul can do only one. yes, sse is explicitly data parallel but that is part of the weakness of ipc measurements.

a better example would be a sine function. you can use the taylor series to get a good estimate. modern x86 CPUs take ~40-100 cycles to execute the fsin instruction.

taylor series approximation:
x - (x^3)/3! + (x^5)/5! - (x^7)/7!

2 subtractions
30 multiplies
3 divide
1 add

36 arithmetic operations on a RISC processor are equal to 1 (very slow) instruction in x86. this is a select case. normally risc uses 30% more code space.

this algorithm has room for improvement actually. we can store the value of x to a power and save many redundant multiplications with a lookup table. i.e. compute x^3 then multiply by x^2 or add the exponents. eventually algebra will give you a nice shortcut.

Yeah, but an approximation means you don't get the exact result of the function, so it's a mistake to use it.
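How far off the 4-term series actually is depends heavily on the size of x. A quick way to check is to compare it against a library sine, as in the sketch below (`sin_taylor4` is a hypothetical helper name, and Python's `math.sin` is used as the reference, itself a finite-precision result):

```python
import math

def sin_taylor4(x):
    """4-term Taylor series for sin(x) around 0, written out directly."""
    return x - x**3 / 6.0 + x**5 / 120.0 - x**7 / 5040.0

# Print the absolute error at a few sample points.
for x in (0.1, 0.5, 1.0, math.pi / 2, math.pi):
    err = abs(sin_taylor4(x) - math.sin(x))
    print(f"x = {x:.4f}  abs error = {err:.2e}")
```

The error is tiny near zero and grows quickly toward pi, which is why practical implementations usually reduce the argument to a small range first and use more terms or a tuned polynomial.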

---------

About BD at Hot Chips: what was actually said there? We have the slides now, but did someone just read them out, or did someone talk about BD at the same time?

No other information ?

**-Boris-** · 08-25-2010, 11:34 AM

Originally Posted by terrace215

Let's say "at the relative expense of", then.

Unless you are contradicting your own company's slide? From the 2007 tech analyst day.

Client perf/W up X, server up 2X, HPC up 3 TO 4X

Clearly, design choices were made to favor server throughput improvements over client improvements, or else the little bar graph lines wouldn't be this way.

I guess we'll be able to see a full desktop evaluation in about... oh, 15 months, unless you guys release some perf data early.

Since then we have got MCM on the server side. That's not something you do on the client side. And the fact that servers are faster per socket is absolutely not the same thing as servers being faster at the client's expense.
You know, you can boost one without crippling the other.

And a Bulldozer module running one thread gives that thread far more resources than each thread gets when the module runs two. The modular approach boosts single-thread performance more than multi-thread performance. The advantage for multi-thread performance is less die space.

**SEA** · 08-25-2010, 11:35 AM

Originally Posted by Chumbucket843

x - (x^3)/3! + (x^5)/5! - (x^7)/7!
2 subtractions
30 multiplies
3 divide
1 add

It could be counted as 9 multiplies, actually:
with intermediate results a=x^2 and b=x^5:
x - (x*x*x)* (1/3!) + (b=((a=x*x)*a*x)) * (1/5!) - (a*b) * (1/7!)
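The count can be checked mechanically. The sketch below follows the grouping in the expression above and routes every multiplication through a counting helper; the names `sin_taylor_sea` and `mul` are made up for the example, and 1/3!, 1/5! and 1/7! are assumed to be precomputed constants.

```python
import math

def sin_taylor_sea(x):
    """Evaluate x - x^3/3! + x^5/5! - x^7/7! with the grouping above.

    Returns (value, number_of_multiplies).
    """
    count = 0

    def mul(p, q):
        # Counting wrapper around a single multiplication.
        nonlocal count
        count += 1
        return p * q

    inv3, inv5, inv7 = 1 / 6, 1 / 120, 1 / 5040  # 1/3!, 1/5!, 1/7! (precomputed)
    t3 = mul(mul(mul(x, x), x), inv3)  # (x*x*x) * (1/3!): 3 multiplies
    a = mul(x, x)                      # a = x^2: 1 multiply
    b = mul(mul(a, a), x)              # b = x^5: 2 multiplies
    t5 = mul(b, inv5)                  # b * (1/5!): 1 multiply
    t7 = mul(mul(a, b), inv7)          # (a*b) * (1/7!): 2 multiplies
    return x - t3 + t5 - t7, count

value, multiplies = sin_taylor_sea(0.5)
print(value, math.sin(0.5), multiplies)
```

Reusing `a` for the x^3 term as well would save one more multiply, at the cost of departing from the expression as written.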

**xVeinx** · 08-25-2010, 11:45 AM

Originally Posted by terrace215

Let's say "at the relative expense of", then.

Unless you are contradicting your own company's slide? From the 2007 tech analyst day.

Client perf/W up X, server up 2X, HPC up 3 TO 4X

Clearly, design choices were made to favor server throughput improvements over client improvements, or else the little bar graph lines wouldn't be this way.

I guess we'll be able to see a full desktop evaluation in about... oh, 15 months, unless you guys release some perf data early.

It isn't a zero-sum game.

**nn_step** · 08-25-2010, 11:57 AM

surprisingly, they didn't include any more details about decode.

**~~terrace215~~** · 08-25-2010, 12:09 PM

Originally Posted by xVeinx

It isn't a zero-sum game.

Hence "relative". However, there are trade-offs.

**FlanK3r** · 08-25-2010, 12:38 PM

terrace love only Sandy bridge

**richierich** · 08-25-2010, 12:41 PM

Originally Posted by FlanK3r

terrace love only Intel

fixed.
