AMD's Bobcat and Bulldozer

**Manicdan** · 08-26-2010, 06:34 AM

If I ran AMD, I would redirect the company's effort toward building a low-cost, low-power, high-density, flash-based cloud server platform around Bobcat. Intel's Justin Rattner has admitted that for certain cloud workloads, these types of high-density solutions are superior to a monolithic server chip like Xeon. So AMD should stop obsessing over netbooks and monolithic server parts—both of these amount to fighting the last war—and just jump straight into the cloud server market that ARM is set to tackle with its upcoming Eagle part.

So the idea would be to use a massive array of Bobcat cores instead of a traditional 2P/4P BD system?

**informal** · 08-26-2010, 06:42 AM

Originally Posted by Dresdenboy

A quick and raw estimation of single threaded performance for Zambezi based on the 50% number given for Interlagos (just to show, what has to be counted in at the least):

Relative_perf_1_thread_to_AMD_fam_10h = (Perf_Magny_Cours*1.5 * 12 / 16) * Freq_ratio_of_half_#_of_Cores * Perf_boost_single_core_in_Module * Perf_boost_single_module_on_chip

Freq_ratio_of_half_#_of_Cores = 3.2/2.3 = 1.39
Perf_Magny_Cours = 1
Perf_boost_single_core_in_Module = 1.11 (while going from 90% back to 100%)
Perf_boost_single_module_on_chip = 1.3 (some cheap turbo)

Relative_perf_1_thread_to_AMD_fam_10h = (1 * 1.5 * 12/16) * 1.39 * 1.11 * 1.3 = 2.26

So with some frequency scaling a Zambezi core will be about 126% faster than a core running in a 2.3GHz MC without turbo. This would equal a 5.2GHz PhII core.

This is just speculation. Anyone is invited to check this.

That's a very interesting prediction. It looks like 25% (1.125 x 1.11) is the core-vs-core improvement (or what some like to call IPC), while the rest is the improvement in the starting clock speed and power-gated Turbo. In any case, 5.2GHz Phenom II-level speed out of the box with Zambezi, in single-threaded apps, is what some might call a leapfrog performance jump.
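The arithmetic in Dresdenboy's estimate is easy to re-check. A minimal sketch follows; the factor values are the ones quoted in the post, while the function and variable names are made up here for illustration, and none of the inputs are measured data.

```python
# Re-check of the single-thread estimate quoted above.
# All factors come from the quoted post; the names are illustrative only.

def relative_single_thread_perf(
    throughput_gain=1.5,    # Interlagos vs. Magny-Cours (the 50% number)
    old_cores=12,           # Magny-Cours cores
    new_cores=16,           # Interlagos cores
    freq_ratio=3.2 / 2.3,   # assumed Zambezi clock vs. a 2.3GHz MC
    solo_in_module=1.11,    # one thread per module (90% back to 100%)
    turbo=1.3,              # "some cheap turbo" with a single module active
):
    per_core_gain = throughput_gain * old_cores / new_cores
    return per_core_gain * freq_ratio * solo_in_module * turbo


estimate = relative_single_thread_perf()
core_vs_core = 1.5 * 12 / 16 * 1.11    # clock-independent part of the estimate
equivalent_phii_ghz = estimate * 2.3   # clock a 10h core would need to match

print(f"relative single-thread perf: {estimate:.2f}x")
print(f"core-vs-core part:           {core_vs_core:.2f}x")
print(f"equivalent 10h clock:        {equivalent_phii_ghz:.1f} GHz")
```

With the quoted inputs this reproduces the 2.26x, roughly 25%, and 5.2GHz figures, so the result stands or falls with the assumed 3.2GHz base clock and the 1.3x turbo factor.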

**madcho** · 08-26-2010, 06:44 AM

Originally Posted by =SOC= Admiral

I'm a gamer too, and currently I can only use one GPU for gaming as I have AMD, but another limitation is that none of my cards are of the same generation. I have 3 cards. I will most likely be selling my 8800 and will be getting the CIVE when it launches in September. Hopefully I can get my 1055T to 4.0GHz on that board. Just a question: where is the FSB located for AMD? Is it on the mobo or the CPU?

GTX470
GTX275
8800GTS 512

There has been no FSB on AMD CPUs since the first K8 Athlon. They use HyperTransport links. It's like QPI, but far more advanced.

**Mats** · 08-26-2010, 07:11 AM

Originally Posted by -Boris-

And power usage isn't the only limitation to high frequency. Even if you have the headroom power wise you can't just clock higher. If that were the case it would be possible to clock an Intel Atom to insane performance.

So even if you give one module four times as much power headroom, it will not necessarily improve turbo capabilities much more than just 50% more headroom would.

Wow. Can you be any more boring?
I just finished soldering 48 coils and field effect transistors to my Atom board and now this?

**64NOMIS** · 08-26-2010, 07:45 AM

Originally Posted by terrace215

They are optimizing for server application throughput, at the expense of client low-threaded performance. That might make sense for them, considering the initial target market is virtually all server/hpc. In client, they have Llano in the middle, and Ontario down low... so maybe they decided they couldn't be all things to all segments with BD.

The question is, where is the client user pain on low thread applications? And if there is some pain, is it better addressed by GPGPU?

My belief is that if you take the subset of applications which are low thread count, will not benefit from GPGPU, and cause user pain - you have an empty set. And if there remain a few app classes, BD IPC and frequency serve well.

With BD-based desktops, a discrete GPU is the most likely scenario, so there is an abundance of GPGPU and graphics performance to speed up low-thread apps.

I think we are talking less and less about trying to solve existing app class performance problems, and more about opening new app classes.

**Chumbucket843** · 08-26-2010, 08:17 AM

Originally Posted by -Boris-

Just to make some things clear.

Deeper Pipeline != Higher Frequency
Shorter Pipeline != Higher IPC

I have a feeling that people have read some P4 articles and made some conclusions of their own. To achieve high frequency a deep pipe can help, but there are reasons to put more stages in a processor other than frequency. And the other way around, more stages don't automatically lower IPC.

ummm... no. the concept of pipelining is to increase throughput at the cost of latency.

P4 is an extremely good example of how hyperpipelining exacerbates current problems such as branch prediction and cache misses.

simplistic logic would assume that doing more things in one clock cycle means higher ipc, thus pipelining increases ipc.

there are dependencies in between instructions.
http://en.wikipedia.org/wiki/Data_dependency

if you miss a branch you have to flush all n stages of the pipeline. the probability of a misprediction increases exponentially with pipeline length. this means doubling BP accuracy will increase clockspeed linearly. that's not ideal and a waste of xtor budget.

if you miss a cache line all dependent instructions have to wait for that result. OoO will only hide so much latency and it must keep the FIFO policy.

we have reached a point of diminishing returns for pipelining. at every level it makes things more complex, from architecture to circuit to layout.

it's obvious that amd knows this but saying intel did it wrong and amd did it right/better is a foolish way to look at it. a lot of decisions are based off of what the design team is good at. they are going to do things differently.

Originally Posted by -Boris-

Depending on the nature of the stages, more stages can actually improve IPC.

frequency depends on the nature of the stage. the instructions being executed are independent of the hardware.
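The trade-off being argued over can be put into a textbook first-order model: a mispredicted branch costs roughly one pipeline refill, so the stall term grows with depth unless predictor accuracy improves. The sketch below is only that model; every number in it is a made-up assumption, not data for P4, 10h, or Bulldozer.

```python
# First-order CPI model for branch misprediction cost.
# Assumption: a flush costs about as many cycles as the pipeline is deep.

def effective_ipc(base_ipc, branch_fraction, mispredict_rate, pipeline_depth):
    base_cpi = 1.0 / base_ipc
    flush_cpi = branch_fraction * mispredict_rate * pipeline_depth
    return 1.0 / (base_cpi + flush_cpi)


def relative_performance(ipc, clock_ghz):
    # Performance is IPC times clock, so a deeper pipe only wins
    # if the clock gain outweighs the IPC loss.
    return ipc * clock_ghz


# Hypothetical designs: same core width, different depth and predictor.
short_pipe = effective_ipc(2.0, 0.20, 0.05, pipeline_depth=12)
deep_pipe = effective_ipc(2.0, 0.20, 0.05, pipeline_depth=30)
deep_pipe_better_bp = effective_ipc(2.0, 0.20, 0.025, pipeline_depth=30)

print(f"short pipe IPC:              {short_pipe:.2f}")
print(f"deep pipe IPC:               {deep_pipe:.2f}")
print(f"deep pipe, half the misses:  {deep_pipe_better_bp:.2f}")
print(f"short pipe @ 3.0 GHz (rel.): {relative_performance(short_pipe, 3.0):.2f}")
print(f"deep pipe  @ 3.8 GHz (rel.): {relative_performance(deep_pipe, 3.8):.2f}")
```

In this model extra stages cost IPC only through the flush term, which fits both sides of the argument: stages that shorten the cycle raise the clock, and stages that improve prediction lower `mispredict_rate`.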

**~~terrace215~~** · 08-26-2010, 08:32 AM

Originally Posted by JF-AMD

Ssssshhhhhh! Don't tell terrace. He'll say "Single-threaded performance will have significant improvement"

I'd be more likely to point out that an increase in perf/W/mm^2 , be it single-thread or throughput is more interesting to AMD & its investors than it is to any potential users / customers.

Particularly when it coincides with a process shrink.

Make the same claim on a slide *without* the mm^2 part, and remove the item on your other slide that talks about minimizing single-threaded losses, and we'll talk again.

(Sometimes, engineering reveals things in their presentations that marketing can't fully spin.)

**informal** · 08-26-2010, 08:36 AM

Originally Posted by terrace215

Make the same claim on a slide *without* the mm^2 part, and remove the item on your other slide that talks about minimizing single-threaded losses, and we'll talk again.

The "minimizing single thread losses" item refers to the module design. When 2 threads run on a module, there is a 10% penalty compared to a single thread running on the module (this actually implies that single-thread performance will be better by those 10%). It has nothing to do with minimizing single-thread losses compared to the present design (aka 10h). They are simply stating that the integer core grouping in Bulldozer does not hinder much the "strength" of the single threads running in parallel on a module (10% is the penalty).
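The 10% figure can be read in both directions, which is where much of the confusion comes from. A small sketch, assuming the penalty applies equally to both threads on a module (the function names are hypothetical):

```python
SHARING_PENALTY = 0.10  # per-thread loss when both cores of a module are busy


def per_thread_perf(threads_on_module):
    """Per-thread performance relative to a thread running alone on a module."""
    if threads_on_module == 1:
        return 1.0
    if threads_on_module == 2:
        return 1.0 - SHARING_PENALTY
    raise ValueError("a module runs one or two threads")


def module_throughput(threads_on_module):
    return threads_on_module * per_thread_perf(threads_on_module)


# How much faster a lone thread is than one sharing the module.
solo_gain = per_thread_perf(1) / per_thread_perf(2)

print(f"module throughput, 2 threads: {module_throughput(2):.2f}x one thread")
print(f"solo thread vs. shared:       {solo_gain:.3f}x")
```

Read one way, two threads deliver 1.8x the throughput of one; read the other way, a thread that has the module to itself runs about 1.11x faster than a thread that shares it. Neither reading says anything about performance relative to 10h.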

**~~terrace215~~** · 08-26-2010, 08:40 AM

Originally Posted by mindfury

Single-threaded performance will have significant improvement.

As posted above, unfortunately the claim is only for an improvement in:

performance per watt per mm^2

Shrinking thuban from 45nm to 32nm would accomplish that easily.
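To see why a shrink alone moves this metric, hold performance and power fixed and scale only the area. The sketch assumes ideal area scaling with the square of the feature size, which real processes do not achieve, and uses normalized placeholder values rather than actual Thuban figures.

```python
def perf_per_watt_per_mm2(perf, watts, area_mm2):
    return perf / watts / area_mm2


def ideal_shrunk_area(area_mm2, old_nm, new_nm):
    # Assumption: area scales with the square of the feature size.
    return area_mm2 * (new_nm / old_nm) ** 2


# Normalized placeholders: same performance and power before and after.
before = perf_per_watt_per_mm2(perf=1.0, watts=1.0, area_mm2=1.0)
after = perf_per_watt_per_mm2(
    perf=1.0,
    watts=1.0,
    area_mm2=ideal_shrunk_area(1.0, old_nm=45, new_nm=32),
)
improvement = after / before

print(f"perf/W/mm^2 improvement from the shrink alone: {improvement:.2f}x")
```

Under these assumptions the metric nearly doubles with no change in delivered performance, which is why the "/mm^2" qualifier matters to the argument.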

**mindfury** · 08-26-2010, 08:45 AM

Originally Posted by terrace215

As posted above, unfortunately the claim is only for an improvement in:

performance per watt per mm^2

Shrinking thuban from 45nm to 32nm would accomplish that easily.

Single-threaded performance per mm^2?

How could you measure that?

Your understanding skills are ridiculous.

**informal** · 08-26-2010, 08:46 AM

Originally Posted by terrace215

As posted above, unfortunately the claim is only for an improvement in:

performance per watt per mm^2

Shrinking thuban from 45nm to 32nm would accomplish that easily.

Yeah, but that won't bring the IPC up as BD will do.
Check out Dresdenboy's post about the possible Zambezi performance.

**~~terrace215~~** · 08-26-2010, 08:47 AM

Originally Posted by informal

The "minimizing single thread losses" item refers to the module design. When 2 threads run on a module, there is a 10% penalty compared to a single thread running on the module (this actually implies that single-thread performance will be better by those 10%). It has nothing to do with minimizing single-thread losses compared to the present design (aka 10h). They are simply stating that the integer core grouping in Bulldozer does not hinder much the "strength" of the single threads running in parallel on a module (10% is the penalty).

The module design ALSO involved making the non-shared elements leaner (max 2 ALU ops in parallel) than the current cores in some aspects.

**~~terrace215~~** · 08-26-2010, 08:48 AM

Originally Posted by mindfury

Single-threaded performance per mm^2?

How could you measure that?

Um, measure the performance. Note the power used (for the per W) part. Note the core size used to run it. Compare to previous offering.

The "per mm^2" is from AMD's claim, btw, check the slide you posted.

**~~terrace215~~** · 08-26-2010, 08:51 AM

Originally Posted by informal

Check out Dresdenboy's post about the possible Zambezi performance.

There would be no point for anyone to choose Zambezi over octal Sandy, except price, possibly.

Also, JF admitted elsewhere that the 50% / 33% claim for Interlagos / MC has been substantially juiced by including, in the aggregate used to measure, serial workloads that benefit from the Turbo that MC lacked. While adding Turbo is a good thing, this means the "fully-parallel" throughput improvement is considerably less than 50%. In order to make sense of the claim, you're going to need to see exactly what they've chosen to average over, and at that point, you'll likely have access to simpler zambezi benchmarks of all sorts.
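The objection can be made concrete with a toy aggregate. Assuming the headline number is a weighted arithmetic mean over serial and fully parallel workloads (the actual averaging method is not given in the thread), the parallel-only gain can be backed out as below. The serial share and serial gain are invented placeholders.

```python
def parallel_gain(aggregate_gain, serial_share, serial_gain):
    """Solve aggregate = s * serial_gain + (1 - s) * parallel_gain."""
    if not 0.0 <= serial_share < 1.0:
        raise ValueError("serial_share must be in [0, 1)")
    return (aggregate_gain - serial_share * serial_gain) / (1.0 - serial_share)


# Hypothetical mix: 20% of the suite is serial and gains 1.9x from Turbo.
implied = parallel_gain(aggregate_gain=1.5, serial_share=0.2, serial_gain=1.9)

print(f"implied fully parallel gain: {implied:.2f}x")
```

The larger the serial share and the Turbo benefit, the further the fully parallel gain falls below the 1.5x headline; with no serial workloads in the mix the two are equal.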

**-Boris-** · 08-26-2010, 08:56 AM

Originally Posted by Chumbucket843

ummm... no. the concept of pipelining is to increase throughput at the cost of latency.

P4 is an extremely good example of how hyperpipelining exacerbates current problems such as branch prediction and cache misses.

simplistic logic would assume that doing more things in one clock cycle means higher ipc, thus pipelining increases ipc.

there are dependencies in between instructions.
http://en.wikipedia.org/wiki/Data_dependency

if you miss a branch you have to flush all n stages of the pipeline. the probability of a misprediction increases exponentially with pipeline length. this means doubling BP accuracy will increase clockspeed linearly. that's not ideal and a waste of xtor budget.

if you miss a cache line all dependent instructions have to wait for that result. OoO will only hide so much latency and it must keep the FIFO policy.

we have reached a point of diminishing returns for pipelining. at every level it makes things more complex, from architecture to circuit to layout.

it's obvious that amd knows this but saying intel did it wrong and amd did it right/better is a foolish way to look at it. a lot of decisions are based off of what the design team is good at. they are going to do things differently.

frequency depends on the nature of the stage. the instructions being executed are independent of the hardware.

Stages can be dedicated to branch prediction, those stages don't increase frequency, and if they wouldn't increase IPC they wouldn't be there.

**mindfury** · 08-26-2010, 08:58 AM

Originally Posted by terrace215

Um, measure the performance. Note the power used (for the per W) part. Note the core size used to run it. Compare to previous offering.

The core needs other parts on the chip to work properly, so single-threaded performance per mm^2 is total nonsense.

Originally Posted by terrace215

The "per mm^2" is from AMD's claim, btw, check the slide you posted.

That slide only said "Single-threaded performance", not "Single-threaded performance per mm^2".

**informal** · 08-26-2010, 08:58 AM

Originally Posted by terrace215

The module design ALSO involved making the non-shared elements leaner (max 2 ALU ops in parallel) than the current cores in some aspects.

First of all, AMD is not disclosing all the details of the integer core organization (those AGen units could also do some math ops, for example). Second, the ALUs in 10h were hindered in many ways (uops couldn't switch lanes), the 3rd ALU was there just for a possibly better OoO execution opportunity, there was an under-utilization problem, the 3rd AGU was redundant, as AT was told, etc. Now we have a leaner and meaner integer core that can easily be 20% faster than the previous 10h integer unit. This is an end effect that counts in all the new goodies AMD brought into the Bulldozer design, such as extremely improved prefetching, branch prediction, full OoO loads/stores, double L/S BW to the L1 cache, decoupled predict and fetch pipelines, branch fusion, a unified integer scheduler for math and address ops, shared L2, prediction-directed instruction prefetch, etc.

PS: Note the wording: "Throughput advantages for multi threaded workloads without significant losses on serial single-threaded workload components"
This clearly shows that they mean what I already wrote in my previous post: there is minimal loss for 2 threads (per thread) when both are being executed in parallel inside a module.

**Dimitriman** · 08-26-2010, 09:11 AM

Are we still wasting 2 pages trying to debunk what terrace picked up in a couple of AMD slides and has been using as an argument against BD for 5 months straight?

Good god..

I'm simply gobsmacked at JF-AMD's patience to keep posting here with so much trolling around.

**crazydiamond** · 08-26-2010, 09:29 AM

Just reading the thread, you would think it was called "101 factless reasons terrace thinks BD will suck".

**STaRGaZeR** · 08-26-2010, 09:33 AM

Originally Posted by Dresdenboy

So with some frequency scaling a Zambezi core will be about 126% faster than a core running in a 2.3GHz MC without turbo. This would equal a 5.2GHz PhII core.

I'd toss a few GHz more in there.

**superrugal** · 08-26-2010, 09:57 AM

Another diagram integrating the available information, made by Hiroshige Goto:

http://pc.watch.impress.co.jp/docs/c...27_389491.html

translated: http://translate.google.com/translat...ml&sl=ja&tl=en

**~~terrace215~~** · 08-26-2010, 10:15 AM

Originally Posted by mindfury

That slide only said "Single-threaded performance",not "Single-threaded performance per mm^2".

Erm, it's a sub-bullet of "Significant improvement in Performance/Watt/mm2"

You want to pretend that the sub-bullet is not related to its section title?

Well, in that case, there's no mention of "significant improvement" either.

BTW, why don't you write a letter to AMD, telling them that "Performance/Watt/mm2" does not make sense.

You realize AMD *has* estimations of single-thread IPC comparison for both integer and fp workloads at this point, relative to Phenom II, right?
As we're talking IPC, these don't depend on final clocks. So, if they wanted to, they could put out a bullet that says, for example: "estimated single-threaded IPC gains of 15-20% on integer workloads, compared to Ph-II."

Now JF will tell you that he "doesn't want his competitor to know this information" yet. Do you buy that? Both BD and SB core designs are long locked-down at this point.

I think he doesn't want YOU to know it, because the (largely client-oriented) fan base is going to be disappointed, and that would be a negative for AMD.

**mindfury** · 08-26-2010, 10:28 AM

Originally Posted by terrace215

Erm, it's a sub-bullet of "Significant improvement in Performance/Watt/mm2"

You want to pretend that the sub-bullet is not related to its section title?

Well, in that case, there's no mention of "significant improvement" either.

It is related to "Significant improvement", not "Performance/Watt/mm2". That's why they only mention "Performance" instead of "Performance/Watt/mm2".

Originally Posted by terrace215

BTW, why don't you write a letter to AMD, telling them that "Performance/Watt/mm2" does not make sense.

The whole chip's "Performance/Watt/mm2" makes sense, but single-threaded performance/mm2 doesn't make sense in a multicore chip.

Originally Posted by terrace215

You realize AMD *has* estimations of single-thread IPC comparison for both integer and fp workloads at this point, relative to Phenom II, right?
As we're talking IPC, these don't depend on final clocks. So, if they wanted to, they could put out a bullet that says, for example: "estimated single-threaded IPC gains of 15-20% on integer workloads, compared to Ph-II."

Now JF will tell you that he "doesn't want his competitor to know this information" yet. Do you buy that? Both BD and SB core designs are long locked-down at this point.

I think he doesn't want YOU to know it, because the (largely client-oriented) fan base is going to be disappointed, and that would be a negative for AMD.

Nonsense FUD.

**~~terrace215~~** · 08-26-2010, 10:49 AM

Originally Posted by mindfury

It is related to "Significant improvement", not "Performance/Watt/mm2"

I see, so you just pick the parts you like! Fun.

Originally Posted by mindfury

The whole chip's "Performance/Watt/mm2" makes sense, but single-threaded performance/mm2 doesn't make sense in a multicore chip.

I suggest you consider whole_chip(Perf/W/mm2)/#threads -- presumably you think THAT makes sense, as well, since we're just dividing by a constant for each part being compared. Now, is it so hard to see why you can talk about the perf/W/mm2 for a single-threaded workload?

Originally Posted by mindfury

Nonsense FUD.

Ok, just don't get mad later on...

**Mats** · 08-26-2010, 11:00 AM

Originally Posted by terrace215

I see, so you just pick the parts you like!

So do you.
