AMD's Bobcat and Bulldozer

**Manicdan** · 08-28-2010, 01:35 PM

Originally Posted by MAS

http://blogs.amd.com/work/2010/08/23...ge-4/#comments

AMD guy promised bulldozer review next week (see comments)

Originally Posted by generics_user

i think you REALLY misunderstood his post

Round 2 of the questions on Bulldozer are UNDER legal review to be posted; there will be no product review only more questions on Bulldozer

Originally Posted by superrugal

He is definitely saying the "question round 2" not bulldozer review.

you guys should check who posted that, it's none other than me

the BD 20 questions are all broken up sections, so we get the next set answered very soon it seems

**blindbox** · 08-28-2010, 01:56 PM

Since we're comparing with Sandy Bridge here guys... is there even an 8-core counterpart for SB? You make it sound like it's easy to fit 8 cores. AMD just did, and even then they went the module route to reduce die area.

**~~terrace215~~** · 08-28-2010, 04:05 PM

Mitch Alsup says that when he left AMD, Bulldozer was looking at a performance *decrease* of 5% from the microarch-slimming, together with a hoped-for 20-25% frequency increase from the pipeline-lengthening.

Even assuming *perfect* perf scaling with clock, that's 15-20% increase over Ph-II.

When I left, BD was supposed to be 20-25% faster frequency wise, and
loose a little architectural figure (5%-ish) of merit due to the
microarchitecture.

http://groups.google.de/group/comp.a...14f6049?hl=de#

So they are really counting on speed-racer to bring the performance increase.
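For reference, the arithmetic behind that estimate can be sketched in a few lines of Python. This is a back-of-the-envelope model, assuming performance is simply per-clock throughput times frequency and scales perfectly with clock; the function name `combined_speedup` is made up for illustration. Multiplying the two factors gives slightly less than adding the percentages.

```python
def combined_speedup(ipc_change, clock_change):
    """Relative performance when per-clock work and clock change by the given fractions.

    Assumes performance = IPC * clock, i.e. perfect scaling with frequency.
    """
    return (1.0 + ipc_change) * (1.0 + clock_change)


# Figures quoted above: about -5% per clock, +20% to +25% in frequency.
low = combined_speedup(-0.05, 0.20)
high = combined_speedup(-0.05, 0.25)
print(f"net gain: {100 * (low - 1):.1f}% to {100 * (high - 1):.1f}%")
```

Under those assumptions the net gain comes out at roughly 14% to 19%, a touch under the 15-20% you get by just subtracting 5 from the clock figures.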

**Chumbucket843** · 08-28-2010, 04:17 PM

my guess would be 17 stages. a speed racer on a modern process is arguably going to be more efficient than a brainiac, as long as you don't go over the top with pipelining. increasing IPC runs into much worse diminishing returns, multicore aside.

**haylui** · 08-28-2010, 04:24 PM

Originally Posted by terrace215

Mitch Alsup says that when he left AMD, Bulldozer was looking at a performance *decrease* of 5% from the microarch-slimming, together with a hoped-for 20-25% frequency increase from the pipeline-lengthening.

Even assuming *perfect* perf scaling with clock, that's 15-20% increase over Ph-II.

http://groups.google.de/group/comp.a...14f6049?hl=de#

So they are really counting on speed-racer to bring the performance increase.

When did Mitch Alsup leave AMD?

**~~terrace215~~** · 08-28-2010, 04:38 PM

More interesting bits on the pipeline changes:

Most of what got cut was cut to enable the 12-gate pipe (if indeed
they did achieve that.) In Athlon/Opteron, one can forward a byte,
word, double, or quad from any of the 5 results to any operand of any
6 integer computation units {ALU, AGU}. If BD can't (or couldn't when
I left) forward anything to anywhere, and eats a little AFoM because
of this. This probably saved 2 real gate delays. Lopping off the extra
ALU, and a few other things saves another gate and we are then within
spitting distance (1-gate) of the desired 12-gate pipe in the integer
pipe. More lopping occured in the L1cache pipe to reach the cycle time
goal.

**Metroid** · 08-28-2010, 05:36 PM

Bobcat? come on AMD, you had better names.

**nn_step** · 08-28-2010, 05:47 PM

Why is no one else wondering about Bulldozer's Decode details?

**god_43** · 08-28-2010, 06:09 PM

Originally Posted by nn_step

Why is no one else wondering about Bulldozer's Decode details?

my understanding of cpu arch is still limited compared to most others here. why "should" we be concerned with the decode unit?

**informal** · 08-28-2010, 06:53 PM

He left right around when BD 1 was canceled (end of 2007) and BD 2, the one that is coming out 2 years later, was starting to take shape.

**danielkza** · 08-28-2010, 08:16 PM

Originally Posted by god_43

my understanding of cpu arch is still limited compared to most others here. why "should" we be concerned with the decode unit?

Because BD needs to feed all those integer cores somehow. We know prefetching is getting improvements, but decoding needs to improve as well to keep up with everything else.

**nn_step** · 08-28-2010, 09:00 PM

Originally Posted by god_43

my understanding of cpu arch is still limited compared to most others here. why "should" we be concerned with the decode unit?

The decode rate determines execution unit utilization

**god_43** · 08-28-2010, 09:21 PM

Originally Posted by danielkza

Because BD needs to feed all those integer cores somehow. We know prefetching is getting improvements, but decoding needs to improve as well to keep up with everything else.

Originally Posted by nn_step

The decode rate determines execution unit utilization

thank you both, i understand now...that does seem worth "dissecting"!

**~~terrace215~~** · 08-28-2010, 10:10 PM

Originally Posted by nn_step

Why is no one else wondering about Bulldozer's Decode details?

It has 4-wide decoding feeding 2 cores....

**~~terrace215~~** · 08-28-2010, 10:19 PM

Originally Posted by informal

He left right around when BD 1 was canceled (end of 2007) and BD 2, the one that is coming out 2 years later, was starting to take shape.

From design to launch, a modern cpu arch is about a 5 year cycle.

**nn_step** · 08-28-2010, 10:28 PM

Originally Posted by terrace215

It has 4-wide decoding feeding 2 cores....

But how is the 4-wide partitioned, and how many bytes does it read per clock?

**madcho** · 08-28-2010, 11:06 PM

Originally Posted by nn_step

But how is the 4-wide partitioned, and how many bytes does it read per clock?

It's 32bits Fetch

**informal** · 08-29-2010, 03:04 AM

Originally Posted by terrace215

From design to launch, a modern cpu arch is about a 5 year cycle.

BD 2 was not new all around, since it's naturally based on the BD 1 version that was supposed to come out at 45nm. I suspect that, as in the case of Barcelona, they were power limited at 45nm and performance was not where they wanted it. So they went with an improved core, done on a smaller node, and delayed it 2 years (2009->2011). This gives them more room for improvements at the core level and more clocks, all within the same power envelope. I expect 15-20% in core-level improvement + 30% in clocks.

Originally Posted by nn_step

The decode rate determines execution unit utilization

It's a 4+1 (branch fusion supported) decoder at the front end, with a so-called "accelerate mode" if certain conditions are met. AMD is not disclosing anything about this particular feature, but essentially it increases the decode rate by some unknown factor.

**JF-AMD** · 08-29-2010, 03:34 AM

Originally Posted by danielkza

Because BD needs to feed all those integer cores somehow. We know prefetching is getting improvements, but decoding needs to improve as well to keep up with everything else.

So, not to talk out of school, but I did ask one of our design engineers about the ability of the shared front end to keep two integer cores fed and he had absolutely no concern because of things that are done to improve the front end.

Can't say any more beyond that because a.) it is not public info and b.) I don't really know enough about how those things work to accurately describe them.

In my mind this is not a concern of the engineering team. After all it is a completely new design. If they had taken the front end off of an existing product it might be more of an issue, but as I understand it, that has not happened.

**god_43** · 08-29-2010, 06:56 AM

Originally Posted by JF-AMD

So, not to talk out of school, but I did ask one of our design engineers about the ability of the shared front end to keep two integer cores fed and he had absolutely no concern because of things that are done to improve the front end.

Can't say any more beyond that because a.) it is not public info and b.) I don't really know enough about how those things work to accurately describe them.

In my mind this is not a concern of the engineering team. After all it is a completely new design. If they had taken the front end off of an existing product it might be more of an issue, but as I understand it, that has not happened.

thanks JF!

**nn_step** · 08-29-2010, 07:03 AM

Originally Posted by madcho

It's 32bits Fetch

Given that Phenom has a 32 BYTE pick buffer and a 408bit fetch, I see that as highly unlikely.

Added to the fact that Bobcat has a 22 byte decode

Originally Posted by informal

BD 2 was not new all around, since it's naturally based on the BD 1 version that was supposed to come out at 45nm. I suspect that, as in the case of Barcelona, they were power limited at 45nm and performance was not where they wanted it. So they went with an improved core, done on a smaller node, and delayed it 2 years (2009->2011). This gives them more room for improvements at the core level and more clocks, all within the same power envelope. I expect 15-20% in core-level improvement + 30% in clocks.

It's a 4+1 (branch fusion supported) decoder at the front end, with a so-called "accelerate mode" if certain conditions are met. AMD is not disclosing anything about this particular feature, but essentially it increases the decode rate by some unknown factor.

But without more details, optimizing the decode rate is impossible.

For example, can a single thread take up the entire decode unit for a couple clock cycles if the other thread is sleeping?

Originally Posted by JF-AMD

So, not to talk out of school, but I did ask one of our design engineers about the ability of the shared front end to keep two integer cores fed and he had absolutely no concern because of things that are done to improve the front end.

Can't say any more beyond that because a.) it is not public info and b.) I don't really know enough about how those things work to accurately describe them.

In my mind this is not a concern of the engineering team. After all it is a completely new design. If they had taken the front end off of an existing product it might be more of an issue, but as I understand it, that has not happened.

Could you find out whether the threads share a pick buffer or each gets its own?
And in either case, what size(s)?

**Mechromancer** · 08-29-2010, 07:04 AM

Originally Posted by JF-AMD

So, not to talk out of school, but I did ask one of our design engineers about the ability of the shared front end to keep two integer cores fed and he had absolutely no concern because of things that are done to improve the front end.

Can't say any more beyond that because a.) it is not public info and b.) I don't really know enough about how those things work to accurately describe them.

In my mind this is not a concern of the engineering team. After all it is a completely new design. If they had taken the front end off of an existing product it might be more of an issue, but as I understand it, that has not happened.

This almost seems like a single module may possibly use both integer units along with the FPU when executing a single thread. If this is the case, single threaded performance on BD will not be a weak point at all.

I remember that old marketing slide saying BD would have the highest single threaded performance ever. It better be true dammit.

**madcho** · 08-29-2010, 07:05 AM

Originally Posted by nn_step

Given that Phenom has a 32 BYTE pick buffer and a 408bit fetch, I see that as highly unlikely.

Added to the fact that Bobcat has a 22 byte decode

But without more details, optimizing the decode rate is impossible.

For example, can a single thread take up the entire decode unit for a couple clock cycles if the other thread is sleeping?

Could you find out whether the threads share a pick buffer or each gets its own?
And in either case, what size(s)?

Damm I believed it was 32bits ... OMG BYTES !!! lol

**informal** · 08-29-2010, 07:21 AM

Originally Posted by nn_step

But without more details, optimizing the decode rate is impossible.
For example, can a single thread take up the entire decode unit for a couple clock cycles if the other thread is sleeping?

A single thread can occupy all the shared resources in the module. The decoder and the whole front end, with the extra beefed-up prefetch, are shared. The FPU is shared.

Originally Posted by Mechromancer

This almost seems like a single module may possibly use both integer units along with the FPU when executing a single thread. If this is the case, single threaded performance on BD will not be a weak point at all.

I remember that old marketing slide saying BD would have the highest single threaded performance ever. It better be true dammit.

Integer cores can't "combine" to work on a single integer thread, but one integer core can use the whole FPU to itself. Also, one FPU can be used a la SMT by 2 integer cores. What is shared in the module can be used by the integer core(s).
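To make the decoder sharing concrete, here is a toy Python model of one decoder serving the threads in a module. It is purely illustrative: the even split when both threads are active is an assumption made for the example, not a disclosed detail of how BD arbitrates, and `decode_slots_per_thread` is a made-up name. The only figures taken from this thread are the 4-wide decode and the two integer cores per module.

```python
def decode_slots_per_thread(decode_width, active_threads):
    """Average decode slots per cycle available to each active thread.

    Assumption: active threads split the decoder evenly over time, and a
    lone thread gets the whole decoder to itself.
    """
    if active_threads < 1:
        raise ValueError("need at least one active thread")
    return decode_width / active_threads


DECODE_WIDTH = 4  # "4-wide decoding feeding 2 cores"

for threads in (1, 2):
    slots = decode_slots_per_thread(DECODE_WIDTH, threads)
    print(f"{threads} active thread(s): {slots} decode slots/cycle each")
```

In this simplified picture a lone thread sees the full 4-wide decode, while two busy threads average 2 slots per cycle each; whatever "accelerate mode" does would change those numbers by an unknown amount.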

**~~terrace215~~** · 08-29-2010, 08:56 AM

Originally Posted by informal

.I expect 15-20% in core level improvement + 30% in clocks.

And AMD's ex-Chief Architect expects 5% in core level perf/clock LOSSES + 20-25% in clocks.

I wonder who will turn out to be closer?
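Taking both sets of numbers at face value, the gap is easy to put in figures. A rough Python sketch, assuming performance is just per-clock throughput times frequency; the helper name `net_gain_pct` and the dictionary labels are illustrative only.

```python
def net_gain_pct(core_change, clock_change):
    """Net performance change in percent, assuming perf = per-clock work * clock."""
    return ((1.0 + core_change) * (1.0 + clock_change) - 1.0) * 100.0


# (core-level change, clock change) pairs taken from the posts in this thread.
estimates = {
    "informal": [(0.15, 0.30), (0.20, 0.30)],
    "Alsup (as quoted)": [(-0.05, 0.20), (-0.05, 0.25)],
}

for who, cases in estimates.items():
    gains = [net_gain_pct(core, clock) for core, clock in cases]
    print(f"{who}: {min(gains):.1f}% to {max(gains):.1f}%")
```

That works out to roughly a 50-56% gain on one side versus roughly 14-19% on the other, so the two predictions are not close.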
