As quoted by LowRun: "So, we are one week past AMD's worst-case scenario for BD's availability, but they don't feel like communicating about the delay. I suppose AMD must be removed from the reliable-sources list for AMD's product launch dates."
That's pretty much what he has said countless times already.
http://www.xtremesystems.org/forums/...&postcount=602
I don't have any reason to be in building 400. And it is better off that the marketing guy is not "dropping in" on them.
Our performance engineering team has done a really accurate job on performance modeling in the past; I have no reason to doubt them. Generally the worst that we see is too much conservatism, not too much optimism.
OK, so let me get the gist of all of this whole thread down to two statements:
1. People are claiming Bulldozer will be slower than existing products because they are sharing resources in the processor and sharing is inherently worse.
2. People are claiming that even though Bulldozer has dedicated resources relative to the old architecture that shares them, this is worse.
OK, I got it now.
When AMD had 64-bit and Intel had only 32-bit, they tried to tell the world there was no need for 64-bit. Until they got 64-bit.
When AMD had IMC and Intel had FSB, they told the world "there is plenty of life left in the FSB" (actual quote, and yes, they had *math* to show it had more bandwidth). Until they got an IMC.
When AMD had dual core and Intel had single core, they told the world that consumers don't need multi core. Until they got dual core.
When Intel was using MCM, they said it was a better solution than native dies. Until they got native dies. (To be fair, we knocked *unconnected* MCM, and still do; we never knocked MCM as a technology, so hold your flames.)
by John Fruehe
I'll make it short and easy to understand. Original quote:
Which is 100% true, as K10 has more execution units. I don't see the words performance, shared or dedicated in this post. Then you say:
Which is wrong, based on the above. I just pointed it out, but it seems it was a perfect excuse to ignore what the guy is actually saying (as you like to do) and repeat the same post you've been repeating how many times now?
I hope you properly get it now.
Friends shouldn't let friends use Windows 7 until Microsoft fixes Windows Explorer (link)
Let's say their past history isn't as immaculate as you portray it. There is an alternate discussion of BD details on Ace's, and Paul Demone directly answers JF's claims:
Originally Posted by Paul Demone:
BD taped out a month or two ago. If they were lucky, silicon is mostly functional. If not, they are working overtime to fix it and get working samples. Silicon is being characterized and in the pre-validation stage.
In other words, benchmarks and performance take second place at this time; most important is getting a functional chip.
What this all means is that every claim about BD performance is based on estimates made without actual silicon in hand.
SUN Rock was meant to be the greatest chip of the past decade, with innovative features like transactional memory and scout threads. I still remember how ecstatic Jonathan Schwartz was over Rock.
Rock turned out to be a complete dud, burning 300 W with abysmal performance.
I have to wonder why Paul would even say this unless he just wants to argue:
Originally Posted by Paul Demone:
ROFL. A Niagara has higher "aggregate" (across all threads) IPC than a US-IV but far lower single thread performance. Listen for what a salesman doesn't say! Higher single thread performance than K10? Probably, but at far higher clock rates enabled by a deeper pipeline, simpler cores, and a process shrink.
The first section deals with repeating the whole "OK, so it's faster overall but what about single threaded work?! Ha!" We've already been told that BD is faster than the current gen at both, which he even acknowledges in the second half. As such, what point was there to even posting the first part? As for the second part, who cares? If the frequencies are higher due to the changes in the chip and that permits an overall faster singlethreaded and multithreaded experience than is possible with current designs, then why does it matter if the new chip ticks faster? I'm not saying I think that clock for clock the new part will be slower at this point, but even if it were it would be fine given that in the end it's still faster and not only clocked higher.
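Paul's Niagara-vs-US-IV point can be put in rough numbers. This is a toy illustration with made-up thread counts and per-thread IPC figures (not real measurements of either chip): a sea of slow threads can win on aggregate IPC while each individual thread is far slower.

```python
# Aggregate IPC vs. single-thread IPC, with illustrative (invented) figures.
def aggregate_ipc(threads, per_thread_ipc):
    """Total instructions retired per cycle across all running threads."""
    return threads * per_thread_ipc

niagara_like = aggregate_ipc(threads=32, per_thread_ipc=0.25)  # many slow threads
us_iv_like   = aggregate_ipc(threads=2,  per_thread_ipc=1.5)   # few fast threads

print("Niagara-like aggregate IPC:", niagara_like)
print("US-IV-like aggregate IPC: ", us_iv_like)
# The "faster" chip in aggregate is the slower one per thread (0.25 vs 1.5),
# which is exactly what a salesman's aggregate numbers would not mention.
```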
Particle's First Rule of Online Technical Discussion:
As a thread about any computer related subject has its length approach infinity, the likelihood and inevitability of a poorly constructed AMD vs. Intel fight also exponentially increases.
Rule 1A:
Likewise, the frequency of a car pseudoanalogy to explain a technical concept increases with thread length. This will make many people chuckle, as computer people are rarely knowledgeable about vehicular mechanics.
Rule 2:
When confronted with a post that is contrary to what a poster likes, believes, or most often wants to be correct, the poster will pick out only minor details that are largely irrelevant in an attempt to shut out the conflicting idea. The core of the post will be left alone since it isn't easy to contradict what the person is actually saying.
Rule 2A:
When a poster cannot properly refute a post they do not like (as described above), the poster will most likely invent fictitious counter-points and/or begin to attack the other's credibility in feeble ways that are dramatic but irrelevant. Do not underestimate this tactic, as in the online world this will sway many observers. Do not forget: Correctness is decided only by what is said last, the most loudly, or with greatest repetition.
Rule 3:
When it comes to computer news, 70% of Internet rumors are outright fabricated, 20% are inaccurate enough to simply be discarded, and about 10% are based in reality. Grains of salt--become familiar with them.
Remember: When debating online, everyone else is ALWAYS wrong if they do not agree with you!
Random Tip o' the Whatever
You just can't win. If your product offers feature A instead of B, people will moan how A is stupid and it didn't offer B. If your product offers B instead of A, they'll likewise complain and rant about how anyone's retarded cousin could figure out A is what the market wants.
http://flamewheelspin.ytmnd.com/
perfectly sums up this thread...
No need to; the point was simply to take the appropriate spoonful of salt with regard to marketing and performance claims for a future product.
Originally Posted by JF-AMD:
Not in the slightest.
First of all, nobody claimed BD will be slower than existing products either in performance overall or single threaded performance. Nobody brought in discussion the dedicated vs. shared resources but you, so false dilemma you have there.
The only point raised (by me, at least) was that, given the design trade-offs BD made (which I addressed in detail in a previous post; my POV, nothing more, and which David Kanter also mentioned in his article), it is expected that BD will lose slightly in per-clock performance compared to K10 in integer code. Overall performance of BD, including single-threaded, will no doubt be higher than K10's. But not per clock.
On what do you base your opinion? A deeper pipeline?
Are there any bits of info regarding cache inclusiveness/exclusiveness, other than the 16 kB L1D, which hints at an inclusive cache?
I'm still predicting an inclusive cache given the L1D size. There is no reason to stick to an exclusive cache, as it gives virtually no benefit because of the poor L2/L1 and L3/L2 ratios. It just slows every memory operation quite a bit while giving a marginal improvement in hit rate. Anyone with some knowledge on the performance penalty due to exclusive cache? I'd believe that an inclusive cache would bring more than enough to compensate for any loss the deeper pipeline could potentially cause, bringing the cache latencies near Nehalem numbers, if not better. SB seems to be a real badass on this, so I can't see BD getting near its latencies even with an inclusive cache.
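The trade-off being argued here can be sketched with a back-of-the-envelope AMAT (average memory access time) model. All latencies and hit rates below are invented for illustration, not real Bulldozer or Nehalem numbers; the point is only that when the L1 is tiny relative to the L2 (16 kB vs. megabytes), exclusion's extra effective capacity barely moves the L2 hit rate, while its swap traffic adds latency on every L1 miss.

```python
# Two-level AMAT model: illustrative numbers only, not real silicon data.
def amat(l1_hit_rate, l2_hit_rate, l1_lat, l2_lat, mem_lat):
    """Average memory access time (cycles) for a two-level cache hierarchy."""
    l1_miss = 1.0 - l1_hit_rate
    l2_miss = 1.0 - l2_hit_rate
    return l1_lat + l1_miss * (l2_lat + l2_miss * mem_lat)

# Exclusive: L2 holds no duplicates of L1 lines, so its effective hit rate
# is a hair better, but each L1 miss pays extra cycles for the L1<->L2 swap.
inclusive = amat(0.95, 0.80, l1_lat=4, l2_lat=18, mem_lat=150)
exclusive = amat(0.95, 0.81, l1_lat=4, l2_lat=22, mem_lat=150)
print(f"inclusive AMAT: {inclusive:.2f} cycles")
print(f"exclusive AMAT: {exclusive:.2f} cycles")
```

With a 16 kB L1 against a 1-2 MB L2, the duplicate lines recovered by exclusion are a rounding error on the L2 hit rate, which is the post's argument for going inclusive.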
Last edited by Calmatory; 08-31-2010 at 05:35 AM.
He was addressing JF's point about IPC being higher (Paul doubts that). I am surprised it isn't obvious.
As for the second part, who cares? If the frequencies are higher due to the changes in the chip and that permits an overall faster singlethreaded and multithreaded experience than is possible with current designs, then why does it matter if the new chip ticks faster? I'm not saying I think that clock for clock the new part will be slower at this point, but even if it were it would be fine given that in the end it's still faster and not only clocked higher.
Well, you see, neither I, Paul, nor others are interested in absolute values for benchmark scores. My interest is how they got there: the uarch, the trade-offs, the clever stuff done to hide bottlenecks, the corner cases, etc. I don't give a rat's ass if it scores 101 FPS in I-don't-know-what game or does SuperPi in -2 sec.
The fun is in analyzing the intentions and the implementation, not the end result. I take great pleasure in reading about Netburst, Prescott, Tejas, Nehalem (the first one), Tanglewood, Rock, etc., even if some were duds in the end. It may suck, but it was innovative and challenging.
Well, after readin' all the stuff about BD, my nooby chip expertise tells me that:
IPC will be improved at the same clocks compared to current AMD processors.
It will take less space per core
It will clock higher than the current crop of AMD processors.
It looks like it will be highly competitive in the server market, but behind in the 'gamers' segment (possibly close to matching today's Intels because of clockspeed, but not surpassing it in IPC).
obviously no one will know until it gets leaked.
I don't agree with the word "estimates"
A design is validated and debugged long before it goes to silicon. Validation
is done both by cycle accurate software simulation and FPGA hardware
emulation. An FPGA hardware implementation of the core, or entire processor,
can run at 10+ MHz and can be made cycle accurate. This is also how you do
performance tuning during the design phase itself.
Typically operating systems are booted and many software applications
are run long before you go to silicon.
About your link......
What in this musing from the investment-board inhabitants can be classified
as anything other than investor FUD, or as having any technical relevance concerning
the architectural details of Bulldozer?
Regards, Hans
~~~~ http://www.chip-architect.org ~~~~ http://www.physics-quest.org ~~~~
I don't get why you guys are so sure it can't offer more IPC than K10. I think it does make a difference that K10's three lanes could each act as either an ALU or an AGU, but not both simultaneously. Add to that the fact that many applications don't even keep a full set of ALUs/AGUs busy, so combined with a better prefetcher, Bulldozer should offer good IPC gains.
It's been confirmed many times over that 80% number is integer cores in a single module vs integer cores in different modules, and the performance is lost due to shared components in the modules, not due to weaker cores.
In fact, it's been said a couple of times by JF-AMD himself...
Sharing inevitably means communism for some... but not for me. If it brings a good product at an affordable price with a big improvement over the last product, I'm all for it, really.
And you're arguing with a man who works at the company whose product you've decided to pick at, and who is in contact with the engineers who built the damn thing...
I agree, and savantu doesn't help anyone get an objective view of the facts.
If JF said IPC will be better, it's true... Why? Simple: he doesn't want to be unemployed.
Good marketing is telling the truth... Henri Richard made some big mistakes, and now he no longer works for AMD.
Bad guys don't stay long...
Last edited by madcho; 08-31-2010 at 05:55 AM.
What if there are multiple threads with lots of AVX instructions? A single module can only feed one AVX instruction at a time, or two 128-bit SSEx instructions, or four 64-bit FPU instructions, right?
Up to <number of modules> threads running AVX, there should be no performance penalty as long as there are no other FP instructions in flight. The more there are, the lower the AVX performance will be. And if one adds more AVX threads, the FPUs will just starve and there will be no performance improvement?
In short: if I want to do lots of AVX, I can only run <number of modules> threads for improved performance?
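The saturation argument in the question can be sketched as a toy model. It assumes, as the post does, that one module issues at most one 256-bit AVX op per cycle regardless of how many threads it hosts; the module count is a made-up parameter, and real scheduling is of course messier than this.

```python
# Toy model of AVX throughput on a chip with a shared per-module FPU.
# Assumption (from the post, not a measured fact): one module can retire
# at most one 256-bit AVX op per cycle, however many threads share it.
def avx_throughput(modules, avx_threads):
    """256-bit AVX ops retired per cycle, chip-wide."""
    # Beyond one AVX thread per module, extra threads just time-slice the
    # FPUs: aggregate throughput flatlines at <number of modules>.
    return min(avx_threads, modules)

chip_modules = 4  # hypothetical 4-module / 8-core part
for threads in range(1, 9):
    print(threads, "AVX threads ->", avx_throughput(chip_modules, threads), "ops/cycle")
```

The printout plateaus at 4 ops/cycle once more than four AVX-heavy threads are running, which is the "FPUs just starve" scenario the post describes.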
Repeating for the 10th time already: 10h can retire (at the back end of the chip) 3 macro-ops, period. It has 9 execution units. There's your problem.
By the time this what-if comes to pass, there will be more powerful CPUs capable of doing more than a single AVX instruction per core, etc...
And anyway, isn't AVX better suited to massive multimedia tasks in terms of the way it processes the info? So even if the CPUs might be limited by that fact, they will most likely finish their job easily, right?
Quotes taken out of context can be true, but in context they can mean something different. Your quote was a response to my post about pipes. He is trying to make 3 pipelines appear like 6 pipes, which is a twist of the truth.
BD has more resources since it can use 2 ALUs and 2 AGUs every clock; Phenom II averages 1.5 ALUs and 1.5 AGUs since they share pipes. Again, if you can't use it, it isn't a resource. 2+2=4; (3+3)/2=3.
The discussion is still about IPC, even if you try to make it look different. And it's still about BD's integer execution capacity compared to K8 (10h); we are pointing out that BD's 4 pipes seem a bit stronger than K8's 3 pipes.
And by adding the different parts of K8's pipeline together, some people here are trying to make them look twice as strong.
4 pipes equals more resources than 3.
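The 2+2=4 vs. (3+3)/2=3 arithmetic above, spelled out. This just encodes the post's own simplification (a 50/50 ALU/AGU instruction mix, lanes counted as half-and-half when shared); it is not a pipeline simulation.

```python
# The post's resource arithmetic: K10's three integer lanes can each act as
# an ALU or an AGU, but not both in the same cycle, so on a mixed workload
# each lane averages half an ALU plus half an AGU. Bulldozer's integer core
# has 2 dedicated ALUs plus 2 dedicated AGUs.
K10_LANES = 3            # shared ALU/AGU lanes
BD_ALUS, BD_AGUS = 2, 2  # dedicated pipes per Bulldozer integer core

k10_alus = k10_agus = K10_LANES / 2   # 1.5 usable ALUs, 1.5 usable AGUs
bd_total = BD_ALUS + BD_AGUS          # 2 + 2 = 4
k10_total = k10_alus + k10_agus       # (3 + 3) / 2 = 3

print("BD usable pipes/clock: ", bd_total)
print("K10 usable pipes/clock:", k10_total)
```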
Last edited by -Boris-; 08-31-2010 at 06:24 AM.