Thanks for answering that JF. :rolleyes:
So if he's getting accurate numbers from engineering, then it's a simple thing for him to say, "BD numbers are better than existing chips". Simple right?
That's pretty much what he has said countless times already.
http://www.xtremesystems.org/forums/...&postcount=602
I don't have any reason to be in building 400. And it is better off that the marketing guy is not "dropping in" on them.
Our performance engineering team has done a really accurate job of performance modeling in the past; I have no reason to doubt them. Generally the worst that we see is too much conservatism, not too much optimism.
OK, so let me get the gist of all of this whole thread down to two statements:
1. People are claiming Bulldozer will be slower than existing products because they are sharing resources in the processor and sharing is inherently worse.
2. People are claiming that even though Bulldozer has dedicated resources relative to the old architecture that shares them, this is worse.
OK, I got it now.
I'll make it short and easy to understand. Original quote:
Which is 100% true, as K10 has more execution units. I don't see the words performance, shared or dedicated in this post. Then you say:
Which is wrong, based on the above. I just pointed it out, but it seems it was a perfect excuse to ignore what the guy is actually saying (as you like to do) and repeat the same post you've been repeating, how many times now?
I hope you properly get it now.
Let's say their past history isn't as immaculate as you portray it. There is an alternate discussion of BD details on Aces, and Paul Demone directly answers JF's claims:
BD taped out a month or two ago. If they were lucky, silicon is mostly functional. If not, they are working overtime to fix it and get working samples. Silicon is being characterized and is in the pre-validation stage.
Quote:
Originally Posted by Paul Demone
In other words, benchmarks and performance are second place at this time, most important is getting a functional chip.
What this all means, every claim about BD performance is based on estimates done without having actual silicon in hand.
Sun Rock was meant to be the greatest chip of the past decade, with innovative features like transactional memory and scout threads. I still remember how ecstatic Jonathan Schwartz was over Rock.
Rock turned out a complete dud, burning 300 W with abysmal performance.
Six year old quotes? I'm going to start filling Sandy Bridge threads with info on Netburst, that's cool with you guys right?
I have to wonder why Paul would even say this unless he just wants to argue:
The first section deals with repeating the whole "OK, so it's faster overall but what about single threaded work?! Ha!" We've already been told that BD is faster than the current gen at both, which he even acknowledges in the second half. As such, what point was there to even posting the first part? As for the second part, who cares? If the frequencies are higher due to the changes in the chip and that permits an overall faster singlethreaded and multithreaded experience than is possible with current designs, then why does it matter if the new chip ticks faster? I'm not saying I think that clock for clock the new part will be slower at this point, but even if it were it would be fine given that in the end it's still faster and not only clocked higher.
Quote:
ROFL. A Niagara has higher "aggregate" (across all threads) IPC than a US-IV
but far lower single thread performance. Listen for what a salesman doesn't
say! Higher single thread performance than K10? Probably, but at far higher
clock rates enabled by a deeper pipeline, simpler cores, and a process shrink.
http://flamewheelspin.ytmnd.com/
perfectly sums up this thread...
:))
No need to; the point was simply to take the appropriate grain of salt with regard to marketing and performance claims for a future product.
Not in the slightest.
Quote:
Originally Posted by JF-AMD
First of all, nobody claimed BD will be slower than existing products, either in overall performance or single threaded performance. Nobody brought dedicated vs. shared resources into the discussion but you, so that's a false dilemma you have there.
The only point raised (by me at least) was that, given the design trade-offs BD made (which I addressed in detail in a previous post; my POV, nothing more, and which David Kanter also mentioned in his article), it is expected that BD will lose slightly in performance per clock compared to K10 in integer code. Overall performance of BD, including single threaded, will no doubt be higher than K10. But not per clock.
What do you base your opinion on? Deeper pipeline?
Are there any bits of info regarding cache inclusiveness/exclusiveness, other than 16 kB L1D, which hints for inclusive cache?
I'm still predicting an inclusive cache given the L1D size. There is no reason to stick with an exclusive cache, as it gives virtually no benefit because of the poor L2/L1 and L3/L2 ratios. It just slows every memory operation quite a bit while giving a marginal improvement in hit rate. Anyone with some knowledge of the performance penalty due to exclusive caches? I'd believe that an inclusive cache would bring more than enough to compensate for any loss the deeper pipeline could potentially cause, bringing the cache latencies to near Nehalem numbers, if not better. SB seems to be a real badass on this, so I can't see BD getting near its latencies even with an inclusive cache.
He was addressing JF's point about IPC being higher (Paul doubts that). I am surprised it isn't obvious.
Well, you see, neither I, Paul, nor others are interested in absolute values of benchmark scores. My interest is how they got there: the uarch, the trade-offs, the clever stuff done to hide bottlenecks, the corner cases, etc. I don't give a rat's ass if it scores 101 FPS in I-don't-know-what-game or does SuperPi in -2 sec.
Quote:
As for the second part, who cares? If the frequencies are higher due to the changes in the chip and that permits an overall faster singlethreaded and multithreaded experience than is possible with current designs, then why does it matter if the new chip ticks faster? I'm not saying I think that clock for clock the new part will be slower at this point, but even if it were it would be fine given that in the end it's still faster and not only clocked higher.
The fun is analyzing the intentions and the implementation, not the end result. I take great pleasure in reading about Netburst, Prescott, Tejas, Nehalem (the first one), Tanglewood, Rock, etc., even if some were duds in the end. It may suck, but it was innovative and challenging.
Well, after readin' all the stuff about BD, my nooby chip expertise tells me that:
IPC will be improved at the same clocks compared to current AMD processors.
It will take less space per core
It will clock higher than the current crop of AMD processors.
It looks like it will be highly competitive in the server market, but behind in the 'gamers' segment (possibly close to matching today's Intels because of clockspeed, but not surpassing it in IPC).
obviously no one will know until it gets leaked.
I don't agree with the word "estimates"
A design is validated and debugged long before it goes to silicon. Validation
is done both by cycle accurate software simulation and FPGA hardware
emulation. An FPGA hardware implementation of the core, or entire processor,
can run at 10+ MHz and can be made cycle accurate. This is also how you do
performance tuning during the design phase itself.
Typically operating systems are booted and many software applications
are run long before you go to silicon.
About your link......
What in these musings from the investment-board inhabitants can be classified
as anything but investor FUD, or of any technical relevance to
the architectural details of Bulldozer?
Regards, Hans
I don't get why you guys are so sure it can't offer more IPC than K10. I think it does make a difference that the 3 pipes could be used as either ALUs or AGUs, but not both simultaneously. Add the fact that many applications don't even saturate a full ALU/AGU, so combined with a better prefetcher, Bulldozer should offer good IPC gains.
It's been confirmed many times over that 80% number is integer cores in a single module vs integer cores in different modules, and the performance is lost due to shared components in the modules, not due to weaker cores.
In fact, it's been said a couple of times by JF-AMD himself...
Sharing must inevitably mean communism for some... but not for me. If it brings a good product at an affordable price with a big improvement over the last product, I'm all for it, really.
Arguing with the man who works at the company whose product you decided to pick on... and said person talks with the engineers who built the damn thing...
I agree, and savantu doesn't help one keep an objective view of the facts.
If JF said IPC will be better, it's true... Why? Simple: he doesn't want to be unemployed.
Good marketing is telling the truth... Henri Richard made some big mistakes, and now he doesn't work for AMD anymore.
Bad guys don't stay around for long...
What if there are multiple threads with lots of AVX instructions? Single module can only feed one AVX instruction at a time, or two 128-bit SSEx instructions, or 4 64-bit FPU instructions, right?
Up to <number of modules> threads running AVX, there should be no performance penalty as long as there are no other FP instructions in flight. The more there are, the lower the AVX performance will be. And if one adds more AVX threads, the FPU units will just starve and there is no performance improvement?
In short: If I want to do lots of AVX, I can only run <number of modules> threads for improved performance?
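To make the sharing arithmetic in the question above concrete, here is a toy Python model. The one-256-bit-AVX-op-per-module-per-cycle figure is this thread's reading of the design, not a confirmed AMD number:

```python
# Toy throughput model for AVX threads on shared-FPU modules.
# Assumption (from the post above, not confirmed): each module's shared
# FPU can issue one 256-bit AVX op per cycle, regardless of how many
# threads are scheduled on that module.

def avx_ops_per_cycle(threads, modules):
    """Chip-wide 256-bit AVX ops per cycle under the above assumption."""
    # Each module contributes at most one AVX op per cycle, so threads
    # beyond <number of modules> just starve.
    return min(threads, modules)

# Scaling is linear up to <number of modules> threads, then flat:
print([avx_ops_per_cycle(t, modules=4) for t in range(1, 9)])
# → [1, 2, 3, 4, 4, 4, 4, 4]
```

Under that assumption, the answer to the question is yes: past one AVX thread per module, aggregate AVX throughput stops improving.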
Repeating for the 10th time already: 10h can retire (at the back end of the chip) 3 macro-ops, period. It has 9 execution units. There's your problem.
By the time this what-if matters, there will be more powerful CPUs capable of doing more than a single AVX instruction per core, etc...
And anyway, isn't AVX better suited to massive multimedia tasks in terms of how it processes the data? So even if the CPUs might be limited by that fact, they will most likely finish their job easily, right?
Quotes taken out of context can be true, but in context they can mean something different. Your quote was a response to my post about pipes. He is trying to make 3 pipelines appear like 6 pipes. Which is a twist to the truth.
BD has more resources since it can use 2 ALUs and 2 AGUs every clock; Phenom II averages 1.5 ALUs and 1.5 AGUs since they share pipes. Again, if you can't use it, it isn't a resource. 2+2=4, (3+3)/2=3.
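That arithmetic, spelled out (these are the post's own numbers, i.e. one side of this debate, not settled fact):

```python
# Per-thread integer pipes, per the post's accounting (one side of the
# argument in this thread, not an agreed-upon fact).
bd_per_thread = 2 + 2          # BD: 2 ALUs + 2 AGUs, dedicated every clock

# Phenom II, as the post counts it: 3 pipes doing double ALU/AGU duty,
# i.e. an average of 1.5 ALUs + 1.5 AGUs available at once.
k10_per_thread = (3 + 3) / 2

print(bd_per_thread, k10_per_thread)   # → 4 3.0
```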
The discussion is still around IPC, even if you try to make it look different. And it's still about BD's integer execution capacity compared to K8 (10h); we are pointing out that BD's 4 pipes seem a bit stronger than K8's 3 pipes.
And by adding the different parts of K8's pipeline together, some people here are trying to make them look twice as strong.
4 pipes equals more resources than 3.
Is this clear enough?
http://www.realworldtech.com/include...ulldozer-4.png
Each integer core takes 4 macro-ops from the dispatch group buffers, while each 10h (Istanbul) core takes 3 macro-ops.
Where did you get those slides? :O They are more than clear about how the arch works compared to Westmere and Istanbul... Thanks
From here. It's a very good article ;)
Its not like it hadn't been already posted here in this thread:
http://www.xtremesystems.org/forums/...&postcount=685
I didn't see the post :)
PS: I read it from the first day it got published, http://www.google.com/realtime ;)
The thing is that it uses them. If the CPU can't use all 6 at the same time, that's another thing; all 6 will get used at some point. Either way, they are on the die, they're connected, and they are used. Maybe not all at the same time, whatever. But they are there, they are used, and thus they are a resource. K10 has more resources than BD (integer "clusters").
Instructions per clock (compared to K10). Frequency doesn't matter, this is per clock:
IPC (CPU level) --> Will be higher, more "modules", double integer resources per "module", less resources per integer "cluster", better use of available resources per integer "cluster".
IPC ("module" level) --> Will be higher, double integer resources per "module", less resources per integer "cluster", better use of available resources per integer "cluster".
IPC (single integer "cluster") --> Less resources, better use of available resources. Higher or lower instructions per clock?
The bold part is likely lower, and that's exactly what savantu, terrace and others are discussing here: IPC per integer "cluster". We don't know for sure, since JF just says "IPC will be higher". At which of the previous levels? After all the BS, bans, etc. he still hasn't answered this question.
Now, if you throw frequency into the mix, knowing that it will be higher than current K10 CPUs, of course you can say single integer "cluster" performance is higher. Just notice how he never uses IPC + higher + per integer "cluster" in the same sentence. The only info we have about single thread performance is that it will "be higher". Of course, because of the higher frequency, not because IPC is higher.
JF just has to answer the question and this debate is going to end fast: IPC per integer cluster has been increased or not? No BS, just yes or no.
Stargazer, can you read or not? The man said IPC will be higher and single thread performance will be higher. Can't you just stop beating the dead horse already? It's dead, alright?
K10 has a clear bottleneck in the retirement unit. It has a massive 9 execution units available (3 ALU, 3 AGU, 3 FPU) but can retire only 3 macro-ops per cycle.
What would make the IPC lower on "integer cluster level"? Deeper pipeline + "less" resources, L1D cut down by 75%?
Neither of those. Fewer absolute resources, but more usable resources per thread. This alone could possibly compensate for any IPC loss caused by the deeper pipeline, let alone the improvements in other areas. If the cache is actually inclusive, then that alone would compensate for every possible CPU-level change that could reduce IPC, even the ones the fiercest Intel fan could think of.
The potential integer throughput of those 2 ALU/2 AGU pipes says very little about IPC, let alone single-thread performance or whole-product performance. All you'd need is slightly faster cache access plus more aggressive prefetching and branch prediction to get a 10% IPC increase even with a 10% penalty on the "integer clusters".
What BS? He has already stated both single thread and multithreaded performance are both higher.
informal, some people can't believe their eyes.
Anyway, yeah, read the article at anand. It's quite clear about uarch changes there (and probably the only site other than realworldtech which bothered to make their own diagrams to help better understand). I haven't read the one on realworldtech yet but judging from that pic at post 726, it might be good.
David Kanter over at Real World Tech has a writeup about Bulldozer's uArch.
http://www.realworldtech.com/page.cf...WT082610181333
I figured this didn't have to be posted as a completely new thread.
2 pages is a few? :D
And finally we got a statement; it's the first time he explicitly mentioned this and the question was answered, ironically after the one who asked the question first was banned... :rolleyes:
If he had done so much earlier, we could have saved at least 15 pages of nonsense... anyway, I'm satisfied with the answer and there is nothing more to ask.
Or the guy who made 15 pages of nonsense could have read a few articles on the front page before starting on assumptions (and indirectly accusing people/slides of lying). Two ways of looking at it... :rolleyes:
1) As for what percentage improvement could be seen... JF has already said that with 33% more cores, a 50% performance gain on server workloads could be seen. This is the only information JF is willing to share, and unless you hold Intel stock or work for them, I see no reason why you'd press so hard for that information... which he already explained he couldn't share owing to the product being some time away from launch (I assume about a good 2 quarters or so...). Personally speaking, AMD wouldn't want Intel to have information on an upcoming product, as it would give Intel an edge and possibly a chance to outmaneuver them. It works the same way in the opposite direction... The only time Intel leaked information on an upcoming architecture (remember C2D) was when AMD was kicking them around left, right and center, in all segments of the market... Now if Intel finds out stuff, they could possibly evolve a new pricing strategy (given their scale and market share, it's easier now) or something else to counter a competitive product. Competitive BD is... :up:
2) IPC is higher... :yepp:
3) IPC compared to previous AMD architectures is higher... he said as much... many times over...
to be fair, historically companies that hide information until days before launch tend to have problems with their product, especially the ones who have many delays. Even if BD does have a sizeable increase over k10.5, which it should, I honestly don't think it will be enough to compete with Sandy Bridge.
That preview by Intel was a red cape for AMD to charge at, and I'm willing to bet that if they had a better product, they would have released their own preview, challenging for the top spot. My guess is that BD will be a fine product, just still not as powerful as Intel's in terms of pure performance. To me it seems it's more about power efficiency, as JF keeps mentioning 50% more performance from 33% more cores. Well, why not 100% from 33% more cores? Because the thermal envelopes would just be too high, not to mention the power draw would be astronomical, considering they don't have a working 32nm process.
At least from my perspective, it seems to me that AMD is done challenging for the top enthusiast performance spot. They seem to have shifted onto a new direction, trying to offer the most performance per dollar, especially over the long run when you consider electricity bills. That's quite reasonable, as Intel has far more money spent on their fabrication process, and thus have denser, faster caches which seriously helps out on applications like Super Pi.
Is the OS aware of the cores sharing resources? If 2 cores of a module have 80% of the performance of two independent cores, then when an application uses 2 threads (most games, for example), will the OS schedule them on two different modules or on a single module?
I'm reading the thread, sorry if it has already been answered...
No one is sure, all JF has said is that AMD is working with MS to devise core utilization order etc.
I would imagine, that ideally for multithreaded tasks you would want the same module due to the shared L2, but for separate tasks you would want different modules due to the performance loss from sharing components
At the same time as you have a performance loss from shared components, you have a boost from Turbo. If four threads run on one module each, you will have no turbo, since turbo management is at the module level, not the core level. If all threads run on two modules, you will have a 10% performance hit, but Turbo will make up for that and more.
I suspect we'll be seeing an AMD lineup using 8 cores vs 4 cores, even if that means 4 modules. The AMD cores within the modules are certainly more core-like than HT.
At the end of the day, the prices will be set based on workloads and how it copes with them. If a 4 module/8 core AMD chip (at say 2.5GHz) can deal with the same workload as a 4 core (8 HT) Intel chip (at 2GHz), then that will be its price window (speed values for argument's sake etc).
If you run one thread per module, all modules work at the same time and none rest, therefore no module can enter turbo. But if two modules each run two threads, then two modules rest, and if two modules rest, the other two can enter turbo mode.
You can't have turbo and all modules working at the same time; the fact that part of a module is idle doesn't matter, since turbo works at the module level.
And it's said everywhere that a second thread run in a module "only" increases performance by 80%. That is a 10% performance loss compared to a traditional dual core approach.
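A toy model of that trade-off, using this thread's numbers (the 80% second-thread scaling; the turbo bump below is a made-up placeholder, since the real figure is unknown):

```python
# Toy model: 4 threads on a 4-module BD, using this thread's numbers
# (a second thread on a module adds only 0.8 of a core). The turbo
# multipliers are hypothetical; AMD has not disclosed them.

def spread_out(base_ghz):
    # One thread per module: 4 full cores' worth of throughput, but no
    # module is idle, so (per the posts above) no turbo.
    return 4 * 1.0 * base_ghz

def packed(base_ghz, turbo=1.10):
    # Two threads on each of two busy modules: each busy module gives
    # 1.8 cores' worth, and the two idle modules allow turbo.
    return 2 * 1.8 * base_ghz * turbo

# Whether packing wins depends entirely on how big turbo is:
print(spread_out(3.4))             # 13.6 "core-GHz"
print(round(packed(3.4), 2))       # 13.46 with a hypothetical +10% turbo
print(round(packed(3.4, 1.2), 2))  # 14.69 with a hypothetical +20% turbo
```

So "Turbo makes up for the 10% hit and more" holds only if the turbo bump exceeds roughly 11%; below that, spreading threads out wins.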
I doubt that would happen, though, due to manufacturing costs. I have to believe 1 module is bigger than 1 Intel core. For consumers, the Intel core would make more sense, whereas for servers the module would make more sense, as you are comparing 130% to 180% of the integer performance. Thus, since server processors are always priced with much higher margins in mind, they could probably line up their processors that way, so even if Intel's IPC is 10% faster, they would still win the performance battle.
However, I just can't see AMD being able to price their products as you described for the general consumer and still make a profit, especially when Intel is at 32nm whereas AMD is stuck at 45nm. Even if they could, if the Intel product offers anywhere from 5-20% more IPC, I would just buy an unlocked K-series processor and be happy with that. Having anything beyond 4 threads is pretty much useless for me, so single threaded performance is what will earn my money.
I would not make assumptions about how our processor works based on how our competitor has implemented technology.
As you may (or may not) be aware, I was critical of the way that they implemented turbo. I am happy with the way that we have implemented it. I can't get into specifics, but I can assure you that when you look at the two implementations, you will see a clear difference and you'll appreciate what we have done with the technology.
I hate to say things like that without being able to disclose any of the detail, but more than that I hate people going down the path of assuming things about our product that might not be fully accurate. It's a fine line.
Just keep in mind that this is a brand new architecture and things are going to be approached from a different perspective. The modularity is only one small part of it; there are a lot of things that have been changed.
People have been asking for someone to really bring some real innovation to the market, I think you will see that.
It is working at a module level, but that is all we know. There are many things AMD didn't reveal, for obvious reasons.
This is quite a fascinating architecture. If that RWT article is accurate then I am extremely interested in seeing some benchmarks.
I don't buy that overall per-core IPC must necessarily decrease (relative to K10) because of fewer integer ALUs. Of course they will obviously miss out, compared to a 3 or 4 ALU core, in cases where int ILP is greater than 2. But in cases where the code is a more even mix of int and memory ops, IPC could go up relative to K10, based on available execution resources alone. Which case is more common obviously depends on the specific code being run. Though I'd suggest that a program with consistently high integer ILP would be more efficient using packed integers (handled by the FPU) anyway.
If we add to that the fact that missed branches and cache misses (both significantly improved in BD) have a much greater effect on overall IPC than some missed ILP cases, it's clear that claiming lower IPC than K10 isn't really justified based on fewer ALUs alone. I doubt that BD will have lower IPC per-core than K10. In reality it's probably somewhere in the vast gulf between PII and SB.
As already noted though, IPC isn't the only factor in a processor's performance. This is obviously a high frequency design. The memory and cache subsystems are a big leap forward for AMD. They are designed to keep a large number of cores well fed - to minimize the amount of time that execution resources are waiting on data and thus increase efficiency. Intel will probably continue to lead in IPC by a significant margin. Whether AMD can increase frequency enough to make single threaded performance competitive remains to be seen. On the multi-threaded side BD sounds like a monster.
If AMD can't match Intel's single threaded performance it looks like we will have a split market come 2011. Office users and gamers might do best with SB while people doing encoding, folding, heavy multitasking, HPC, and servers might do best with BD.
While I agree with everything else you put (that RWT article is a must read for anyone who hasn't), I would say this last statement is wrong.
I suspect that margins will be significantly lower for gamers/office users (although will bobcat/llano fill the office space?). It could be a great result for overclockers, as we'll have access to decent multicore tech, that should have a bit of room to mess with.
So unless Intel go for a price war, all AMD has to do is price match on a performance level.
It's only people wanting the absolute max who care about who has the best CPU. The mainstream gamer just wants to spend £200 on a CPU and make sure it is competitive with other CPUs around that price break.
Why don't we look at the argument from another view point.
Show me the source code to 1 program which can sustain under optimal conditions an IPC greater than 1.8, for which multi-threading isn't a better solution.
For those of you smart enough to actually wonder what makes IPC greater than 1 possible [in source code], let me save you a long winding trip and give you the answer: such a beast DOES NOT EXIST.
Let me think about that.
Hans wrote for the K8:
http://chip-architect.com/news/2003_...it_Core.html#3
Quote:
Each Scheduler can launch one ALU and one AGU operation per cycle. The ALU operation may come from one x86 instruction while the AGU operation may come from another.
That is not 1.5, that is 3... maybe you missed the fact that the MacroOps are split into µOps at that stage?
Correct, there are 3 full integer lanes in K8 and on that can do either ALU or AGU work, but as I understand it, it is more efficient, due to improved prefetchers and smaller die sizes, to use a 2+2 simplified design.
JF, so each BD is faster clock per clock than the Phenom cores? Or is it by just comparing the top clocked frequency processors of each product line?
No, it is not "either", it is both... what do you not understand in the quote from Hans' article?
That is correct; the typical IPC of usual code is around 1. I think Nehalem achieves 1.5-1.7 in the best cases, thus: 2 pipes are enough :)
Quote:
but as I understand it is more efficient due to improved prefetchers and smaller die sizes to use a 2+2 simplified design
Yes you are right, but I never said anything against that point ;-)
Maybe one note on that, because I read it earlier: the AGU results are not retired, they go immediately into the LD/STR units, so the waiting µOp can get its mem-data ;-) Later, after the calculation of the µOp is finished, that µOp is retired.
So in short the retire / ExU ratio is 1:2 for both, not 1:3. For K10 it's (3:6) and for BD it's (4:8).
AMD to Test Upcoming Bobcat Processors in Servers
Quote:
"We're definitely in the process of examining this as a design point," said Donald Newell, AMD's new server chief technology officer, in an interview. "It would be foolish not to."
Quote:
"There's only a few papers ... and there's a lot more data to collect," Newell said. "It really depends on a number of factors ... to whether or not that's a good design point."
Quote:
"It's hard for Arm to move up in the server world, like x86 would be to move down to dishwashers," Newell said.
AMD also is looking to mold graphics processors and separate accelerator units into its server offerings. Right now GPUs and accelerators are designed for specialist computing needs, but the company wants to build chips where all the architectural elements flawlessly work together, Newell said.
"We're definitely in the process of examining this as a design point" is not the same as actually testing Bobcat in a server environment. Today's journalists really like to twist words and jump to (wrong) conclusions. To get to actual testing, they first have to see if it makes sense at all.
That's correct :). BTW, I'm sure AMD at least investigated the other ALU/AGU possibilities and came out with the most efficient one. Wasted resources and power / diminishing returns is not what they would want from a design like Bulldozer, especially with the clock targets they have in mind :).
I have debunked this in several places. We are NOT "testing" bobcat in servers.
We are looking at the market to determine whether there is a place for it. It would be irresponsible not to consider every piece of silicon and IP that we have access to. But, as Bobcat is defined today, it does not meet the needs of the server market, just as Atom and ARM are coming up short as well. When you can get six cores @ 35W TDP in an Opteron 4000, why would you want to build more servers and have more physical hardware? The folks looking at really low power environments are looking at embedded, or they are looking to reduce management and power costs. 12 cores @ 35W/CPU in a single server makes a lot more sense than 6 low power (and low performance) dual core 1P servers. When you talk to the big cloud guys, core density is critical because that means fewer systems to manage.
Anybody who wants to speculate about clock rates ?
Just remembered IBM's 4.25 GHz POWER7 8-core chip with 4x SMT. That is on 45nm :eek:
So far I thought 5 GHz for BD was fanboy dreaming, but compared to that monstrous 45nm chip it seems rather reasonable now: a smaller BD die produced on 32nm with high-k/metal gate should be able to achieve that.
What do you think? Is it OK to speculate on x86 clocks by comparing to POWER / RISC numbers?
@informal:
I agree totally ;-)
Thanks
I have a hard time believing 5GHz stock, as that's just never been done before that I can recall. However, Intel's Sandy Bridge lineup covers 2.5-3.4GHz, and assuming they will have an IPC advantage, AMD may end up covering 3-4GHz (numbers pre-Turbo on both sides).
Overclocking BD should be fun if it is truly a high frequency design. Even though Netburst CPUs are just about worthless in terms of performance, they are still some of the most fun to mess with. AMD could perhaps combine the best of both worlds and give it more IPC than K10.5 while still making it clock like the P4s (that would be a major win among enthusiasts now that Intel is locking the FSB).
I will give it a try :D
@95W envelope we have 6 cores done on 45nm working @ 2.8Ghz. If BD was done on the same node I guess ,with the targeted 20% in clock speed due to pipeline changes, we could have 2.8x1.2=3.36 or round up to 3.4Ghz.BUT,it will go to 32nm highK/mg instead.I would still pick the same clock and power draw values just to be conservative(let's disregard the 45->32nm node improvement since we have 33% more cores).That's a 4 module part. Now,if count in 10-15% IPC improvement(pick average 12.5) and 33% more cores and at last divide by 1.1(10%) for the "performance hit" in fully loaded modules,in multithreaded workloads we get an equivalent performance of 4.65Ghz X6 Thuban .This is with no Turbo over stock.
Now,with the new Turbo(<=1/2 of the cores are idle,picking Thuban's Turbo conditions),I would expect ~20-30% clock increase,take a 25% as middle .We get => 3.4x1.25=4.25Ghz in poorly threaded or single threaded applications.Now add the speculated 10-15% IPC jump(pick 12.5 as arithm. mean value) to get the equivalent Thuban class core clock=> 4.25x1.125~=4.8Ghz Thuban in single threaded workloads(no 10% hit here).If the power gating happens in a way so that 2 modules are gated,we have the 10% hit due to core scaling in modules => 4.8/1.1=4.36Ghz Thuban class core speed in poorly threaded workloads(1<no. of threads active<=4).
So to sum it up, I expect a 95W 3.4 GHz "X8" Bulldozer model, with a 4.25 GHz effective Turbo and a 10-15% (pick 12.5%) IPC jump. This would be equal to:
- a 4.8 GHz Thuban in purely single-threaded workloads,
- a 4.36 GHz Thuban-class core in poorly threaded workloads, and
- a 4.65 GHz X6 Thuban in multithreaded workloads.
In the 125W range I would expect 3.6 and 3.8 GHz models, and if they really want to push the limit, a 4 GHz 125W model. Turbo would be smaller percentage-wise, and similar or slightly lower frequency-wise, than in the earlier example. So effectively just add 0.2, 0.4 and 0.6 GHz on top of the three "equivalent Thuban" numbers above and you will have a projection of how these 125 or 140W parts could perform (the top model, the hypothetical 125/140W 4 GHz one, could easily be equivalent to a 4.7-5.4 GHz Thuban-class core, depending on the workload).
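The back-of-the-envelope math above fits in a few lines of Python. This is purely a sketch of the post's speculation; the 12.5% IPC gain, 25% Turbo uplift, and 10% module-sharing penalty are all assumed figures from the post, not confirmed numbers:

```python
# Speculative Bulldozer -> Thuban-equivalent clock estimates,
# using the assumed figures from the post (not official numbers).
BASE_CLOCK = 3.4      # GHz, assumed 95W 4-module part
IPC_GAIN = 1.125      # assumed 10-15% IPC jump, 12.5% midpoint
TURBO = 1.25          # assumed 20-30% Turbo uplift, 25% midpoint
MODULE_PENALTY = 1.1  # assumed 10% hit with both cores of a module loaded
CORE_RATIO = 8 / 6    # 8 BD cores vs. a 6-core Thuban

# Single-threaded: Turbo clock plus IPC gain, no module sharing.
single = BASE_CLOCK * TURBO * IPC_GAIN
# Poorly threaded (2-4 threads): same, but modules are shared.
poor = single / MODULE_PENALTY
# Fully multithreaded: stock clock, IPC gain, 33% more cores, shared modules.
multi = BASE_CLOCK * IPC_GAIN * CORE_RATIO / MODULE_PENALTY

print(f"single-threaded ~ {single:.2f} GHz Thuban")     # ~4.78
print(f"poorly threaded ~ {poor:.2f} GHz Thuban")       # ~4.35
print(f"multithreaded   ~ {multi:.2f} GHz X6 Thuban")   # ~4.64
```

Which lands on the same ballpark as the rounded 4.8 / 4.36 / 4.65 GHz figures in the post.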
Enough of xtreme speculation from me :)
Assuming that the "50% more performance with 33% more cores" refers to IPC, we get a 12.5% increase in IPC relative to K10h. Estimate that the area of a module on 32nm is the same as a core of the previous generation, and that the power envelope (just of the cores now, not the whole chip) is the same for the same area as the previous generation. If we add a 20% higher frequency for the same power envelope, we've got a 35% increase for the same thermal envelope.
Each module has 30mm^2, so the total will be 120mm^2 for a 4-module part. Plus some 8MB of L3 cache at, say, 60mm^2, we have 180mm^2. The previous generation put out 125W at 3.4 GHz, so this one will be 4.1 GHz at 95W, with Turbo at 5 GHz.
Let's look at it performance-wise. With 4 modules / 8 cores, a Bulldozer will be 70% faster while consuming 30% less than a 3.4 GHz PhII.
The IPC of SB is 50% higher than PhII's, but it won't clock as high as BD. At 95W, a 4-core will be 3.3 GHz. So at this power envelope, BD will be 20% faster than SB, with about the same die area as SB, or slightly smaller.
Of course, Intel will release an 8-core SB, but its die area should be around 320mm^2, and there's no way its power consumption at 3.3 GHz will be lower than 150W. For servers, Intel must counter with at least a 10-core part, absolute minimum.
So, you see, BD will be a competitor for Ivy Bridge, not Sandy Bridge.
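The per-core uplift and die-area arithmetic in that post can be checked quickly. All inputs here are the post's own assumptions (30mm^2 per module, 60mm^2 of L3, 20% frequency gain at equal power), not known specifications:

```python
# Per-core IPC gain implied by "50% more performance with 33% more cores".
ipc_gain = 1.50 / (4 / 3)     # -> 1.125, i.e. 12.5%
# Assumed 20% frequency gain at the same power envelope.
clock_gain = 1.20
same_tdp_uplift = ipc_gain * clock_gain

# Hypothetical die-area estimate from the post's assumed figures.
module_area = 30              # mm^2 per module (assumption)
l3_area = 60                  # mm^2 for 8MB of L3 (assumption)
die = 4 * module_area + l3_area

print(f"IPC gain:        {ipc_gain:.3f}x")        # 1.125x
print(f"Same-TDP uplift: {same_tdp_uplift:.2f}x") # 1.35x
print(f"Estimated die:   {die} mm^2")             # 180 mm^2
```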
As you rightly said... Intel chippery has faster and denser caches, which help in most desktop situations... AMD will be good, but beat Intel? Not unless some multi-threading is thrown into the picture...
Then again, this is off-topic, but you've got to look beyond the architecture to see whether the binaries involved in building the software are any bother, and how much... As far as I'm aware, the latest Intel binaries do not allow AVX to work on any chip other than "GenuineIntel". This should shut up fanboys from both sides :P
Yes, in the server arena, which is the most lucrative for both Intel and AMD, BD will help AMD gain a competitive edge... People aren't yet using MC as much, as the big OEM partners are yet to come out with servers featuring MC at its best... However, will they be able to resist BD? Actually, Eagleton could not come any sooner for Intel... but as far as I've learned, it would be based on Sandy Bridge and not Ivy Bridge (please correct me if I'm wrong on this bit). What I'm saying is, BD is going to be a major win for AMD in the server arena, which is where they'll make most of their money.
AMD is keeping quiet because the product is 2 or more quarters away, and saying more would give Intel enough time to out-maneuver them in the market on factors like price, limiting options for AMD... Hence my theory of "Intel employee" when people want to know more than AMD has already offered :P
Seems like we all know what we need to know then. :D
In this very thread people discuss frequencies of 3.3-4 GHz for BD, which is significantly higher than MC (max 2.3 GHz).
The 50% more performance, 33% more cores applies versus Magny Cours. You also need to factor in frequency since this was the unknown part in the AMD slide.
The 8-core version will most likely be @ 3.0, maybe 3.2... the 6-core 3.4 and higher... and on and on.
I considered that the clock advantage would be related to the core area, which is really small in BD, such that 2 BD cores take about the same die space as one PhII core. So the clock advantage is even higher from a per-core point of view.
So, if the clocks are as low as you say, BD is more or less tied with, or a bit below, SB in perf per watt.
A 16-core part matching MC would run at 2.6 GHz, by my reasoning.
On the MCM part they are still bound by TDP... glue 2 of those dies together and you need to lower your clocks considerably to stay in the desired TDP, so I don't think you will see clocks higher than 2.3 GHz for the 16-core version...
Maybe 2.5 GHz for the 12-core MCM part on 32nm... and higher the fewer cores they have.
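The TDP argument in the last two posts can be sketched with a simple scaling model. This is purely illustrative: it assumes total power scales linearly with core count and roughly quadratically with clock (the exponent is a hand-wavy assumption; real voltage/frequency curves vary by process and part):

```python
def tdp_limited_clock(ref_clock, ref_cores, new_cores, exponent=2.0):
    """Estimate the clock a part with `new_cores` cores could sustain in
    the same TDP as a reference part, assuming power ~ cores * clock**exponent.
    The exponent is an assumption, not a measured value."""
    return ref_clock * (ref_cores / new_cores) ** (1.0 / exponent)

# Example: start from a hypothetical 8-core die at 3.0 GHz and glue two
# together (MCM) into a 16-core part at the same TDP.
print(f"16-core MCM: ~{tdp_limited_clock(3.0, 8, 16):.2f} GHz")  # ~2.12
print(f"12-core MCM: ~{tdp_limited_clock(3.0, 8, 12):.2f} GHz")  # ~2.45
```

With these assumed inputs the model lands in the same neighborhood as the "not above 2.3 GHz for 16 cores, maybe 2.5 for 12" guess above.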