AMD: 32nm issues fixed

**Sn0wm@n** · 11-25-2010, 06:46 AM

what happened to saaya lately ????

**Dresdenboy** · 11-25-2010, 07:43 AM

Originally Posted by Dresdenboy

But already several people including me analyzed the photoshopped die photo and found it to be in the ballpark of 300 sqmm (+/-20) in size. My analysis based on L2 cell size, I/O cell size (as Hans de Vries did) and L1 I$ size relations also resulted in a module area of ~17-19 sqmm and a L2 area of a little more than 10 sqmm. So this estimation lands at roughly 28 sqmm for a module with cache. Take this with a grain of salt.

http://citavia.blog.de/2010/11/08/la...t-day-9933368/
Ok, the salt was worth 3 sqmm

**JF-AMD** · 11-25-2010, 10:33 AM

Well, we do know it is smaller than a Lisbon die.

**nn_step** · 11-25-2010, 10:40 PM

This may be completely off base but it sounds to me that bulldozer is virtually just two bobcats fused together but sharing Micro-code and the floating point unit.

**informal** · 11-26-2010, 12:10 AM

The problem with the above statement is that Bobcat is actually slower per core than 10h while BD is faster. There are some similarities but overall they are not comparable (die size difference per core is one big indicator).

**JF-AMD** · 11-26-2010, 04:37 AM

Originally Posted by nn_step

This may be completely off base but it sounds to me that bulldozer is virtually just two bobcats fused together but sharing Micro-code and the floating point unit.

No, they are completely different cores designed by completely different teams.

**Opteron146** · 11-26-2010, 04:52 AM

Originally Posted by saaya

hmmm well 1333 to 1600 is a 25% boost, so only a 20% bandwidth boost sounds weird, and a 30% "memory performance" boost doesnt sound very impressive... to me it looks like an updated version, not even tweaked a lot just some stuff added and then clocked higher... which isnt impressive seeing as current imcs can already do 2000,,, but ok, we will see...

IF you think that +30% is not significant, then all intel processors since Core2 are "unimpressive" ;-)

of course not... thats why you dont do a platform for ocers but recycle one like intel did...

I meant the same thing, AMD could recycle their triple channel C42 (or whatever the name is) platform, too. But even that costs money, not sure how many board manufacturers would build boards for it.
Well - it will depend heavily on the chip .. if it is fast enough, AMD would snatch lots of that small enthusiast market segment. But it will be hard against Intel's 22nm CPUs.

of course not, but isnt there something in between? at the edge is good, but right at the edge... im just saying after reading about all that whisker stuff and thermal stress on solder balls with memory chips (worked for a module house) it looks like a possible problem to me... especially thinking of nvidias bumpgate disaster in the past few years...

Why should nVidia's problem be of AMD's concern ? If it would have been ATi - then you could make at least a small connection .. but nVidia ?

Originally Posted by nn_step

This may be completely off base but it sounds to me that bulldozer is virtually just two bobcats fused together

With the same statement you could argue that a Corvette car with an 8 cylinder is just two Toyota Prius with 4 cylinders fused together .. does it make sense ? ;-)

**Manicdan** · 11-26-2010, 07:29 AM

Originally Posted by Opteron146

With the same statement you could argue that a Corvette car with an 8 cylinder is just two Toyota Prius with 4 cylinders fused together .. does it make sense ? ;-)

comparing a corvette to a prius? im pretty sure ive killed for less

also the improvement of the memory bandwidth with dual channel reduces the server cost for people who need high speed, now do it with 33% less ram sticks. sounds like a good deal to me.

**nn_step** · 11-26-2010, 03:27 PM

Originally Posted by Opteron146

With the same statement you could argue that a Corvette car with an 8 cylinder is just two Toyota Prius with 4 cylinders fused together .. does it make sense ? ;-)

That is the thing, AMD is currently on a multi-core/multi-threaded direction and the logical step is not to pursue single thread performance. Rather to aim for the single thread sweet spot and hit Intel with simply a much more efficient design. Now there may be prefetch and prediction enhancements but besides that; I really don't logically see anything different beyond the sharing in the front end and in the FPU. I admit I'm probably wrong but it is certainly something to ponder.

**generics_user** · 11-26-2010, 07:41 PM

Originally Posted by nn_step

That is the thing, AMD is currently on a multi-core/multi-threaded direction and the logical step is not to pursue single thread performance. Rather to aim for the single thread sweet spot and hit Intel with simply a much more efficient design. Now there may be prefetch and prediction enhancements but besides that; I really don't logically see anything different beyond the sharing in the front end and in the FPU. I admit I'm probably wrong but it is certainly something to ponder.

lol every single piece in the BD architecture is at least 2 times stronger than on bobcat, front end of a bd module is 4 times wider than a bobcat core; integer cores are 2 times more powerful (make that 4 times for each core in a BD module), FPU is more than doubled with more functions; memory controller is at least 3 times faster than on bobcat (dual 1866 vs single 1066 with additional improvement), L3 cache is present; more L2 cache, better process and platform

a bd module is significantly more than just 2 bobcat cores slapped together...

**nn_step** · 11-26-2010, 08:36 PM

Originally Posted by generics_user

lol every single piece in the BD architecture is at least 2 times stronger than on bobcat, front end of a bd module is 4 times wider than a bobcat core; integer cores are 2 times more powerful (make that 4 times for each core in a BD module), FPU is more than doubled with more functions; memory controller is at least 3 times faster than on bobcat (dual 1866 vs single 1066 with additional improvement), L3 cache is present; more L2 cache, better process and platform

a bd module is significantly more than just 2 bobcat cores slapped together...

ok let us look at what has already been published.

Note that the integer units are very similar but bulldozer includes more flexible load/store logic.
Also note that the front end of bulldozer is exactly double that of bobcat.

Things such as Cache size and implementation are not any argument against my idea; nor is the memory performance or implementation that radical of a change. [AMD has done it dozens of times with K8]

Also, I must mention that I am not suggesting that they just "slapped" two bobcats together. Rather that the integer units are rather bobcat derived and focus on performance per transistor rather than maximum performance.

And if I am correct, then the Bulldozer cores should not be more than 40% larger than K8. That is my prediction and my explanation for it; the validity can only be proven or disproved by AMD.

**saaya** · 11-26-2010, 10:44 PM

Originally Posted by Dresdenboy

Or think it this way: if 1MB more L2 (~6mm^2) buys you 10% higher average (!) IPC for a fraction of the power you need to use in the core (and a few mm^2) to achieve the same - which of those options would you choose?

See this review comparing - besides other CPUs - a 2.8GHz 2C PhII (downclocked) with 512kB L2 per core and 6MB L3 (for 2 cores!) and a 2.8GHz 2C AthII with 1MB L2 per core and no L3:
http://www.techpowerup.com/reviews/A..._X2_240/4.html
In several benchmarks the AthII is faster.

hmmm but does 1mb really boost ipc that much?
the highest gains from bigger caches in the past ive seen were around 5%...
funny btw, instead of adding cache to a design as a "midlife kicker" its now adding cores+cache

Originally Posted by Dimitriman

Saaya the server BD will accommodate standard ddr3 1600 while desktop will actually support ddr3 1866 which in dual channel will provide over 29 gb/s bandwidth. Thats more than the triple channel socket 1366 b/w of 25 gb/s while using its officially supported ddr3 1066.

source: http://www.techpowerup.com/134739/AM...-Revealed.html

hah, yeah but not many people run their mem below 2000, and if they do they usually run it at tight timings boosting bw :P
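For reference, the peak figures quoted above can be reproduced with a quick calculation. This is only a minimal sketch, assuming a 64-bit (8-byte) data bus per DDR3 channel and decimal gigabytes; the function name `peak_bandwidth_gb_s` is made up for illustration, and real-world throughput will be lower than these theoretical peaks.

```python
def peak_bandwidth_gb_s(transfer_rate_mt_s, channels, bus_width_bits=64):
    """Theoretical peak bandwidth in GB/s for a DDR memory configuration.

    transfer_rate_mt_s: rated transfers per second in millions (e.g. 1866 for DDR3-1866)
    channels:           number of independent memory channels
    bus_width_bits:     data width of one channel (assumed 64 bits here)
    """
    bytes_per_transfer = bus_width_bits // 8
    return transfer_rate_mt_s * 1e6 * bytes_per_transfer * channels / 1e9


# Dual-channel DDR3-1866 (the desktop BD figure from the post)
dual_1866 = peak_bandwidth_gb_s(1866, channels=2)

# Triple-channel DDR3-1066 (the socket 1366 figure from the post)
triple_1066 = peak_bandwidth_gb_s(1066, channels=3)

print(f"dual DDR3-1866:   {dual_1866:.1f} GB/s")    # about 29.9 GB/s
print(f"triple DDR3-1066: {triple_1066:.1f} GB/s")  # about 25.6 GB/s
```

Both results line up with the "over 29 gb/s" and "25 gb/s" numbers in the quote. Running triple channel at higher memory speeds, as saaya points out people do, changes the comparison.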

Originally Posted by informal

Saaya,each BD Module has two full cores,each having dedicated integer pipelines and can have either 1(per core) 128bit FMAC or 256bit FMAC,depending on the workload(single or multithreaded).So you see,the cores are not half as*ed and crippled in any way,you have everything that traditional core has plus more if you run single threaded workloads(2x FP throughput with ability to execute 2 FADDs or 2 FMULs in parallel,which is not possible today per single core,in any x86 core).Shared frontend is actually a good thing,being able to make the best use of fetch bandwidth available in either of the workloads scenarios (I bet this is the key component of the effectiveness of this design,apart from the schedulers of course).

hmmm so they CAN share their fpu but dont always? but then how does this affect the definition of a core? its still two cores then, just that they CAN in some scenarios work together... right?

Originally Posted by informal

As for memory, BD desktop officially supports 1866 standard, so one can assume it will unofficially be able to run with RAM modules clocked much higher than that. Add on top of that a 30% IMC throughput improvement, without adding a 3rd channel, and you get a massively stronger memory throughput, which is needed with 8 stronger than Thuban cores.

well the presentation somebody posted here only mentions a 30% boost while mentioning a 20% boost in supported memory clocks... call me cynical but in my experience with presentations like this it doesnt mean 20% clockspeed resulting bw boost PLUS 30% arch resulting boost... it means 20% clockspeed resulting bw boost plus 10% arch resulting boost... BEST CASE SCENARIO

Originally Posted by flyck

1600-1333 = 267

267/1333 = 0.20030007

so to me this is 20% ?

my bad
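flyck's arithmetic is easy to check. A tiny sketch (the helper name `percent_increase` is just for illustration):

```python
def percent_increase(old, new):
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


# DDR3-1333 to DDR3-1600, the step discussed above
step = percent_increase(1333, 1600)
print(f"1333 -> 1600: {step:.2f}%")  # roughly 20%, not 25%
```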

Originally Posted by JF-AMD

You sure spend a lot of time complaining about AMD. If you are so sure that our products are a flop, why bother?

cause i actually like amd, and i think you guys can do a lot better than what youve been doing in the past years...

Originally Posted by JF-AMD

As for the memory controller, it will have no problem feeding the cores. AMD has a long history of high performing memory controllers, you should look at the legacy of products that we have delivered to market. Then look at the fact that we are giving a 50% increase in throughput. The combination of those two alone are 2 great pieces of evidence that the memory controller will be just fine. You have no basis for your statement other than the desire to see AMD fail.

amds ddr2 and ddr3 imcs are

e... no offense, but seriously... the ddr2 imc offered no advantages over ddr1 whatsoever, and the ddr3 imc had clockspeed issues from day1 and is comparable to intels P35/X38 ddr3 memory controller which is 2, soon 3 generations behind.

the only memory controller amd ever had that rocked was the a64 imc, and, correct me if im wrong, was mostly the brainchild of a single brilliant engineer, which left the company, so turning that into a "history of achievements of amd" is a pretty far stretch dont you think?

Originally Posted by JF-AMD

As to the caches, weren't all of the intel fanboys raving about intel's cache sizes in the past and saying that AMD caches were too small? You can't have it both ways.

you mentioned that before... ive never seen people complain about "small" caches on amd chips... not here on xs at least, people here tend to compare performance, not specs...
and when people DID complain about cache sizes, you should know that they dont complain about cache sizes, what they really want to say is "i want a faster cpu"

Originally Posted by JF-AMD

The problem with HT is that while it doubles the number of threads that you can handle, it does not double the number of integer execution pipelines, or, more importantly the number of schedulers. If you have only one scheduler and only one pipeline, calling it two cores would really be misleading.

thats what i said...

Originally Posted by JF-AMD

Each bulldozer module has 2 integer schedulers and 2 sets of integer pipelines, which is why it is defined as 2 cores. This FUD is really getting tiring.

so then why did somebody mention that with BD the definition of cores changes? thats what started the whole thing, it seems it was a misunderstanding? it made me think 2 bulldozer cores can actually only work in tandem and not chew through data on their own, which would make it ONE core and not two. several people then hooked in and disagreed that such a core would still be two cores for some reason.
i guess it was misunderstandings all over the place

Originally Posted by JF-AMD

Exactly. Plus, we have 1 FPU per core. We can combine them to get to 256-bit AVX. Intel combines a 128-bit FPU and the integer pipeline in order to get to 256-bit AVX. So, technically, does that mean that Sandybridge is only a half core because its integer pipeline is shared with the FPU?

as long as each part CAN work fully independantly, no... as soon as two blocks are dependant on each other its ONE block... at least by my definition

thx for clarifying this

Originally Posted by JF-AMD

1. Everyone but you seems to be OK with this.
2. Clock speed percentage rarely equals actual throughput percentage. I can't believe that you don't know that.
3. Every single Bulldozer core has an FPU. 16 cores, 16 FPUs. Spreading lies like you are is not helping your credibility.

1. good for them... whats it supposed to be to me?
2. i cant believe you think i dont know that, i think we both know that you know that i know it ^^
3. not my intention... the one fpu per module was a misunderstanding, my bad...

Originally Posted by Sn0wm@n

what happened to saaya lately ????

thx for contributing to the discussion

Originally Posted by JF-AMD

No, they are completely different cores designed by completely different teams.

but they do share the same building blocks right? some of them at least... makes sense...

Originally Posted by Opteron146

IF you think that +30% is not significant, then all intel processors since Core2 are "unimpressive" ;-)

i thought we are talking about memory bandwidth, not ipc or overall performance...
if bd has a 30% ipc or overall performance boost (per core) over their current chips ill be deeply impressed :0

Originally Posted by Opteron146

I meant the same thing, AMD could recycle their triple channel C42 (or whatever the name is) platform, too. But even that costs money, not sure how many board manufacturers would build boards for it.
Well - it will depend heavily on the chip .. if it is fast enough, AMD would snatch lots of that small enthusiast market segment. But it will be hard against Intel's 22nm CPUs.

amd has a tripple channel platform?

wasnt it quad channel? its MCM so its two dualchannel chips on one package. and they ARE recycling that for servers and might use it for enthusiasts, though i doubt it... enthusiasts wont see benefits from the extra bandwidth i think... for enthusiasts latency is more important than bandwidth cause only few cores are actually used.

Originally Posted by Opteron146

Why should nVidia's problem be of AMD's concern ? If it would have been ATi - then you could make at least a small connection .. but nVidia ?

sigh... cause they all use solder balls to connect silicon to organic packages? :P

**FlanK3r** · 11-27-2010, 03:07 AM

sayya: no, IMC on Thuban is not

...2200 MHz is possible stable on Thubans. And Bulldozer IMC is from a different world, easy and better for you "just relax and wait for launch"

**Tomasis** · 11-27-2010, 04:09 AM

Originally Posted by saaya

enthusiasts wont see benefits from the extra bandwidth i think... for enthusiasts latency is more important than bandwidth cause only few cores are actually used.

exactly, thats why amd imc is so good.. due to low latency.. they like very tight mem timings. I dont put much weight on bandwidth, it is like comparing SSDs and harddrives without attention to load times.

Moderately clocked Athlon 2 has lower mem latency than highly Oced I7 9xx

cpu power is another thing.

**informal** · 11-27-2010, 04:26 AM

Originally Posted by saaya

hmmm but does 1mb really boost ipc that much?
the highest gains from bigger caches in the past ive seen were around 5%...
funny btw, instead of adding cache to a design as a "midlife kicker" its now adding cores+cache

BD vs Bobcat cache differences:
BD has 2MB of inclusive cache,running at core level clocks. Bobcat has 512KB per core,dedicated L2 cache,running at half the clock speed.
BD has 8MB of L3 running at 2.4+Ghz,victim cache(mostly exclusive),partitioned at 4 subcaches of 2MB each. Bobcat has no L3 at all.
As you can see,BD has many times more potent cache subsystem.Not only the clocks are 2x in the L2 part,but it's 4x bigger effectively(for single thread workload) and it has 8MB of L3 at its disposal.
As for the rest of the memory subsystem:
BD can do 2 128bit loads and 1 128bit store per cycle,per core.Bobcat can do 1 64bit load and 1 64bit store per cycle. BD and Bobcat have "full" OoO load/store capabilities. BD effectively has 2x better L/S BW versus Bobcat,per core.

hmmm so they CAN share their fpu but dont always? but then how does this affect the definition of a core? its still two cores then, just that they CAN in some scenarios work together... right?

The FPU can be dedicated per core or shared(by one or both cores). This means,that a 8 core Orochi can have 8 128bit FMAC units,meaning each core has its own FMAC.It also means that in single thread workloads one core can have 2 FMACs to itself,executing FMA ops or consecutive FADDs/FMULs which is not doable in today's designs.Versus Nehalem,in classical FP code,one 128bit FMAC is equal to 2 Nehalem "ports" of execution(per core Nehalem has 2 dedicated ports,one for ADD and one for MUL) in FP code,or even faster than that.If the serial code has consecutive FADDs/FMULs than one BD core can do 2 of these per clock,which is not possible in today's x86 designs.

well the presentation somebody posted here only mentions a 30% boost while mentioning a 20% boost in supported memory clocks... call me cynical but in my experience with presentations like this it doesnt mean 20% clockspeed resulting bw boost PLUS 30% arch resulting boost... it means 20% clockspeed resulting bw boost plus 10% arch resulting boost... BEST CASE SCENARIO

30% is solely due to IMC logic improvement,while additional 20% is for clock speed improvement .

as long as each part CAN work fully independantly, no... as soon as two blocks are dependant on each other its ONE block... at least by my definition

Yes ,each core can have one 128bit FMAC(doing work on FADD+FMUL in parallel).If the second FMAC is not used by the other core or AVX code is in use,one core uses both FMACs.

but they do share the same building blocks right? some of them at least... makes sense...

Just as two cars share the same building blocks, say engine, differential, gear box etc. The devil is in the details as you know yourself.

if bd has a 30% ipc or overall performance boost (per core) over their current chips ill be deeply impressed :0

30% was never mentioned IIRC. I don't think you will see that kind of a jump going to BD. 15% jump in int is more reasonable.Turbo comes on top of that.

amd has a tripple channel platform?

wasnt it quad channel? its MCM so its two dualchannel chips on one package. and they ARE recycling that for servers and might use it for enthusiasts, though i doubt it... enthusiasts wont see benefits from the extra bandwidth i think... for enthusiasts latency is more important than bandwidth cause only few cores are actually used.

One server platform featuring improved BD core and 10 cores per chip(non MCM) will have 3 channel IMC.In MCM though,the IMC will still be Quad Channel.

Originally Posted by nn_step

ok let us look at what has already been published.
http://pc.watch.impress.co.jp/img/pc...405/860/01.jpg

Note that the integer units are very similar but bulldozer includes more flexible load/store logic.
Also note that the front end of bulldozer is exactly double that of bobcat.

Things such as Cache size and implementation are not any argument against my idea; nor is the memory performance or implementation that radical of a change. [AMD has done it dozens of times with K8]

Also, I must mention that I am not suggesting that they just "slapped" two bobcats together. Rather that the integer units are rather bobcat derived and focus on performance per transistor rather than maximum performance.

And if I am correct, then the Bulldozer cores should not be more than 40% larger than K8. That is my prediction and my explanation for it; the validity can only be proven or disproved by AMD.

That diagram is a speculation by Hiroshige Goto. It's not based on real BD or Bobcat but on scarce data we have today. Also one Bobcat core can retire 2 instructions per cycle while each BD core can do 4 (be it int or fp). There's the huge difference. Also, Bobcat does not have unified integer scheduler like BD does. Bobcat's FP resources are 4x less (BD module Vs 2 Bobcat cores).

**Opteron146** · 11-27-2010, 04:40 AM

Originally Posted by saaya

i thought we are talking about memory bandwidth, not ipc or overall performance...
if bd has a 30% ipc or overall performance boost (per core) over their current chips ill be deeply impressed :0

Yes we talked about memory bandwidth, but I used the IPC as an example of how different your views are. You just stated it again... why is +30% IPC impressive, but not +30% Mem bandwidth ?
Of course IPC is the overall number, and bandwidth has just a smaller influence on e.g. IPC. However, that does not make AMD's bandwidth achievements less impressive. +30% at the same mem speed is nothing to bother about, it is "for free" ... be happy about it.

amd has a tripple channel platform?

wasnt it quad channel? its MCM so its two dualchannel chips on one package.

Yes triple channel - planned for 2012- and no, no MCM. Maybe re-read the last pages ;-).

though i doubt it... enthusiasts wont see benefits from the extra bandwidth i think... for enthusiasts latency is more important than bandwidth cause only few cores are actually used.

Well enthusiasts are crazy about Intel's triple channel ... not sure if they have really benefits, but as long as they are crazy it is fine

sigh... cause they all use solder balls to connect silicon to organic packages? :P

So because Toyota has problems with their Prius, I should be worried about my Corvette, too, because both cars have 4 wheels to connect with the road ? Nice joke

**JF-AMD** · 11-27-2010, 05:04 AM

Originally Posted by saaya

hmmm so they CAN share their fpu but dont always? but then how does this affect the definition of a core? its still two cores then, just that they CAN in some scenarios work together... right?

well the presentation somebody posted here only mentions a 30% boost while mentioning a 20% boost in supported memory clocks... call me cynical but in my experience with presentations like this it doesnt mean 20% clockspeed resulting bw boost PLUS 30% arch resulting boost... it means 20% clockspeed resulting bw boost plus 10% arch resulting boost... BEST CASE SCENARIO

Yes, they can share the FPU but they don't always. To get to 256-bit AVX, we share 2 FPU pipelines and Intel shares a 128-bit FPU with integer pipelines. In either case, to get to 256-bit AVX, you are sharing resources. We just choose to do it in a way that gives you 8 AVX units AND 16 integer pipelines. Intel does it in a way that gives you 8 AVX units and LESS THAN 8 integer units (because some of their integer resources are going to handle the AVX instructions.)

If you look at the slide we specifically call out 50% throughput increase, one portion from improvements to the IMC and one portion because of higher speed memory.

I realize that you might not want to believe it, but if the slide spells that out specifically (and our lawyers approve all slides) then you can't create your own interpretation of the slide to downplay the performance increase.
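The disagreement above comes down to how the two percentages combine. A rough sketch comparing the readings that appear in this thread; the helper names are hypothetical, and which reading matches AMD's actual measurement is not established here:

```python
def combine_additive(*gains_pct):
    """Simply sum the percentage gains (30% + 20% = 50%)."""
    return sum(gains_pct)


def combine_compounded(*gains_pct):
    """Multiply the gains as independent speedup factors."""
    factor = 1.0
    for gain in gains_pct:
        factor *= 1.0 + gain / 100.0
    return (factor - 1.0) * 100.0


imc_gain = 30.0    # IMC logic improvement, per informal and JF-AMD
clock_gain = 20.0  # DDR3-1333 -> DDR3-1600 clock step

# Reading 1: the two portions add up to the 50% called out on the slide
slide_total = combine_additive(imc_gain, clock_gain)

# Reading 2: if the two gains were independent multipliers instead
compounded_total = combine_compounded(imc_gain, clock_gain)

# Reading 3 (saaya): 30% is the total and 20% of it is clock speed;
# this is the architectural gain that would be left over if compounded
implied_arch_gain = ((1.0 + 30.0 / 100.0) / (1.0 + clock_gain / 100.0) - 1.0) * 100.0

print(f"additive:          {slide_total:.0f}%")
print(f"compounded:        {compounded_total:.0f}%")
print(f"saaya's leftover:  {implied_arch_gain:.1f}%")
```

As JF-AMD notes elsewhere in the thread, clock speed percentage rarely equals actual throughput percentage, so none of these readings should be taken as a measured result.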

**-Boris-** · 11-27-2010, 08:30 AM

Originally Posted by FlanK3r

sayya: no, IMC on Thuban is not

...2200 MHz is possible stable on Thubans. And Bulldozer IMC is from a different world, easy and better for you "just relax and wait for launch"

Frequency doesn't matter. The efficiency still sucks at AM3, even more than it sucked on AM2 when it was new.

**BeepBeep2** · 11-27-2010, 09:24 AM

Originally Posted by -Boris-

Frequency doesn't matter. The efficiency still sucks at AM3, even more than it sucked on AM2 when it was new.

Efficiency was awesome on AM2 compared to intel, what are you talking about?

**Chumbucket843** · 11-27-2010, 11:57 AM

Originally Posted by JF-AMD

Yes, they can share the FPU but they don't always. To get to 256-bit AVX, we share 2 FPU pipelines and Intel shares a 128-bit FPU with integer pipelines. In either case, to get to 256-bit AVX, you are sharing resources. We just choose to do it in a way that gives you 8 AVX units AND 16 integer pipelines. Intel does it in a way that gives you 8 AVX units and LESS THAN 8 integer units (because some of their integer resources are going to handle the AVX instructions.)

you might want to note that most code does not mix fp and int heavily when using sse. this means that integer resources are not important when running fp heavy code so it's ok to share those ports. with BD, pure 256b fp will be half speed relative to non shared fpu's. oh and you may want to brush up on intel's nomenclature so you can describe their uarch in a more understandable way(SB isnt really sharing). they use execution ports for multiple instruction types and it's been that way since core2.

**MrMojoZ** · 11-27-2010, 12:36 PM

Originally Posted by BeepBeep2

Efficiency was awesome on AM2 compared to intel, what are you talking about?

He has been ranting about AMD memory controllers for awhile, it has never made any sense.

**FlanK3r** · 11-27-2010, 12:45 PM

BORIS: did u not see the practical change from the IMC on 940 Deneb to todays Thuban? I think it is a big difference on the "same" architecture

**duploxxx** · 11-27-2010, 02:05 PM

Originally Posted by Opteron146

Yes we talked about memory bandwidth, but I used the IPC as an example of how different your views are. You just stated it again... why is +30% IPC impressive, but not +30% Mem bandwidth ?
Of course IPC is the overall number, and bandwidth has just a smaller influence on e.g. IPC. However, that does not make AMD's bandwidth achievements less impressive. +30% at the same mem speed is nothing to bother about, it is "for free" ... be happy about it.
Yes triple channel - planned for 2012- and no, no MCM. Maybe re-read the last pages ;-).

Well enthusiasts are crazy about Intel's triple channel ... not sure if they have really benefits, but as long as they are crazy it is fine

So because Toyota has problems with their Prius, I should be worried about my Corvette, too, because both cars have 4 wheels to connect with the road ? Nice joke

where did you find such proven details on Terramar and Sepang?

**Opteron146** · 11-27-2010, 02:31 PM

Originally Posted by duploxxx

where did you find such proven details on Terramar and Sepang?

http://phx.corporate-ir.net/External...xUeXBlPTM=&t=1

**JF-AMD** · 11-27-2010, 03:15 PM

Terramar and sepang are server products. don't try to draw conclusions about client products based on server.
