Originally Posted by JF-AMD
It's about cache efficiency. Today, when there is a cache miss, the thread stalls while the core waits for the data to be fetched from memory. While that thread is stalled, SMT lets the core pull in instructions from a second thread and run those, then pick the first thread back up once its data arrives from memory.
I know that is a REALLY simplistic description, but it should help you visualize what's going on.
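To make that concrete, here is a minimal C sketch (my own illustration, not something from the post) of the kind of workload SMT is built for: a pointer chase through a randomly shuffled buffer, where every load depends on the previous one and almost every load misses cache. Run alone, the core spends most of its time stalled; run as two copies on sibling logical CPUs of one physical core, the second thread can soak up those stall cycles. The choice of logical CPUs 0 and 1 as SMT siblings is an assumption; on Linux, check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list before trusting the numbers.

[code]
/*
 * Hypothetical demo (not from the original post): pointer chasing that
 * misses cache on nearly every load, i.e. the stall-heavy workload SMT
 * is meant to hide.  Assumes Linux + glibc and that logical CPUs 0 and 1
 * are SMT siblings of the same physical core.
 *
 * Build: gcc -O2 -pthread smt_chase.c -o smt_chase
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NODES (1u << 22)   /* ~4M nodes x 64 B = 256 MiB, far bigger than any cache */
#define STEPS (1u << 25)   /* dependent loads per traversal */

typedef struct { size_t next; char pad[56]; } node_t;   /* one node per 64-byte line */

static double now_s(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Link the nodes into one random cycle so each load depends on the last. */
static node_t *build_chain(void) {
    node_t *n    = malloc((size_t)NODES * sizeof *n);
    size_t *perm = malloc((size_t)NODES * sizeof *perm);
    if (!n || !perm) { perror("malloc"); exit(1); }
    for (size_t i = 0; i < NODES; i++) perm[i] = i;
    for (size_t i = NODES - 1; i > 0; i--) {             /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < NODES; i++)
        n[perm[i]].next = perm[(i + 1) % NODES];
    free(perm);
    return n;
}

struct arg { node_t *chain; int cpu; };

static void *chase(void *p) {
    struct arg *a = p;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(a->cpu, &set);
    sched_setaffinity(0, sizeof set, &set);              /* pin to one logical CPU */
    volatile size_t idx = 0;
    for (size_t i = 0; i < STEPS; i++)
        idx = a->chain[idx].next;                        /* load misses; core sits idle */
    return NULL;
}

int main(void) {
    struct arg a0 = { build_chain(), 0 };
    struct arg a1 = { build_chain(), 1 };
    pthread_t t1;
    double start;

    start = now_s();
    chase(&a0);                                          /* one chase, one hardware thread */
    printf("1 thread : %.2f s\n", now_s() - start);

    start = now_s();                                     /* two chases on sibling logical CPUs */
    pthread_create(&t1, NULL, chase, &a1);
    chase(&a0);
    pthread_join(t1, NULL);
    printf("2 threads: %.2f s (twice the work; compare against 2x the 1-thread time)\n",
           now_s() - start);

    free(a0.chain);
    free(a1.chain);
    return 0;
}
[/code]

If CPUs 0 and 1 really are SMT siblings, the two-thread run should come in well under twice the single-thread time, because the second thread issues during the first thread's miss stalls. If they land on separate physical cores you are just measuring ordinary multicore scaling instead.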
HT originally came about in the P4 because it had a very long pipeline, so a single cache miss carried a big penalty. But as Intel shortened the pipeline (e.g. Core 2), they tossed out HT because they no longer needed that band-aid.
If you take that same logic and extend it: as a microarchitecture, you should always be striving to reduce cache misses as much as possible. As you reduce the misses, you increase the efficiency. That is good. But the cache misses are what give you the "opportunity" that SMT needs to work. So as primary core efficiency goes up, the SMT efficiency generally goes down.
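That tradeoff is easy to see with a toy utilization model (my own illustration; these numbers are not from the post): assume a lone thread keeps the core busy a fraction (1 - s) of cycles, where s is the fraction lost to miss stalls, and that a second SMT thread can only fill the idle cycles. The ideal combined throughput is then capped at min(1, 2*(1 - s)), so the SMT speedup collapses toward 1x as the miss rate drops:

[code]
/*
 * Toy model (illustrative only): a lone thread keeps the core busy
 * (1 - s) of the time, where s is the fraction of cycles lost to miss
 * stalls, and a second SMT thread can only fill those idle cycles.
 * Ideal combined throughput is therefore capped at min(1, 2*(1 - s)).
 */
#include <stdio.h>

int main(void) {
    printf("stall fraction s | 1-thread utilization | ideal SMT speedup\n");
    for (int pct = 60; pct >= 10; pct -= 10) {
        double s   = pct / 100.0;
        double one = 1.0 - s;                              /* lone thread's useful throughput */
        double two = (2.0 * one < 1.0) ? 2.0 * one : 1.0;  /* two threads, capped by the core */
        printf("      %.2f       |        %.2f          |      %.2fx\n", s, one, two / one);
    }
    return 0;
}
[/code]

Under this model, at a 50 percent stall fraction the second thread nearly doubles throughput, while at 10 percent it buys only about 11 percent, which is exactly the "core efficiency up, SMT payoff down" trade described above.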
The ability to increase parallelism has more to do with the OS schedulers than anything else. OSes deployed 3 years ago were written when single cores ruled the earth. OSes deployed today were focused more on dual core, and to a small extent quad core, so they do a better job of scheduling. The OSes you will use in 3 years will do much better than today's. It is all a progression. Saying you don't need more cores in the future because today's OSes don't utilize all of the cores is like saying that a 1TB drive is too big. Give people enough storage space and they will fill it. Give them enough cores and they will figure out how to use them.
My notebook probably has 50 different services running (and 3-4 actual programs). There is always a use for more cores; the OS just needs to come along for the ride - and that will be happening.