One BD module will probably end up the same size as or smaller than one SB core with L2. As long as AMD manages to match or stay close to Intel's total thread count (cores with SMT), they will do great.
Well, AMD basically destroyed the meaning of the word "module". It generally refers to smaller units that make up a core, with well-defined functionality (i.e. a NAND/NOR/XOR gate or an adder/multiplier). An AMD "module" is a core. From the looks of CMT, it is probably bigger than an SMT counterpart, assuming everything else about the cores is held constant. The thing is, there are more potential performance gains from CMT: it's basically hyperthreading but more aggressive, with the possibility of a better risk/reward ratio.
In the end, the nonsense arguments against hyperthreading will become the basis of Intel fanboys' nonsense arguments against CMT. A full circle.
question:
Is BD going to be compatible with the current MC dual-socket boards?
Crunch with us, the XS WCG team
The XS WCG team needs your support.
A good project with good goals.
Come join us, get that warm fuzzy feeling that you've done something good for mankind.
Exactly what I'm trying to get at: everything has its pros and cons.
It's simple: smaller and possibly faster.
It's complex: bigger but with better IPC.
In the end it's going to be pretty damn close in terms of power required for the performance, and the only thing left is pricing.
Yes. You can plug two Interlagos (16-core) beasts into your board next year.
I haven't seen where "module" generally refers to smaller units inside a core. AMD defines a module as an "optimized dual core". The "optimized" part refers to the fact that the (super beefed-up) front end is shared among dedicated execution units and is able to maximize the work done by catching peaks and valleys and cramming as many instructions into the same units as possible at a time.
Also, I'm not sure how you came to the conclusion that one module (2 cores) will probably be bigger than an SB core, which already in the Nehalem generation is huge compared to one 10h core inside Shanghai. Int execution units are relatively small compared to, say, the FPU or the rest of the core. The FPU will be rather large, but it will be very strong too, thanks to the FMAC capability that can dramatically improve performance. It's also shared, so it can behave in an SMT-like manner too.
One Llano core with L2 is sub-10mm2. That's a full dedicated core with no parts shared with other cores (dedicated L2 too). One BD module, which will have shared parts (front end, L2), even with increased logic (invested to improve per-core IPC), will probably land under 20mm2.
32 cores at your disposal, with an FMAC-capable FPU.
If the boards support minor HTT bumps, it should be doable. But who knows, maybe they will ship with clocks close to the ones you desire, so you wouldn't have to bother. But if that somehow happens, I suppose you will just up the desired target to 4+GHz, won't you?
Of course..This is XtremeSystems!
Slightly OT, but that dual MC system I have, although running "slow" at 1900MHz, is what you'd call a "good" system.
Max of 39-41C at 100% load in a 72-74F room.
Stable, no hiccups at all.
In terms of production in WCG it's equal to my dual Westmere quad E5640s with HT turned on.
Those have 16 threads at 3022MHz (a slight OC from the 2800 default).
Still tweaking those Intel quads..
http://lmgtfy.com/?q=define%3A+module
Quote: "AMD defines a module as an 'optimized dual core'..."
I don't care what AMD refers to it as. A core is completely independent. The ALU(s), which should NOT be the focus of an architecture, depend on the rest of the core to fetch, retire, and do everything else besides arithmetic. The front end has only been beefed up 33%; that's not enough to feed double the resources.
Quote: "Also I'm not sure how you come to the conclusion that one module (2 cores) will be probably bigger than a SB core..."
I'm not sure how you came to the conclusion that I was comparing Sandy Bridge to Bulldozer. I'm not even talking about BD here.
I simulated a processor with SMT and CMT; they are the same in every way except for the variable.
BTW, I'm not a fan of FMA. The tradeoffs are pretty ugly compared to a MADD ALU: you can make pure addition/multiplication faster but ruin the benefit of FMA, or you can ruin the performance of pure + or * but make MADD slightly faster and more accurate.
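The accuracy point comes down to rounding once instead of twice: a fused MADD rounds only the final a*b+c, while separate multiply-then-add also rounds the intermediate product. A minimal Python sketch (emulating the fused operation with exact rational arithmetic, purely for illustration) shows a case where the two disagree:

```python
from fractions import Fraction

def fused_madd(a, b, c):
    # emulate FMA: compute a*b + c exactly, then round once to a double
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30          # exact product is 1 - 2**-60
c = -1.0

separate = a * b + c        # a*b rounds up to 1.0 first, so the result is 0.0
fused = fused_madd(a, b, c) # single rounding keeps the -2**-60 residual
```

The separate path loses the tiny residual entirely, while the single-rounding path keeps it, which is exactly the accuracy side of the tradeoff being argued about here.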
Believe it or not, a good number of the customers that I have talked to had never considered turning off HT because they assumed it would give them more performance.
Nowhere that I have ever seen documented has Intel said that it could result in lower performance. So, intuitively, if the vendor is telling you that it doubles the threads and gives you more performance, why would you try to turn it off to see if you got better performance?
If it had been marketed differently I would buy your argument, but I have talked to a lot of customers who believe the opposite.
Google jokes are getting old, you know. AMD defines the module as it does since it is not a core per se. There are dedicated ALUs and FPUs inside the module, and the way it's organized is deliberate: maximize the throughput of such a structure while minimizing the core logic area investment. The front end will not be the bottleneck, as Chuck Moore has stated on several occasions.
As for the module size, I'm sorry, my mistake. I thought you were comparing BD and SB.
You're correct on this.
HT has its place, but not all the time.
Not that it's really relevant in this discussion, but don't turn on HT on a dual Westmere in Vantage or you'll see it come to a dead stop in the second CPU test, and I mean dead stop. No bluescreen, it just stops dead, and that's on a machine that will score over 57,000 CPU in Vantage with HT turned off.
I know, I've seen it with my own eyes.
HT is not a bad tech on a wide superscalar CPU. The P4's pipeline was too long, and HT wasn't really useful because all apps were single-threaded. But the advantage was getting some free CPU time for the system.
Now HT has been improved by Intel and is more efficient on a 4-issue core. And the negative effect is "canceled" by Turbo in benchmarks. But when the CPU is already fully loaded, Turbo in multithreaded loads doesn't give back the performance loss; it's just easier to use the marketing line that there is no longer any performance loss versus HT disabled.
CMT seems very different from HT. I think AMD had the Core 2's L2 cache in mind when they created CMT.
Share units that see very low utilization, or are used only a small fraction of the time, to get 2 cores with fewer transistors. But AMD didn't want to lose single-thread IPC, and they finally got their CPU to 4-wide.
I was disappointed when K10 didn't come with this.
A CMT module is 2 wider cores glued together at parts that are not heavily thermally stressed. And finally the L2 is shared by two cores, which I think is a good idea. It's going to improve L2 performance by reducing the bandwidth used between core caches, the biggest black mark on the current Phenom II architecture.
BD seems like a good piece of silicon. We just need to wait a bit more to see frequency, wattage, IPC, and most importantly how long it will take AMD to ramp up.
And how big will the L2 be? 2MB? 1MB? 512KB?
I showed you exactly what I intended: hyperthreading causing negative performance. Nice straw man, though. Here are some more.
ASM optimizations are usually not that useful. Compiler designers know way more about optimization than most programmers (it's a big part of their job): they know when x87 is faster than SSE, which FP instructions have higher accuracy, which instructions have the lowest latency, etc. Well-written source code with inline assembly can be faster than pure ASM, with fewer bugs, less development time, and better portability. MKL is still a great lib, though.
Well, technically, he showed you what you asked for, and then you dismissed it.
If you are arguing that with perfectly optimized code there will be no use for HT, then I agree with you. HT is designed to take advantage of gaps in the pipeline.
What happens when code gets more optimized, does HT become less relevant?
As code becomes more optimized, more physical cores will give you better throughput. As code becomes more optimized, HT would, theoretically, give you less throughput (or, more likely the same, but the main thread would be better and the HT thread would be worse, probably zero-sum game.)
Perfectly optimized code != no HT-exploitable gaps
It depends on what the code needs to do (dataset size, randomness of access, interdependency of accesses, etc), how wide (in different ways) the core is, etc. The "best" way to solve most problems will hardly ever result in a completely occupied core.
On an earlier topic, BD "modules" will have the same scheduling issues as Hyperthreading presents. Perhaps to a lesser extent, but both architectures have thread-pairwise-shared resources. Either the hardware, the OS, or the application needs to manage the "moving this thread to a completely unoccupied core (HT case) or module (BD case) would be better" issue for maximum performance.
(Consider two AVX(256) threads, or two independent threads that each would ideally have the whole module's L2 to itself, etc.)
Even with optimizations, some code is just inherently unfriendly. Take physics simulations, for example: nature is unpredictable, so temporal and spatial locality are gone, which means the cache is basically useless if your program uses more than 8MB.
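That locality argument can be sketched with a toy direct-mapped cache model (illustrative geometry, not any real CPU's): a streaming walk hits on almost every access, while random accesses over a working set far larger than the cache almost never do.

```python
import random

def hit_rate(addresses, num_lines=64, line_bytes=64):
    # toy direct-mapped cache: one resident block tag per line index
    tags = [None] * num_lines
    hits = 0
    for addr in addresses:
        block = addr // line_bytes
        idx = block % num_lines
        if tags[idx] == block:
            hits += 1
        else:
            tags[idx] = block          # miss: fill the line
    return hits / len(addresses)

sequential = list(range(65536))        # streaming walk, high spatial locality
rng = random.Random(42)
scattered = [rng.randrange(64 * 1024 * 1024) for _ in range(65536)]

# sequential: 1 miss per 64-byte line, then 63 hits -> 63/64 hit rate
# scattered: working set vastly exceeds the cache -> hit rate near zero
```

The sizes here are arbitrary; the point is only that once the working set dwarfs the cache and accesses are unpredictable, the hit rate collapses no matter how big the cache is.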
mstp2009,
Quote: "What a load of CRAP. There is so much wrong with these statements I just don't know where I should begin."
First, it wouldn't hurt for you to be civil... especially since you are wrong.
Quote: "First, the thread scheduler of the OS takes care of this, you NEVER have to code your application to say what 'core' you want it to run on."
First read this: http://www.intel.com/software/produc...d_affinity.htm
I just did a quick search for some common API sets to do this with. If you do even a small amount of research, you can find quite a bit of information on this subject.
Any decent threaded program will utilize these calls, since it is expensive to bounce threads from core to core. If you set your critical threads' affinity to a specific processor (or processors), you can achieve much better performance than relying on the OS to simply scattershot your threads all over the processors.
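On Linux the same idea is reachable even from Python via os.sched_setaffinity (the Intel link above covers the native APIs). A minimal sketch, assuming CPU 0 exists and is usable, pins the calling process to one core and then restores the original mask:

```python
import os

original = os.sched_getaffinity(0)   # 0 = the calling process

# pin to CPU 0 so the scheduler stops migrating our hot thread around
os.sched_setaffinity(0, {0})
assert os.sched_getaffinity(0) == {0}

# restore the original mask once the critical section is done
os.sched_setaffinity(0, original)
```

On Windows the equivalent would be SetThreadAffinityMask; either way the principle is the same: keep cache-hot threads where their data already is.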
Quote: "I run servers both with AMD Istanbuls and Intel Nehalems..."
As pointed out earlier, how about MC?
Now don't get me wrong, I don't believe the BS about SMT causing performance degradation either, but you have to give credit where credit is due. AMD's MC is quite a value.
Chumbucket843,
Quote: "well, AMD basically destroyed the meaning of the word module..."
First, surely you have something better to do than complain about AMD's use of the word "module"?
Second,
A module is not "a core". I can understand your objection, and have even asked a few questions myself about this architectural terminology. For instance, why doesn't Intel simply say that they have a 12 "core" product today?
The answer is in the performance scaling and amount of shared resources. According to AMD, each "core" will scale 80% as good as 2 complete cores scale today. The only real difference between "real" cores and BD "cores" is that BD cores share more resources than we have typically seen in existing architectures. Intel's SMT only gains 20-30% in the best situations.
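Putting those scaling figures side by side (80% for the second CMT core, and ~25% as a midpoint of the 20-30% SMT range; both numbers are the post's claims, not measurements):

```python
base = 1.0                       # throughput of one thread on a full core

# per the figures quoted above (assumptions, not benchmarks):
cmt_module = base * (1 + 0.80)   # second BD "core" adds ~80% -> 1.80x per module
smt_core   = base * (1 + 0.25)   # SMT sibling adds ~25%     -> 1.25x per core

advantage = cmt_module / smt_core   # ~1.44x more two-thread throughput for CMT
```

So on these numbers a module would deliver roughly 44% more two-thread throughput than an SMT core, which is the whole argument for spending the extra die area.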
CMT is nothing like SMT other than it attempts to share resources within a processor to execute more efficiently. AMD isn't calling their approach CMT (yet). I don't think they intend to at this point in time, but IMHO, it is pretty close to the definition.
This is a pretty good article: http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
I don't think that CMT is appreciably bigger than SMT considering the rather small portion of the die that a core currently takes up compared to cache. Additionally, it can afford to be somewhat larger since it scales the performance by a much bigger margin than does SMT.
Finally, scaling CMT to 3 or 4 cores should be of little difficulty for AMD in future iterations of BD. Not only would Intel gain little performance from moving to 4 way SMT from 2 way, but it would be very costly in development time and would still have the limitations and issues discussed in the article I linked to above.
Quote: "i dont care what AMD refers to it as. core is completely independent."
Really? What about cache? What about cache coherency? Processors haven't been "independent" since the advent of SMP.
Quote: "i simulated a processor with SMT and CMT. they are the same in every way except for the variable."
Not according to the research I have read. Perhaps you have other information?
Quote: "well written source code with inline assembly can be faster than pure ASM, with less bugs, development time and more portable."
Faster? Only if the ASM is poorly written. I agree with your other points, though. Most applications have little need for the speed and efficiency of ASM (with the exception of embedded applications, and even most of those can be done in C with enough efficiency to get by competitively).
madcho,
Quote: "CMT seem very different of HT. I think AMD used the core 2 L2 cache in mind when they created the CMT."
Indeed. I also think that AMD's L2 cache design will be pretty greatly changed in BD... at least I am hoping so.
Quote: "i think this is a good idea. It's gonna improve perfomance of L2 with the reduce of bandwith used between cores caches."
Agree. I also believe that the L2 latency will be decreased.
Fine. I must say, though, that Linpack wasn't exactly the app I had in mind when I asked for examples; I was thinking of commercial apps.
Regarding the "more" part, what's there to see?
- In visualization tasks you get 4% on average due to HT
- In rendering tasks (which do matter, btw) you get 21% on average
That's a very significant improvement which helps productivity a lot.
Quote: "ASM optimizations are usually not that useful. the compilers designers know way more about optimizations than programmers..."
??
Any respectable programmer is fully aware of instruction latencies and mix, and knows what to use for the intended target. When it comes to Intel, and especially a benchmark like Linpack (matrix multiplication is the most FP-friendly thing there is), rest assured they squeezed every ounce out of it.
His example is perfectly valid, but we cannot infer anything from it regarding HT's utility; in fact, his other examples clearly show that overall it brings nice gains.
Quote: "If you are arguing that with perfectly optimized code, there will be no use for HT, then I agree with you. HT is designed to take advantage of gaps in the pipeline."
True, and Linpack falls under perfectly optimized code.
Quote: "What happens when code gets more optimized, does HT become less relevant?"
If anything, I'd say code is becoming less optimized. Optimization means a lot of hard work and man-hours; why not simply let the compiler do what it can and just make sure it works? The drive for cross-platform portability and the increasing complexity (you care about getting it done and working reliably even if it's not well optimized) mean less and less code is truly optimized to use all the HW resources.
Quote: "As code becomes more optimized, more physical cores will give you better throughput. As code becomes more optimized, HT would, theoretically, give you less throughput (or, more likely the same, but the main thread would be better and the HT thread would be worse, probably zero-sum game.)"
Secondly, there is no such thing as a "main thread" and an "HT thread". They are both in flight in the core at the same time. And your "zero-sum game" is easily disproved by an abundance of examples of increased throughput. Hell, some CPUs like Niagara use 8 threads per core, and those have nowhere near the resources of a Nehalem core. Yet even with the per-thread performance of a Pentium 2, they offer significant throughput and a sensible business proposition.
As for "more physical cores giving better throughput", that's an axiom. The question, however, is what are the costs?
- HT takes little die area and power, but more validation time.
- An extra physical core takes significantly more die area and power; validation is the same.
Secondly, HT hides the ever-growing gap between CPU speed (3-4GHz) and memory clock (266MHz for the fastest). Another core doesn't help at all; it worsens the problem.