One BD module will probably end up the same size as or smaller than one SB core with L2. As long as AMD manages to match or stay close to Intel's total thread count (cores with SMT), they will do great.
Well, AMD basically destroyed the meaning of the word "module". It generally refers to smaller units that make up a core, with well-defined functionality (i.e. a NAND/NOR/XOR gate or an adder/multiplier). An AMD "module" is a core. From the looks of CMT, it is probably bigger than an SMT counterpart, assuming everything else about the cores is held constant. The thing is, there are more potential performance gains from CMT: it's basically hyperthreading but more aggressive, with the possibility of a better risk/reward ratio.
In the end, the nonsense arguments against hyperthreading will become the basis of Intel fanboys' nonsense arguments against CMT. A full circle.
question:
Is BD going to be compatible with the current MC dual-socket boards?
Crunch with us, the XS WCG team
The XS WCG team needs your support.
A good project with good goals.
Come join us, get that warm fuzzy feeling that you've done something good for mankind.
Exactly what I'm trying to get at: everything has its pros and cons.
It's simple: smaller and possibly faster.
It's complex: bigger but with better IPC.
In the end it's going to be pretty damn close in terms of power required for the performance, and the only thing left is pricing.
Yes. You can plug two Interlagos (16-core) beasts into your board next year.
I haven't seen where "module" generally refers to smaller units inside a core. AMD defines a module as an "optimized dual core". The "optimized" part refers to the fact that the (super beefed-up) front end is shared among dedicated execution units and is able to maximize the work done by catching peaks and valleys and cramming as many instructions into the same units as possible at a time.
Also, I'm not sure how you came to the conclusion that one module (2 cores) will probably be bigger than an SB core, which already in the Nehalem generation is huge compared to one 10h core inside Shanghai. Int execution units are relatively small compared to, say, the FPU or the rest of the core. The FPU will be rather large, but it will be very strong too, thanks to the FMAC capability that can dramatically improve performance. It's also shared, so it can behave in an SMT-like manner too.
One Llano core with L2 is sub-10mm2. That's a full dedicated core with no parts shared with other cores (dedicated L2 too). One BD module, which will have shared parts (front end, L2), even with increased logic (invested to improve per-core IPC), will probably land under 20mm2.
32 cores at your disposal, with an FMAC-capable FPU.
If the boards support minor HTT bumps, it should be doable. But who knows, maybe they will ship with clocks close to the ones you desire, so you wouldn't have to bother. But if that somehow happens, I suppose you will just up the desired target to 4+GHz, won't you?
Of course..This is XtremeSystems!
Slightly OT, but that dual MC system I have, although running "slow" at 1900MHz, is what you'd call a "good" system.
Max of 39-41C at 100% load in a 72-74F room.
Stable, no hiccups at all.
In terms of production in WCG it's equal to my dual Westmere quad E5640s with HT turned on.
Those have 16 threads at 3022MHz (a slight OC from the 2800 default).
Still tweaking those Intel quads..
http://lmgtfy.com/?q=define%3A+module
Quote: "AMD defines a module as an 'optimized dual core'..."
I don't care what AMD refers to it as. A core is completely independent. The ALU(s), which should NOT be the focus of an architecture, depend on the rest of the core to fetch, retire, and do everything else besides arithmetic. The front end has only been beefed up 33%; that's not enough to feed double the resources.
Quote: "Also I'm not sure how you come to the conclusion that one module (2 cores) will be probably bigger than a SB core..."
I'm not sure how you came to the conclusion that I was comparing Sandy Bridge to Bulldozer. I'm not even talking about BD here.
I simulated a processor with SMT and CMT; they are the same in every way except for the variable.
BTW, I'm not a fan of FMA. The tradeoffs are pretty ugly compared to a MADD ALU: you can make pure addition/multiplication faster but ruin the benefit of FMA, or you can ruin the performance of pure + or * but make MADD slightly faster and more accurate.
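The accuracy point comes down to rounding once instead of twice: a fused MADD rounds only the final a*b+c, while separate multiply-then-add also rounds the intermediate product. A minimal Python sketch (emulating the fused operation with exact rational arithmetic, purely for illustration) shows a case where the two disagree:

```python
from fractions import Fraction

def fused_madd(a, b, c):
    # emulate FMA: compute a*b + c exactly, then round once to a double
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30          # exact product is 1 - 2**-60
c = -1.0

separate = a * b + c        # a*b rounds up to 1.0 first, so the result is 0.0
fused = fused_madd(a, b, c) # single rounding keeps the -2**-60 residual
```

The separate path loses the tiny residual entirely, while the single-rounding path keeps it, which is exactly the accuracy side of the tradeoff being argued about here.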
Believe it or not, a good number of the customers that I have talked to had never considered turning off HT because they assumed it would give them more performance.
Nowhere that I have ever seen documented has Intel said that it could result in lower performance. So, intuitively, if the vendor is telling you that it doubles the threads and gives you more performance, why would you try to turn it off to see if you got better performance?
If it had been marketed differently I would buy your argument, but I have talked to a lot of customers who believe the opposite.
Google jokes are getting old, you know. AMD defines the module as it does since it is not a core per se. There are dedicated ALUs and FPUs inside the module, and the way it's organized is deliberate: maximize the throughput of such a structure while minimizing the core logic area investment. The front end will not be the bottleneck, as Chuck Moore has stated on several occasions.
As for the module size, I'm sorry, my mistake. I thought you were comparing BD and SB.
You're correct on this.
HT has its place, but not all the time.
Not that it's really relevant in this discussion, but don't turn on HT on a dual Westmere in Vantage or you'll see it come to a dead stop in the second CPU test, and I mean dead stop. No bluescreen, it just stops dead, and that's on a machine that will score over 57,000 CPU in Vantage with HT turned off.
I know, I've seen it with my own eyes.
HT is not a bad tech on a wide superscalar CPU. The P4's pipeline was too long, and HT wasn't really useful because all apps were single-threaded. But the advantage was getting some free CPU time for the system.
Now HT has been improved by Intel and is more efficient on a 4-issue core. And the negative effect is "canceled" by Turbo in benchmarks. But when the CPU is already fully loaded, Turbo in multithreaded loads doesn't give back the performance loss; it's just easier to use the marketing line that there is no longer any performance loss versus HT disabled.
CMT seems very different from HT. I think AMD had the Core 2's L2 cache in mind when they created CMT.
Share units that see very low utilization, or are used only a small fraction of the time, to get 2 cores with fewer transistors. But AMD didn't want to lose single-thread IPC, and they finally got their CPU to 4-wide.
I was disappointed when K10 didn't come with this.
A CMT module is 2 wider cores glued together at parts that are not heavily thermally stressed. And finally the L2 is shared by two cores, which I think is a good idea. It's going to improve L2 performance by reducing the bandwidth used between core caches, the biggest black mark on the current Phenom II architecture.
BD seems like a good piece of silicon. We just need to wait a bit more to see frequency, wattage, IPC, and most importantly how long it will take AMD to ramp up.
And how big will the L2 be? 2MB? 1MB? 512KB?
I showed you exactly what I intended: hyperthreading causing negative performance. Nice straw man, though. Here are some more.
ASM optimizations are usually not that useful. Compiler designers know way more about optimization than most programmers (it's a big part of their job): they know when x87 is faster than SSE, which FP instructions have higher accuracy, which instructions have the lowest latency, etc. Well-written source code with inline assembly can be faster than pure ASM, with fewer bugs, less development time, and better portability. MKL is still a great lib, though.
Well, technically, he showed you what you asked for, and then you dismissed it.
If you are arguing that with perfectly optimized code there will be no use for HT, then I agree with you. HT is designed to take advantage of gaps in the pipeline.
What happens when code gets more optimized, does HT become less relevant?
As code becomes more optimized, more physical cores will give you better throughput. As code becomes more optimized, HT would, theoretically, give you less throughput (or, more likely the same, but the main thread would be better and the HT thread would be worse, probably zero-sum game.)
Perfectly optimized code != no HT-exploitable gaps
It depends on what the code needs to do (dataset size, randomness of access, interdependency of accesses, etc), how wide (in different ways) the core is, etc. The "best" way to solve most problems will hardly ever result in a completely occupied core.
On an earlier topic, BD "modules" will have the same scheduling issues as Hyperthreading presents. Perhaps to a lesser extent, but both architectures have thread-pairwise-shared resources. Either the hardware, the OS, or the application needs to manage the "moving this thread to a completely unoccupied core (HT case) or module (BD case) would be better" issue for maximum performance.
(Consider two AVX(256) threads, or two independent threads that each would ideally have the whole module's L2 to itself, etc.)
Even with optimizations, some code is just inherently unfriendly. Take physics simulations, for example: nature is unpredictable, so temporal and spatial locality are gone, which means the cache is basically useless if your program uses more than 8MB.
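That locality argument can be sketched with a toy direct-mapped cache model (illustrative geometry, not any real CPU's): a streaming walk hits on almost every access, while random accesses over a working set far larger than the cache almost never do.

```python
import random

def hit_rate(addresses, num_lines=64, line_bytes=64):
    # toy direct-mapped cache: one resident block tag per line index
    tags = [None] * num_lines
    hits = 0
    for addr in addresses:
        block = addr // line_bytes
        idx = block % num_lines
        if tags[idx] == block:
            hits += 1
        else:
            tags[idx] = block          # miss: fill the line
    return hits / len(addresses)

sequential = list(range(65536))        # streaming walk, high spatial locality
rng = random.Random(42)
scattered = [rng.randrange(64 * 1024 * 1024) for _ in range(65536)]

# sequential: 1 miss per 64-byte line, then 63 hits -> 63/64 hit rate
# scattered: working set vastly exceeds the cache -> hit rate near zero
```

The sizes here are arbitrary; the point is only that once the working set dwarfs the cache and accesses are unpredictable, the hit rate collapses no matter how big the cache is.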
mstp2009,
Quote: "What a load of CRAP. There is so much wrong with these statements I just don't know where I should begin."
First, it wouldn't hurt for you to be civil... especially since you are wrong.
Quote: "First, the thread scheduler of the OS takes care of this, you NEVER have to code your application to say what 'core' you want it to run on."
First read this: http://www.intel.com/software/produc...d_affinity.htm
I just did a quick search for some common API sets to do this with. If you do even a small amount of research, you can find quite a bit of information on this subject.
Any decent threaded program will utilize these calls, since it is expensive to bounce threads from core to core. If you set your critical threads' affinity to a specific processor (or processors), you can achieve much better performance than relying on the OS to simply scattershot your threads all over the processors.
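On Linux the same idea is reachable even from Python via os.sched_setaffinity (the Intel link above covers the native APIs). A minimal sketch, assuming CPU 0 exists and is usable, pins the calling process to one core and then restores the original mask:

```python
import os

original = os.sched_getaffinity(0)   # 0 = the calling process

# pin to CPU 0 so the scheduler stops migrating our hot thread around
os.sched_setaffinity(0, {0})
assert os.sched_getaffinity(0) == {0}

# restore the original mask once the critical section is done
os.sched_setaffinity(0, original)
```

On Windows the equivalent would be SetThreadAffinityMask; either way the principle is the same: keep cache-hot threads where their data already is.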
Quote: "I run servers both with AMD Istanbuls and Intel Nehalems..."
As pointed out earlier, how about MC?
Now don't get me wrong, I don't believe the BS about SMT causing performance degradation either, but you have to give credit where credit is due. AMD's MC is quite a value.
Chumbucket843,
Quote: "well, AMD basically destroyed the meaning of the word module..."
First, surely you have something better to do than complain about AMD's use of the word "module"?
Second,
A module is not "a core". I can understand your objection, and have even asked a few questions myself about this architectural terminology. For instance, why doesn't Intel simply say that they have a 12 "core" product today?
The answer is in the performance scaling and amount of shared resources. According to AMD, each "core" will scale 80% as good as 2 complete cores scale today. The only real difference between "real" cores and BD "cores" is that BD cores share more resources than we have typically seen in existing architectures. Intel's SMT only gains 20-30% in the best situations.
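Putting those scaling figures side by side (80% for the second CMT core, and ~25% as a midpoint of the 20-30% SMT range; both numbers are the post's claims, not measurements):

```python
base = 1.0                       # throughput of one thread on a full core

# per the figures quoted above (assumptions, not benchmarks):
cmt_module = base * (1 + 0.80)   # second BD "core" adds ~80% -> 1.80x per module
smt_core   = base * (1 + 0.25)   # SMT sibling adds ~25%     -> 1.25x per core

advantage = cmt_module / smt_core   # ~1.44x more two-thread throughput for CMT
```

So on these numbers a module would deliver roughly 44% more two-thread throughput than an SMT core, which is the whole argument for spending the extra die area.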
CMT is nothing like SMT other than it attempts to share resources within a processor to execute more efficiently. AMD isn't calling their approach CMT (yet). I don't think they intend to at this point in time, but IMHO, it is pretty close to the definition.
This is a pretty good article: http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
I don't think that CMT is appreciably bigger than SMT considering the rather small portion of the die that a core currently takes up compared to cache. Additionally, it can afford to be somewhat larger since it scales the performance by a much bigger margin than does SMT.
Finally, scaling CMT to 3 or 4 cores should be of little difficulty for AMD in future iterations of BD. Not only would Intel gain little performance from moving to 4 way SMT from 2 way, but it would be very costly in development time and would still have the limitations and issues discussed in the article I linked to above.
Quote: "i dont care what AMD refers to it as. core is completely independent."
Really? What about cache? What about cache coherency? Processors haven't been "independent" since the advent of SMP.
Quote: "i simulated a processor with SMT and CMT. they are the same in every way except for the variable."
Not according to the research I have read. Perhaps you have other information?
Quote: "well written source code with inline assembly can be faster than pure ASM, with less bugs, development time and more portable."
Faster? Only if the ASM is poorly written. I agree with your other points, though. Most applications have little need for the speed and efficiency of ASM (with the exception of embedded applications, and even most of those can be done in C with enough efficiency to get by competitively).
madcho,
Quote: "CMT seem very different of HT. I think AMD used the core 2 L2 cache in mind when they created the CMT."
Indeed. I also think that AMD's L2 cache design will be pretty greatly changed in BD... at least I am hoping so.
Quote: "i think this is a good idea. It's gonna improve perfomance of L2 with the reduce of bandwith used between cores caches."
Agree. I also believe that the L2 latency will be decreased.
Fine. I must say, though, that Linpack wasn't exactly the app I had in mind when I asked for examples; I was thinking of commercial apps.
Regarding the "more" part, what's there to see?
- In visualization tasks you get 4% on average due to HT
- In rendering tasks (which do matter, btw) you get 21% on average
That's a very significant improvement which helps productivity a lot.
Quote: "ASM optimizations are usually not that useful. the compilers designers know way more about optimizations than programmers..."
??
Any respectable programmer is fully aware of instruction latencies and mix, and knows what to use for the intended target. When it comes to Intel, and especially a benchmark like Linpack (matrix multiplication is the most FP-friendly thing there is), rest assured they squeezed every ounce out of it.
His example is perfectly valid, but we cannot infer anything from it regarding HT's utility; in fact, his other examples clearly show that overall it brings nice gains.
Quote: "If you are arguing that with perfectly optimized code, there will be no use for HT, then I agree with you. HT is designed to take advantage of gaps in the pipeline."
True, and Linpack falls under perfectly optimized code.
Quote: "What happens when code gets more optimized, does HT become less relevant?"
If anything, I'd say code is becoming less optimized. Optimization means a lot of hard work and man-hours; why not simply let the compiler do what it can and just make sure it works? The drive for cross-platform portability and the increasing complexity (you care about getting it done and working reliably even if it's not well optimized) mean less and less code is truly optimized to use all the HW resources.
Quote: "As code becomes more optimized, more physical cores will give you better throughput. As code becomes more optimized, HT would, theoretically, give you less throughput (or, more likely the same, but the main thread would be better and the HT thread would be worse, probably zero-sum game.)"
Secondly, there is no such thing as a "main thread" and an "HT thread". They are both in flight in the core at the same time. And your "zero-sum game" is easily disproved by an abundance of examples of increased throughput. Hell, some CPUs like Niagara use 8 threads per core, and those have nowhere near the resources of a Nehalem core. Yet even with the per-thread performance of a Pentium 2, they offer significant throughput and a sensible business proposition.
As for "more physical cores giving better throughput", that's an axiom. The question, however, is what are the costs?
- HT takes little die area and power, but more validation time.
- An extra physical core takes significantly more die area and power; validation is the same.
Secondly, HT hides the ever-growing gap between CPU speed (3-4GHz) and memory clock (266MHz for the fastest). Another core doesn't help at all; it worsens the problem.