You guys should read who posted that, it's none other than me.
the BD 20 questions are all broken up sections, so we get the next set answered very soon it seems
Since we're comparing with Sandy Bridge here, guys... is there even an 8-core counterpart for SB? You make it sound like it's easy to fit 8 cores. AMD just did, and even then they went the module route to reduce die area.
Mitch Alsup says that when he left AMD, Bulldozer was looking at a performance *decrease* of 5% from the microarchitecture slimming, together with a hoped-for 20-25% frequency increase from the pipeline lengthening.
Even assuming *perfect* perf scaling with clock, that's roughly a 15-20% increase over Phenom II.
http://groups.google.de/group/comp.a...14f6049?hl=de#
Quote:
When I left, BD was supposed to be 20-25% faster frequency wise, and
loose a little architectural figure (5%-ish) of merit due to the
microarchitecture.
So they are really counting on speed-racer to bring the performance increase.
My guess would be 17 stages. A speed racer in a modern process is arguably going to be more efficient than a brainiac, as long as you don't go over the top with pipelining. Increasing IPC has much, much worse diminishing returns, excluding multicore.
More interesting bits on the pipeline changes:
Most of what got cut was cut to enable the 12-gate pipe (if indeed
they did achieve that.) In Athlon/Opteron, one can forward a byte,
word, double, or quad from any of the 5 results to any operand of any
6 integer computation units {ALU, AGU}. If BD can't (or couldn't when
I left) forward anything to anywhere, and eats a little AFoM because
of this. This probably saved 2 real gate delays. Lopping off the extra
ALU, and a few other things saves another gate and we are then within
spitting distance (1-gate) of the desired 12-gate pipe in the integer
pipe. More lopping occured in the L1cache pipe to reach the cycle time
goal.
Bobcat? come on AMD, you had better names.
Why is no one else wondering about Bulldozer's Decode details?
He left right around the time BD 1 was canceled (end of 2007) and BD 2, the one that is coming out 2 years later, was starting to take shape.
BD 2 is not new all around, since it's naturally based on the BD 1 version that was supposed to come out at 45nm. I suspect that, as in the case of Barcelona, they were power limited at 45nm and performance was not where they wanted it. So they went with an improved core, done on a smaller node, and delayed it 2 years (2009 -> 2011). This gives them more room for improvements at the core level and more clocks, all within the same power envelope. I expect a 15-20% core-level improvement + 30% in clocks.
It's a 4+1 (branch fusion supported) decoder at the front end, with a so-called "accelerate mode" if certain conditions are met. AMD is not disclosing anything about this particular feature, but essentially it increases the decode rate by some unknown factor.
So, not to talk out of school, but I did ask one of our design engineers about the ability of the shared front end to keep two integer cores fed and he had absolutely no concern because of things that are done to improve the front end.
Can't say any more beyond that because a.) it is not public info and b.) I don't really know enough about how those things work to accurately describe them.
In my mind this is not a concern of the engineering team. After all it is a completely new design. If they had taken the front end off of an existing product it might be more of an issue, but as I understand it, that has not happened.
Given that Phenom has a 32 BYTE pick buffer and a 408bit fetch, I see that as highly unlikely.
Added to the fact that Bobcat has a 22 byte decode
But without more details, optimizing the decode rate is impossible.
For example, can a single thread take up the entire decode unit for a couple clock cycles if the other thread is sleeping?
Could you find out whether each thread has its own pick buffer or whether it is shared?
And if so, what size(s)?
This almost seems like a single module may possibly use both integer units along with the FPU when executing a single thread. If this is the case, single threaded performance on BD will not be a weak point at all :yepp:. I remember that old marketing slide saying BD would have the highest single threaded performance ever. It better be true dammit.
A single thread can occupy all the shared resources in the module. The decoder and the whole front end, with the extra beefed-up prefetch, are shared. The FPU is shared.
Integer cores can't "combine" to work on a single integer thread, but one integer core can use the whole FPU to itself. The one FPU can also be used a la SMT by the 2 integer cores. What is shared in the module can be used by either integer core.
AMD said the 16-core Bulldozer has 33% more cores than Magny-Cours and is 50% faster than Magny-Cours.
Divide both parts of the statement by two and you get:
The 8-core Bulldozer has 33% more cores than Thuban and is 50% faster than Thuban.
Take cinebench result and approximate:
http://i9.fastpic.ru/big/2010/0829/6...ba6f078563.png
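Halving the core counts does not change the per-core picture, since only the ratio matters. A minimal Python sketch of that arithmetic, assuming "50% faster" means total throughput on a fully threaded load; the function name is made up for illustration and nothing here comes from AMD:

```python
def per_core_gain(old_cores, new_cores, total_speedup):
    """Throughput per core of the new chip relative to the old one."""
    core_ratio = new_cores / old_cores
    return total_speedup / core_ratio

# 16-core Bulldozer vs 12-core Magny-Cours, quoted as 50% faster overall
server = per_core_gain(12, 16, 1.5)
# The halved version from the post: 8-core Bulldozer vs 6-core Thuban
desktop = per_core_gain(6, 8, 1.5)

print(f"per-core gain, 16 vs 12 cores: {server:.3f}")
print(f"per-core gain,  8 vs  6 cores: {desktop:.3f}")
```

Both cases work out to about 1.125, i.e. roughly 12.5% more throughput per core with clock speed included. On its own that says nothing about IPC.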
Now realize that the octal-core, 16-thread LGA-2011 competition will also be about 50% faster than the current top end, and well... nothing will have changed, relatively speaking. ;)
Who said the 8-core Bulldozer is AMD's future top end? AMD will have a 16-core CPU, twice as fast.
Yeah, you're right! :D
As far as we know, the 33% more cores and 50% faster thing is for server loads (honestly, what server load runs at 100% all the time?). It's conservative too, it's the bare minimum. I believe that's what JF-AMD said. Correct me if I'm wrong.
EDIT: Here it is.
Oh, terrace215 was posting at that thread too.
Again, we are all asking this. Is there an 8-core SB for desktops? 8-core is the new 4-core for AMD. It seems like 6-cores and 4-cores SB are going up against 8-core BD, which probably replaces the 6-cores and the 4-cores of AMD today.
I'm expecting good things.
AMD's desktop BD will offer a 4-module, aka 8-core, version.
Intel's desktop SB will offer 8 cores as far as I know (that's what I was told in March). I don't know if the 10c will be for consumers.
SB-EX for consumers? That's, IIRC, not even on the 2011 server roadmap.
Did you stop to think about how Alsup's comments on comp.arch echo AMD's slide bullet point....
Alsup: 5% loss due to u-arch trimming to meet higher frequencies
AMD: "without significant loss in single-thread performance"
:yepp:
So... yeah, it's about the changes in the integer side of the "module" from K10.5 to BD. They hope for higher frequencies to get back to gains, but the net is 15-20% for single/low-threaded work. Not nearly enough to catch SB.
It's down for Q4 2011.
Why give an EX-chief architect so much credibility? He's not in charge anymore, hasn't been for quite a while (couple years if I understand correctly). A lot can happen in a couple years (good or bad).
I guess you just like to crap on every AMD thread that exists.
So BD will be the only 8-core chip for desktops. Not to mention there will be a whole range of them (not just one product, judging from AMD's track record).
BD definitely wants some encoding love.
:confused: :confused:
Sandy Bridge LGA-2011 will come in 8- and 6-core flavors for the desktop. (in Q3)
There's a table toward the bottom of the wiki entry that may help you keep track of all the variants:
http://en.wikipedia.org/wiki/Sandy_B...oarchitecture)
With HT too, 8c/16t and 6c/12t, correct? I've been hearing so many different things
now that the 1155 preview is done, and I think s2011 keeps getting lost in the mix. Is there any definitive info on this?
Also, I've heard that the second iteration of BD might have 8c/16t using CMT instead of SMT. Any truth to this? That would be very sweet if it's true :D
Something more official would be nice. Everything has been word of mouth lately. 8c without a doubt will come to servers but what about desktops?
Though I think we can trust ajaidev's words but yeah, it'd be nice to have an official word or slides, just to remove all doubts.
Bulldozer is pretty much an optimized CMT. CMT is supposed to give you twice a core's performance at a greater die size, but AMD settled for greater die savings at 80% of a core's performance. I'm just quoting the slides.
<totally rhetorical question, NOT real numbers>
Which would you rather have:
80% of the performance with 50% of the cost and 50% of the power consumption
100% of the performance with 120% of the cost and 120% of the power consumption
</end rhetorical question>
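To put the rhetorical question in numbers, here is a small Python sketch that divides performance by cost and by power for each option. The figures are the made-up ones from the question itself (explicitly NOT real numbers), and the function and labels are hypothetical:

```python
def efficiency(perf, cost, power):
    """Return (performance per unit cost, performance per unit power)."""
    return perf / cost, perf / power

# (performance, cost, power), each relative to some unnamed baseline.
# These are the rhetorical figures from the post, not measurements.
options = {
    "shared design": (0.80, 0.50, 0.50),
    "full design": (1.00, 1.20, 1.20),
}

for name, (perf, cost, power) in options.items():
    per_cost, per_watt = efficiency(perf, cost, power)
    print(f"{name}: {per_cost:.2f} perf/cost, {per_watt:.2f} perf/watt")
```

With those illustrative inputs the shared design comes out at 1.6 on both ratios against roughly 0.83 for the full design, which is the point the question is making.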
People keep seeing that 80% number and thinking that it is a compromise. What they don't understand is that by sharing components we are able to add more cores in the same die space and same power budget.
It is by no means 80% of today's performance.
People do not get one thing.
The comparison of a BD module having 180% of the performance of 2 cores (which would have 200%) is not made against a 10h core. So a BD module (with two cores) is not 180% of 2 Thuban cores. That is plain dumb to believe.
The 180% is vs 2 proper BD cores. So, instead of having 2 full Bulldozer cores for 200% perf, they chose a module approach, giving 180% but at 50-60% of the size of 2 full cores. Thus, you pack more cores into the same die.
This is exactly the thing people like terrace hang on to, that BD cores lose 5-10% compared to Thuban. That is just naive to believe. They lose 5-10% in fully threaded applications vs 2 theoretical full BD cores.
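The "pack more cores into the same die" argument can be sketched as throughput per unit of die area. This Python snippet takes the 180% and 50-60% figures from the post at face value and assumes throughput scales linearly with the number of modules that fit, which is a simplification; the function name is hypothetical:

```python
def throughput_per_area(module_perf, module_area):
    """Module throughput per unit of die area, relative to two full cores.

    module_perf: module throughput in units of one full core (1.8 = 180%)
    module_area: module size as a fraction of the area of two full cores
    """
    two_full_cores_perf = 2.0  # 200% in an area of 1.0 by definition
    return (module_perf / module_area) / two_full_cores_perf

for area in (0.5, 0.6):
    gain = throughput_per_area(1.8, area)
    print(f"module at {area:.0%} of the area: {gain:.2f}x throughput per area")
```

Under those assumptions the module approach gives about 1.5x to 1.8x the fully threaded throughput for the same die area.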
8 core Sandy has countless times been scheduled as a 2012 product by many reliable sources. Yet you are using wikipedia as your source.
Also, your Q4 2011 for Bulldozer is really crapping on everything that JF and the best AMD sources have been saying. If you want to have honest discussions, its fair. But what you do is simply spread disinformation like a blatant fanboy.
terrace215
http://ladyjava.javaura.com/wp-conte...0/04/troll.jpg
sorry, I couldn't resist :P.
http://www.uploadgeek.com/image-9A57_4C7ADAD3.jpg
This is old "Starting of 2010 old" no other comments....
^
lol what was that censoring good for?
FPU here we come again. :)
It's -5% IPC vs Thuban, then higher frequency due to the longer pipeline with faster pipe stages, estimated at 20-25%. This is for a single thread doing integer work.
So says AMD's ex Chief Architect.
IPC is not "performance" it is "performance/frequency". I believe this is the source of much confusion.
BD will have slightly LOWER (integer) IPC for single threads than Phenom-II, which it will attempt to (more than) make up for using speed-racer frequencies. Put together, you could see a performance increase of 15-20% (this depends on GloFo being able to deliver a good enough gate-first 32nm process, which is... uncertain) , but this is nowhere near enough to catch SB, as it just demonstrated a similar gain over Nehalem/Westmere, and Phenom-2 starts in a big hole relative to Nehalem. This is about integer single- (and therefore also low-) threaded workloads.
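Since performance is IPC times frequency, the two changes multiply rather than add. A quick Python sketch using the -5% IPC and 20-25% frequency figures quoted from the comp.arch post, assuming perfect scaling with clock (the function name is just for illustration):

```python
def relative_performance(ipc_change, freq_change):
    """Net performance change when performance = IPC * frequency."""
    return (1 + ipc_change) * (1 + freq_change) - 1

low = relative_performance(-0.05, 0.20)
high = relative_performance(-0.05, 0.25)
print(f"net gain: {low:.1%} to {high:.1%}")
```

That works out to about 14% to 19%, a touch under the 15-20% you get by simply adding the percentages.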
why hasn't this clown been banned yet?
No, you're wrong.
It's minus 20% performance vs 2 full BD cores. 1 module = 180%, against 200% for 2 theoretical full BD cores. I think that is a clear fact and you're just spinning stuff up.
That's the whole point of going for a module. You could have 2 full cores with 200% performance, but at a much bigger size. JF-AMD says so as well: they chose a smaller die size so they can pack in more modules. They never said the 180% is vs Thuban.
Plus, I don't buy the ex-architect stuff, it's BS. If you were a high-end engineer and, after leaving a company, revealed so much about it, you would get hit with a big lawsuit and would probably never get hired by other companies, since they would fear you'd do the same when working with them.
Work contracts usually have clauses that prevent you from revealing what you worked on during that time and any kind of confidential info.
It's just BS to believe that such a highly placed engineer would throw smack at AMD. Only a kid would believe professionals act like kids when their career is at stake.
You're talking about AMD slides, he's talking about the BD architect comments.
Different things.
The comment was done on comp.arch.
If you'd bother to loiter around and see who posts there, I wouldn't doubt a single thing of what the guy said. If it were like you said, Andy Glew would be in jail by now. :P
I would think a module would be 160% if each core lost 20%. I think the loss is across the 2 cores.
But the advantage, I think, is that a single thread will be much faster than on Thuban. One core (so one thread) can run code on the full 256-bit AVX FP unit. FP is not faster with 2 threads on one module, but the advantage is that lightly threaded FP code will be faster.
Integer IPC should be higher too, even though the pipeline is deeper. As JF-AMD said, Phenom II was 1.5+1.5 and BD (2?) is 2+2 (ALU+AGU).
I think it's a good deal. JF-AMD said that this week we are going to find out how BD handles 4 threads on 4 modules. Will it automatically run them on 2 modules with 2 modules off, or with all 4 modules on?
Huh? K10 has 3 ALUs and 3 AGUs, BD has 2+2. Contrary to what informal & co. were hyping, BD's integer cores are simpler and less powerful than K10's. Which is no surprise, something had to give in order to keep the module size under control.
All the improvements done together with the frequency increase are meant to compensate the 3rd unit. You have the information in the AMD slide ( ...without significant loss on the serial single-threaded workloads components ), you also have the comments of M. Alsup ( ...and
loose a little architectural figure (5%-ish) of merit due to the
microarchitecture ). It all fits together now, irrespective of what marketing is trying to portray.
Without significant loss = loose (5%-ish )
AMD is giving up single-threaded performance and is focusing on throughput. They've realized it is pointless to try and compete with Intel on "fat" cores (already in commercial benchmarks they need a 2-to-1 ratio to stay competitive with Xeons), so the alternative path they are taking is to cram as many cores as possible into a given die size and clock those simple cores as high as possible.
Magny Cours isn't adequate for this since the core size is still too big and the core advantage over Xeons at the same process node is too small. With MC, AMD had a 50% advantage in the number of cores. With BD they will have 60% ( and much higher frequency ) over same timeframe Xeon and this will only increase in the future.
terrace215 totally has never read any of JF-AMD's posts. JF-AMD has said COUNTLESS times that IPC will be HIGHER than K10.
richierich:
I think Movieman is just too scared to open this thread :D . Moderator's horror.
It's a problem when you're denying things when the facts are IN YOUR FACE. -5% IPC compared to Thuban? Wow, where did the "33% more cores yet 50% more perf compared to Magny-Cours on server loads" argument go? Or Dresdenboy's analysis of the uarch, which is at least better than listening to some ex-chip architect who hasn't worked on the processor (of course, I mean BD) for 3 years? Or the fact that none of you guys even mention the improvements in other areas, and keep attacking the fact that AMD removed something. Well, they added stuff as well. Heck, even the fact that it is a ground-up design should mean something.
What Sunfire said anyway. You guys need to read the thread from the start.
3 ALU/AGUs is less than 2 ALU + 2 AGU. Each BD core is stronger than Phenom II at that point.
Bulldozer as a whole will be faster, both per core and per socket, 50% increase with 33% more cores would be impossible if it wasn't.
"Only" 80% increase from second core is the exact same thing as saying 11% faster per thread when only one core is utilized. That means faster at single threads.
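The 80% and 11% figures are two views of the same ratio. A small Python sketch, assuming the slide's 180% is measured against one thread running alone in the module at 100% (function names are hypothetical):

```python
def per_thread_speed(module_scaling):
    """Speed of each thread when both cores in a module are busy,
    relative to a single thread that has the module to itself."""
    return module_scaling / 2

def solo_advantage(module_scaling):
    """How much faster a lone thread runs than one sharing the module."""
    return 1 / per_thread_speed(module_scaling) - 1

shared = per_thread_speed(1.8)
print(f"each of two threads runs at {shared:.0%} of a lone thread")
print(f"a lone thread is {solo_advantage(1.8):.1%} faster than a shared one")
```

So 180% for two threads means 90% each, and 100/90 is where the roughly 11% comes from. Note that this only compares Bulldozer with itself; it is not a comparison against Thuban.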
Most of the facts we have don't fit your theory. The only thing you have - still - is the quote from Mitch Alsup here: http://groups.google.de/group/comp.a...14f6049?hl=de#
And it seems to me that you haven't read all that he has to say.
I've done some calculations about BD. They're my own, so it's pure speculation, nothing real, but it's awesome.
I used a 2.2GHz 12-core Magny-Cours as the base with the +50% perf / +33% more cores claim, and a 7.95x multithread speedup in Cinebench 11.5 for 2P 2x6-core Opterons.
And I used the 80% from the AMD slide to calculate the multithread speedup of BD.
So if my calcs are not f*cked (I hope so), I get:
16 cores (2x8 on G34, or 2x4 modules) with an 8.48x multithread speedup.
I set the IPC of Magny-Cours to 1. So P1 (P in single thread) is 2200*1 = 2200, and P2 (P in multithread) is 2200*1*7.95 = 17.490k.
BD is said to have +50% perf in multithread,
so: 17.490k*1.5 = 26.235k.
So P1 for BD is 26.235k/8.48 = 3093.75.
So P1 BD / P1 Magny-Cours = performance improvement for BD = 3093.75/2200 = 1.41.
We know P = F * IPC, and we know the deeper pipeline targets a 20 to 25% frequency improvement.
With 20% we get 2640MHz, so I use 2.6GHz for a rounder number.
The approximate IPC is P/F = 3093.75/2600 = 1.19, compared to the 1 taken for Magny-Cours.
SO I GUESS I'M CRAZY TO DO SOME STUPID CALCS LIKE THAT BUT I DON'T WANT GO TO ASYLUM ... TY :D
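For anyone who wants to re-run the speculation above, here it is as a short Python script. All inputs are the poster's own numbers; the way the 8.48x speedup is derived (linear scaling from 12 to 16 cores, times the 80% from the slide) is a reading of the post that happens to reproduce the figure, not something it spells out:

```python
# Inputs taken from the post (speculative, not measurements)
mc_clock_mhz = 2200        # Magny-Cours base clock
mc_cores = 12
mc_speedup = 7.95          # Cinebench 11.5 multithread speedup used in the post
bd_cores = 16
module_efficiency = 0.80   # the 80% from the AMD slide
bd_total_gain = 1.5        # "+50% perf" in multithread
bd_clock_mhz = 2600        # 2200 * 1.2 = 2640, rounded down by the poster

# Assumed derivation of the 8.48x figure: scale cores linearly, apply 80%
bd_speedup = mc_speedup * bd_cores / mc_cores * module_efficiency

p1_mc = mc_clock_mhz * 1.0          # single thread, IPC normalized to 1
p2_mc = p1_mc * mc_speedup          # multithread
p2_bd = p2_mc * bd_total_gain
p1_bd = p2_bd / bd_speedup

single_thread_gain = p1_bd / p1_mc
bd_ipc = p1_bd / bd_clock_mhz       # relative to Magny-Cours IPC = 1

print(f"BD multithread speedup: {bd_speedup:.2f}x")
print(f"BD single-thread score: {p1_bd:.2f}")
print(f"single-thread gain:     {single_thread_gain:.3f}")
print(f"implied relative IPC:   {bd_ipc:.3f}")
```

The result stands or falls with the 80% step: a different multithread scaling factor changes the implied IPC directly.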
terrace and savantu are speaking about mystical losses in IPC, so they must have BD chips in their hands, right? Right. :p:
I dunno, but it's pretty amusing to watch.
I've made SS's of the posts for future sigability :lol:
That's a module, mate. 1 core in that module has higher IPC than a Thuban core. Pretty simple. But when both cores in 1 module work on 2 threads, then you lose 10% performance per core because of the shared components. In single-thread scenarios, 1 of the 2 cores works at 100%.
IGNORE SAVANTU AND TERRACE
I am not any kind of genius, but for the love of Pete...
There are only two examples in IC history where a product has been deliberately designed to be slower than possible.
1.) Intel Pentium 4
2.) The new SOC for Xbox360 combining cpu+gfx.
Then we phrase it like 11% more performance in single threaded apps due to the increased amount of resources the core gets when the other is idle. ;)
10% more performance in one thread is compared to a module running two threads. 50% more performance with 33% more cores is a comparison with Magny Cours, both in multithreaded apps.
One is a number showing internal power differences, the other is a comparison with current processors.
Real soon there will be this magic touch... :)
I've seen it before, and it rocks!. ;)
Stupid question: do 2 integer cores mean at least a 30% increase in performance when comparing an Intel core to a BD module, at least going by this article
http://ixbtlabs.com/articles3/cpu/ar...2009-4-p2.html where an additional core (an integer core in the case of BD) adds about a 39% improvement? Sure, there is a slight single-threaded perf loss, IPC improvements come into play and all that, but...
Here is what it means.
Bulldozer cores are like Intel Hyper-threading cores.
The primary difference is that AMD throws more transistors at the problem by giving each thread its own set of integer execution units. Add to that AMD’s distributed schedulers and instruction grouping. This is a clear architectural trade-off of performance and decreased control complexity versus size and increased execution complexity. Replicating two full-featured ALUs uses more die area, but provides higher performance for certain corner cases, and enables a simpler design for the ROB and schedulers.
The honest truth is that NO CPU designed yet can keep up a constant throughput of 2 instructions per clock. So the more efficient design of the integer cores suggests that we shouldn't expect any performance drop at all. [For 99.9% of all user applications]
Someone please explain once more to these two, terrace and savantu, that when AMD speaks of tradeoffs, they do so on Bulldozer vs. Bulldozer, meaning 5% less performance on a BULLDOZER core with 2+2 ALU/AGU vs. a BULLDOZER core with 3+3.
Quote:
All the improvements done together with the frequency increase are meant to compensate the 3rd unit. You have the information in the AMD slide ( ...without significant loss on the serial single-threaded workloads components ), you also have the comments of M. Alsup ( ...and
loose a little architectural figure (5%-ish) of merit due to the
microarchitecture ). It all fits together now, irrespective of what marketing is trying to portray.
I think some people can just be too blind, in denial or wishful thinking, plus it drives me nuts how one can hug the nuts of one single company so tightly it feels like we're talking about a football rivalry.
You should grow up and learn to support healthy competition above all, you have NOTHING to gain if Bulldozer completely fails.