You guys should check who posted that; it's none other than me.
The BD 20-questions piece is broken up into sections, so it seems we'll get the next set answered very soon.
Since we're comparing with Sandy Bridge here, guys... is there even an 8-core counterpart for SB? You make it sound like it's easy to fit 8 cores. AMD just did, and even then they went the module route to reduce die area.
Mitch Alsup says that when he left AMD, Bulldozer stood at a performance *decrease* of about 5% from the microarchitecture slimming, together with a hoped-for 20-25% frequency increase from the lengthened pipeline.
Even assuming *perfect* performance scaling with clock, that's only a 15-20% increase over Ph-II.
http://groups.google.de/group/comp.a...14f6049?hl=de
Quote:
When I left, BD was supposed to be 20-25% faster frequency wise, and
lose a little architectural figure (5%-ish) of merit due to the
microarchitecture.
So they are really counting on speed-racer to bring the performance increase.
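Just to make the arithmetic explicit, here's a quick back-of-the-envelope sketch (plain Python; the 5% and 20-25% figures are straight from the quote above, nothing else is assumed):
Code:
# Back-of-envelope check of Alsup's numbers: ~5% IPC lost to the slimmed
# microarchitecture, 20-25% clock gain from the longer pipeline, and
# perfect scaling with clock.
ipc_factor = 0.95                     # ~5% architectural figure of merit lost
for clock_gain in (1.20, 1.25):
    net = ipc_factor * clock_gain
    print(f"clock +{clock_gain - 1:.0%} -> net ~{net - 1:.0%} over Ph-II")
# Prints roughly +14% and +19%, i.e. the 15-20% range mentioned above.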
My guess would be 17 stages. A speed-racer design in a modern process is arguably going to be more efficient than a brainiac, as long as you don't go over the top with pipelining. Increasing IPC runs into much worse diminishing returns, multicore aside.
More interesting bits on the pipeline changes:
Most of what got cut was cut to enable the 12-gate pipe (if indeed
they did achieve that.) In Athlon/Opteron, one can forward a byte,
word, double, or quad from any of the 5 results to any operand of any
6 integer computation units {ALU, AGU}. BD can't (or couldn't when I
left) forward anything to anywhere, and it eats a little AFoM because
of this. This probably saved 2 real gate delays. Lopping off the extra
ALU, and a few other things saves another gate and we are then within
spitting distance (1-gate) of the desired 12-gate pipe in the integer
pipe. More lopping occurred in the L1 cache pipe to reach the cycle time
goal.
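To see why that full forwarding network is expensive, here's a rough path count (the 5 results and 6 integer units come from the quote; the two source operands per unit is my assumption):
Code:
# Rough illustration of why a full bypass network costs gate delay
# (counts only; "5 results" and "6 integer units" are from the quote,
# two operands per unit is assumed).
results = 5            # results that can be forwarded each cycle
units = 6              # ALUs + AGUs in the K8/K10 integer cluster
operands_per_unit = 2  # assumed: two source operands per unit

muxes = units * operands_per_unit   # 12 operand muxes
inputs_per_mux = results + 1        # 5 bypass paths + the register file read
total_paths = muxes * results       # 60 distinct forwarding wires
print(muxes, inputs_per_mux, total_paths)
# A wider mux (and the wiring feeding it) sits directly in the critical
# path of every operand input; pruning it is how a couple of gate delays
# disappear, at the cost of a small AFoM hit.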
Bobcat? Come on, AMD, you had better names.
Why is no one else wondering about Bulldozer's Decode details?
He left right around when BD 1 was canceled (end of 2007) and BD 2, the one that is coming out two years later, was starting to take shape.
BD 2 was not all-new, since it's naturally based on the BD 1 design that was supposed to come out at 45nm. I suspect that, as with Barcelona, they were power-limited at 45nm and performance was not where they wanted it, so they went with an improved core on a smaller node and delayed it two years (2009 -> 2011). This gives them more room for core-level improvements and higher clocks, all within the same power envelope. I expect 15-20% core-level improvement plus 30% in clocks.
It's a 4+1 (branch-fusion-supported) decoder at the front end, with a so-called "accelerate mode" if certain conditions are met. AMD is not disclosing anything about this particular feature, but essentially it increases the decode rate by some unknown factor.
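Here's a toy sketch of how branch fusion alone can push the effective decode rate above 4 x86 instructions per cycle. The cmp/test+jcc fusion rule is an assumption borrowed from how other x86 cores do it, and decode_cycle is a made-up helper, not anything AMD has disclosed:
Code:
# Toy model: with branch fusion, a cmp/test followed by a conditional
# jump decodes as one macro-op, so 5 x86 instructions can leave the byte
# stream while the decoder still issues only 4 macro-ops.
DECODE_WIDTH = 4  # macro-ops sent down the pipe per cycle (assumed)

def decode_cycle(instrs):
    """Return (x86 instructions consumed, macro-ops produced) in one cycle."""
    consumed = ops = 0
    i = 0
    while ops < DECODE_WIDTH and i < len(instrs):
        if instrs[i] in ("cmp", "test") and i + 1 < len(instrs) and instrs[i + 1].startswith("j"):
            i += 2          # fused pair: two instructions, one macro-op
            consumed += 2
        else:
            i += 1
            consumed += 1
        ops += 1
    return consumed, ops

print(decode_cycle(["add", "cmp", "jne", "mov", "sub", "xor"]))
# -> (5, 4): five instructions consumed, only four decode slots used.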
So, not to talk out of school, but I did ask one of our design engineers about the ability of the shared front end to keep two integer cores fed and he had absolutely no concern because of things that are done to improve the front end.
Can't say any more beyond that because a.) it is not public info and b.) I don't really know enough about how those things work to accurately describe them.
In my mind this is not a concern of the engineering team. After all it is a completely new design. If they had taken the front end off of an existing product it might be more of an issue, but as I understand it, that has not happened.
Given that Phenom has a 32 BYTE pick buffer and a 408-bit fetch, I see that as highly unlikely.
Add to that the fact that Bobcat has a 22-byte decode.
But without more details, optimizing the decode rate is impossible.
For example, can a single thread take up the entire decode unit for a couple clock cycles if the other thread is sleeping?
Could you find out whether the threads share a pick buffer or each has its own,
and in either case, what size(s)?
This almost makes it seem like a single module may be able to use both integer units along with the FPU when executing a single thread. If that is the case, single-threaded performance on BD will not be a weak point at all :yepp:. I remember that old marketing slide saying BD would have the highest single-threaded performance ever. It had better be true, dammit.
A single thread can occupy all the shared resources in the module. The decoder and the whole front end, with the extra beefed-up prefetch, are shared. The FPU is shared.
The integer cores can't "combine" to work on a single integer thread, but one integer core can have the whole FPU to itself. Likewise, the one FPU can be used SMT-style by the two integer cores. Whatever is shared in the module can be used by either integer core.
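A minimal sketch of those sharing rules, assuming only what's stated above (the resource names are mine, not an official block diagram):
Code:
# Minimal model of a Bulldozer module's sharing rules as described above.
# Resource names are placeholders, not an official block diagram.
class Module:
    shared = {"fetch/prefetch", "decode", "FPU"}      # one copy per module
    per_core = {"int scheduler", "ALUs/AGUs"}         # one copy per integer core

def resources_for(threads_in_module):
    """What a single thread can draw on, per the posts above."""
    if threads_in_module == 1:
        # A lone thread gets every shared resource (whole FPU included)
        # plus its own integer core -- but NOT the second integer core.
        return Module.shared | Module.per_core
    # Two threads: each keeps its own integer core and time-shares the
    # front end and FPU, SMT-style.
    return Module.per_core | {"share of " + r for r in Module.shared}

print(resources_for(1))
print(resources_for(2))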