Since we're comparing with Sandy Bridge here, guys... is there even an 8-core counterpart for SB? You make it sound like it's easy to fit 8 cores. AMD just did, and even then they went the module route to reduce die area.
Mitch Alsup says that when he left AMD, Bulldozer was looking at a performance *decrease* of 5% from the microarch-slimming, together with a hoped-for 20-25% frequency increase from the pipeline-lengthening.
Even assuming *perfect* perf scaling with clock, that's roughly a 15-20% increase over Ph-II.
http://groups.google.de/group/comp.a...14f6049?hl=de#
When I left, BD was supposed to be 20-25% faster frequency wise, and
lose a little architectural figure (5%-ish) of merit due to the
microarchitecture.
So they are really counting on speed-racer to bring the performance increase.
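For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes performance is simply IPC times clock (the "perfect scaling" case mentioned above), and the function name is made up for illustration.

```python
def net_speedup(ipc_change, clock_change):
    """Combine relative IPC and clock changes into one net performance change.

    Assumption: performance scales perfectly with IPC x clock.
    Inputs are fractions, e.g. -0.05 for a 5% IPC loss.
    """
    return (1.0 + ipc_change) * (1.0 + clock_change) - 1.0


# Figures from the quote: about 5% lost in IPC, 20-25% gained in clock.
low = net_speedup(-0.05, 0.20)
high = net_speedup(-0.05, 0.25)
print(f"net change: {low:.1%} to {high:.1%}")
```

Compounding the two factors lands slightly under the simple "add them up" figure, since the clock gain applies to an already reduced IPC.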
My guess would be 17 stages. A speed racer in a modern process is arguably going to be more efficient than a brainiac, as long as you don't go over the top with pipelining. Increasing IPC has much, much worse diminishing returns, multicore aside.
Main Rig:
Processor & Motherboard:AMD Ryzen 5 1400 & Gigabyte B450M-DS3H
Random Access Memory Module:Adata XPG DDR4 3000 MHz 2x8GB
Graphic Card:XFX RX 580 4GB
Power Supply Unit:FSP AURUM 92+ Series PT-650M
Storage Unit:Crucial MX 500 240GB SATA III SSD
Processor Heatsink Fan:AMD Wraith Spire RGB
Chassis:Thermaltake Level 10 GTS Black
More interesting bits on the pipeline changes:
Most of what got cut was cut to enable the 12-gate pipe (if indeed
they did achieve that.) In Athlon/Opteron, one can forward a byte,
word, double, or quad from any of the 5 results to any operand of any
6 integer computation units {ALU, AGU}. BD can't (or couldn't when
I left) forward anything to anywhere, and eats a little AFoM because
of this. This probably saved 2 real gate delays. Lopping off the extra
ALU, and a few other things saves another gate and we are then within
spitting distance (1-gate) of the desired 12-gate pipe in the integer
pipe. More lopping occurred in the L1 cache pipe to reach the cycle time
goal.
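To put numbers on the gate budget in that quote, here is a rough sketch. The starting depth is only what the quoted savings imply if you add them back up (my reading, not a figure the quote states), and the clock estimate assumes cycle time is proportional to gate delays per stage, which ignores latch overhead, clock skew and wire delay. All names are made up for illustration.

```python
TARGET_GATES = 12       # desired gate delays per stage, from the quote
REMAINING_GAP = 1       # "spitting distance (1-gate)" still left to cut
SAVED_FORWARDING = 2    # saved by restricting result forwarding
SAVED_ALU_ETC = 1       # saved by dropping the extra ALU and a few other things


def implied_starting_depth():
    """Gate delays per stage before the cuts, as implied by the quoted savings."""
    return TARGET_GATES + REMAINING_GAP + SAVED_FORWARDING + SAVED_ALU_ETC


def ideal_clock_gain(old_gates, new_gates):
    """Clock gain if cycle time were exactly proportional to gates per stage.

    Assumption: no fixed per-stage overhead, so this is an optimistic bound.
    """
    return old_gates / new_gates - 1.0


start = implied_starting_depth()
print(f"implied starting depth: {start} gates")
print(f"ideal clock gain at {TARGET_GATES} gates: {ideal_clock_gain(start, TARGET_GATES):.0%}")
```

The ideal figure comes out higher than the 20-25% frequency gain quoted earlier in the thread. That would fit with fixed per-stage overhead not shrinking along with the logic, but that explanation is my guess and not something the quote says.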
Bobcat? Come on, AMD, you had better names.
Why is no one else wondering about Bulldozer's Decode details?
Fast computers breed slow, lazy programmers
The price of reliability is the pursuit of the utmost simplicity. It is a price which the very rich find most hard to pay.
http://www.lighterra.com/papers/modernmicroprocessors/
Modern Ram, makes an old overclocker miss BH-5 and the fun it was
[MOBO] Asus CrossHair Formula 5 AM3+
[GPU] ATI 6970 x2 Crossfire 2Gb
[RAM] G.SKILL Ripjaws X Series 16GB (4 x 4GB) 240-Pin DDR3 1600
[CPU] AMD FX-8120 @ 4.8 ghz
[COOLER] XSPC Rasa 750 RS360 WaterCooling
[OS] Windows 8 x64 Enterprise
[HDD] OCZ Vertex 3 120GB SSD
[AUDIO] Logitech S-220 17 Watts 2.1
He left right around the time BD 1 was canceled (end of 2007) and BD 2, the one that is coming out 2 years later, was starting to take shape.
BD 2 is not entirely new, since it's naturally based on the BD 1 version that was supposed to come out at 45nm. I suspect that, as in the case of Barcelona, they were power limited at 45nm and performance was not where they wanted it. So they went with an improved core, done on a smaller node, and delayed it 2 years (2009->2011). This gives them more room for improvements at the core level and higher clocks, all within the same power envelope. I expect a 15-20% core-level improvement plus 30% in clocks.
It's a 4+1 (branch fusion supported) decoder at the front end, with a so-called "accelerate mode" if certain conditions are met. AMD is not disclosing anything about this particular feature, but essentially it increases the decode rate by some unknown factor.
Last edited by informal; 08-29-2010 at 03:08 AM.
So, not to talk out of school, but I did ask one of our design engineers about the ability of the shared front end to keep two integer cores fed and he had absolutely no concern because of things that are done to improve the front end.
Can't say any more beyond that because a.) it is not public info and b.) I don't really know enough about how those things work to accurately describe them.
In my mind this is not a concern of the engineering team. After all it is a completely new design. If they had taken the front end off of an existing product it might be more of an issue, but as I understand it, that has not happened.
Given that Phenom has a 32-byte pick buffer and a 408-bit fetch, I see that as highly unlikely.
Add to that the fact that Bobcat has a 22-byte decode.
But without more details, optimizing the decode rate is impossible.
For example, can a single thread take up the entire decode unit for a couple clock cycles if the other thread is sleeping?
Could you find out whether the threads share a pick buffer or each has its own?
And what size(s)?
This almost seems like a single module may possibly use both integer units along with the FPU when executing a single thread. If this is the case, single threaded performance on BD will not be a weak point at all. I remember that old marketing slide saying BD would have the highest single threaded performance ever. It better be true dammit.
A single thread can occupy all the shared resources in the module. The decoder and the whole front end, with the extra beefed-up prefetch, are shared. The FPU is shared.
Integer cores can't "combine" to work on a single integer thread, but one integer core can have the whole FPU to itself. The FPU can also be used a la SMT by the 2 integer cores. Whatever is shared in the module can be used by either integer core.
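The sharing rules described above can be written down as a toy model. This is only a sketch of the description in this post: the names are hypothetical, and the even split between two active threads is an assumption made to keep it simple, since nothing public says how the hardware actually arbitrates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreadResources:
    integer_cores: int      # integer cores the thread runs on (never more than one)
    fpu_share: float        # fraction of the module's shared FPU available
    front_end_share: float  # fraction of the shared fetch/decode available


def resources_per_thread(active_threads: int) -> ThreadResources:
    """What one thread gets in a module running `active_threads` threads.

    Assumption: shared units are split evenly when both threads are active.
    """
    if active_threads not in (1, 2):
        raise ValueError("a module runs one or two threads")
    share = 1.0 / active_threads
    # Integer cores are dedicated: a lone thread still gets exactly one.
    return ThreadResources(integer_cores=1, fpu_share=share, front_end_share=share)


print(resources_per_thread(1))
print(resources_per_thread(2))
```

The point it captures is that a lone thread gains the full FPU and front end but never a second integer core.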
Last edited by informal; 08-29-2010 at 08:10 AM.