This is a much harder question to answer ... the quick answer is a bit of both.
A somewhat simplified analogy is to think of the CPU as a progression from simple to complex: transistor level, circuit level, then architectural level, in that order, much like cell, organ, organism. At the transistor level, performance is centered on the process and the design of the physical transistor, and within the device there are a multitude of different transistors used for different desired electrical properties... for example, the transistors making up the bits in SRAM are different from the transistors making up the logic circuits. SRAM transistors are much smaller and also require a higher minimum voltage to operate (by this I mean a higher minimum limit, say 1.1 V as opposed to 1.0 V, to work without error -- just an example).
Today's transistors operate with a switching time of 1.0 - 1.5 picoseconds; to simplify, let's assume 1.0 picosecond (if you don't believe me, read for example IBM's process tech papers and they will quote you a td, or gate delay time, of about 1.2 ps for their PMOS transistor on 65 nm). This is the time required for the transistor to charge the channel and turn on, so it is the fastest the transistor can switch. Pad that number out to 5 ps for the sake of engineering margin and you have a transistor today that can switch as fast as 5x10^-12 seconds, or in frequency (1/td) = 200,000,000,000 Hz, i.e. 200 GHz ... so why don't we have 200 GHz processors???
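To put that arithmetic in one place (the 5 ps figure is just the padded gate delay from above, not a process spec), here's a quick sketch:

```python
# Back-of-envelope: the fastest a single transistor could toggle.
# td is the padded gate delay discussed above (assumption: 5 ps).
td = 5e-12               # switching time in seconds
f_max_hz = 1.0 / td      # theoretical max switching rate
print(f"{f_max_hz / 1e9:.0f} GHz")   # -> 200 GHz
```

Of course, as the rest of this post explains, no real processor gets anywhere near that single-transistor limit.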
The answer lies in understanding that transistors are strung together to make a circuit, and the total gate delay of that circuit determines its speed. That is, the total capacitance (which determines the delay) is additive -- or follows an inverse-of-inverses law, depending on how the elements combine -- when you sum up the circuit. Circuits are built into logical blocks and stages (i.e. pipeline stages -- more on this later) that ultimately add up to limit the total achievable clock of the circuit. Circuit designers measure this total delay in a metric called FO4, or fan-out of four (a simplified measure based on an inverter driving four copies of itself). The total FO4 delay is a design limitation and is how designers target their work to hit a target clock (read more about this in IBM's Power6 papers: http://www.research.ibm.com/journal/rd/516/tocpdf.html -- they limited, for example, the critical FPU paths to 13 FO4s).
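Here's a rough sketch of how an FO4 budget turns into a clock target. The per-FO4 delay below is an illustrative assumption, not a quoted process number; the 13 FO4 depth is the Power6 figure mentioned above:

```python
# Cycle time is set by the deepest logic path, measured in FO4 units.
fo4_ps = 20.0        # assumed delay of one FO4 stage, in ps (illustrative)
depth_fo4 = 13       # FO4 depth of the critical path (Power6 FPU figure)

cycle_ps = fo4_ps * depth_fo4        # minimum clock period in ps
clock_ghz = 1000.0 / cycle_ps        # 1000 ps per ns
print(f"cycle = {cycle_ps:.0f} ps -> clock ~ {clock_ghz:.2f} GHz")
```

With those assumed numbers you land in the multi-GHz range, which is the right ballpark for why designers obsess over shaving FO4s off the critical path.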
In the end, two things in general affect the clockability of a processor -- the strength, or switching speed, of the individual transistor (process driven) and the depth of the circuit (design driven). Generally speaking, the more complex the functional block doing the work (i.e. the higher its transistor count), the higher the FO4 delay and the slower the ultimate clock speed for a given process type.
This is why long-pipeline CPUs are said to be designed to clock higher -- here is how it works. An OoOE superscalar processor does 5 general things when running code -- fetch, decode, reorder, execute, retire (unreorder). The first three are complex and are what is sometimes referred to as the pipeline. The work done in the fetch, decode, and reorder phases can be broken down into stages, but the total work through the three is always the same. Thus I could break it down into 10 stages (say 3 for fetch, 3 for decode, 4 for reorder, as an example), but designers have broken it down even further, into say 30 stages, to do the same amount of work.
The complexity and transistors spent per stage in a 10-stage design are much higher, the FO4 delay is much longer, and clockability is not as good as in a 30-stage design where each stage has fewer transistors, lower FO4 delays, and hence higher clockability. This is mother nature's cruel little joke ... to extract high IPC and better per-clock efficiency one must add transistors to the equation, but adding transistors also makes clocking the device harder.
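The trade-off above can be sketched numerically. All the numbers below (total FO4 budget, per-stage latch overhead, per-FO4 delay) are illustrative assumptions chosen to show the shape of the effect, not design figures:

```python
# Same total logic work, split into fewer (deeper) or more (shallower) stages.
total_fo4 = 300      # total FO4 delay of fetch/decode/reorder logic (assumed)
latch_fo4 = 3        # fixed latch overhead added per pipeline stage (assumed)
fo4_ps = 20.0        # assumed FO4 delay in picoseconds

for stages in (10, 30):
    per_stage = total_fo4 / stages + latch_fo4   # FO4 depth of one stage
    cycle_ps = per_stage * fo4_ps
    print(f"{stages} stages: {per_stage:.0f} FO4/stage -> "
          f"{1000.0 / cycle_ps:.2f} GHz")
```

Note that tripling the stage count doesn't triple the clock: the fixed latch overhead per stage eats a growing fraction of each cycle, which is one reason pipelining can't be pushed arbitrarily deep.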
It gets even more complex than this ... beyond my understanding (I am always trying to learn more), but in a nutshell this is how both process and design play in.
AMD has engineered an elegant native, monolithic quad-core CPU -- but elegant does not make it technologically superior. AMD has great technology going into the processor, but it also has some baggage holding it back... Hector called it the most complex x86 CPU to date, and he is absolutely right -- and high complexity is harder to clock, for the reasons I gave above (and if you read the IBM paper you will understand them better). The inability to clock higher on 65 nm is a combination of a slower overall transistor in the 65 nm process and a tremendously complex design.
Intel is advantaged over AMD on both fronts, which is what makes the Intel products so potent in the battle of the big two... Intel wins on IPC and they win on process, which drives the clock equation (to a large extent). Intel also uses 3 simple decoders and 1 complex; AMD uses 3 complex. Frankly, I am impressed that Conroe could clock as high as it did considering Intel more than halved the number of pipeline stages. The one area where AMD really still holds the advantage is aggregate BW, but that advantage doesn't show up until 2-socket high-BW server loads, and pretty much most 4P setups (except it appears Dunnington is changing the game in that area as well).
Jack