Dothan Article V2 -- from Matt, your friendly neighborhood CompE!



matt9669
01-23-2005, 03:47 AM
Intel’s Dothan Part Deux: Tiny Tim Meets California’s Gov

Ten years ago, who would have thought Arnold would be in the governor’s chair, especially in deep-blue California . . .

Several months ago, who would have thought Intel had something cool, quiet and beastly powerful up their sleeves? ;)

Matt, your friendly neighborhood CompE is back in action, to unravel the mysteries of the (computing) world. (The Bermuda Triangle is a bit out of my reach, I think . . . :D) If you haven’t read the previous article, “Intel’s Dothan: The Little Engine That Could? (http://www.xtremesystems.org/forums/showthread.php?t=50219)”, please do so before proceeding, otherwise a mob of angry neo-anarchist nuns will invade your house and rape your children. Of course, if you have no children, guess there’s no harm in reading on . . .

Dothan, while a bit on the wild side, looks increasingly appealing as 533FSB and Alviso-based motherboards lie on the horizon. But with all the press and marketing efforts going toward the {understatement}successful{/understatement} Centrino brand, relatively little is known about the brains of the operation, the Pentium M core architecture. This humble creature boasts none of the features of today’s more forward-looking platforms: Athlon 64 with 64bit, NX bit and on-die memory controller, with SSE3 coming soon and dual core/DDR2 coming later - and Prescott with DDR2, HT, and EM64T/EDB, with rapidly approaching dual core versions and sooner-than-expected virtualization technologies.

Despite all the aforementioned buzzwords, Dothan still brings the heat on clock-for-clock performance, based on a design from yesteryear quietly restored to life to rule the mobile roost. After all, Intel told us the P6 core was old hat, past its prime, ready to be retired. Perhaps the Pentium 3’s of the world got together, decided they’d heard enough Netburst propaganda to invoke gag reflexes, and a la Frankenstein, created a monster from the Ghost of CPU Engineers’ Past. Now that monster has broken its mobile chains with blistering performance – but, dear Lazarus, how is this possible? :idea:

Pentium Pro Is Alive and Kicking

Anyone here remember the days of the Pentium Pro? I remember being a little kid in the barber’s chair, talking to my barber and fellow enthusiast Jeff about how sweet it would be to have a dual 200MHz Pentium Pro system. We’ve grown a lot since then – Jeff’s comfy with a laptop, 200MHz is slow for an embedded processor, and me – OK, I lied, I’m running a Vapochill and putting together an SLI rig. :hehe:

In reality, the Pentium M doesn’t go back that far. The Pentium Pro core dubbed “P6” has changed substantially from its initial incarnation, and P-M beats with the heart of a Pentium 3 (no surprise there). However, Dothan’s per-clock performance is much heftier than any P3 I’ve ever known – back in the day, the classic Slot A Athlon (codename K7, later K75 at 180nm and Mustang with Cu interconnect) and the Slot 1 P3 (codename Katmai, later Coppermine at 180nm w/ on-die cache) were on arguably equal footing. The move to SocketA/370 (Thunderbird/Mustang and Coppermine, respectively) didn’t make either a clear favorite, only increasing the performance divide in specific areas.

But Dothan bests even an Athlon FX in clock-for-clock performance (gaming benchmarks (http://www.gamepc.com/labs/view_content.asp?id=dothangaming&page=1), which I’ll focus on for this article), and the 64/FX has significantly improved IPC over Athlon XPs. Not to mention Dothan’s power dissipation per clock is nearly one fourth that of a 0.13um A64. Quite frankly, that’s amazing!!

(Now a few oranges have fallen into the apple barrel, because Dothan is a mobile, 90nm chip: we don’t have good power figures on current 90nm A64’s, and the San Diego/Venice chips on strained SOI are not here yet. Still, power differences of that magnitude don’t disappear overnight.)

Dothan’s power efficiency isn’t difficult to explain (anyone notice me avoiding the IPC? :sofa: Good . . . :comp10: ). While die shrinks don’t bring the same power savings they used to (for numerous reasons), there’s nothing inherently power-greedy about 90nm production. Transistors can be designed for two kinds of switching behavior – fast, with high power leakage, or slow, with low power leakage. I don’t have to tell you what kind Prescott uses *wink*

Since Dothan runs at much lower clockspeeds, Intel can use transistors that do not switch as fast, saving a large amount of power leakage, and enabling lower voltage operation. The P-M design needs fewer transistors for execution, with the current 2MB cache taking the lion’s share of the die space, and it’s easy to limit power loss on cache memory. Further, though it’s not mentioned in anything I read, I would suspect Dothan makes aggressive use of clock gating, a simple but relatively new technique – quite simply, the clock signal is “gated” or turned-off to portions of the chip that are not in use. Core architectural features also contribute to power savings, but I won’t mention these: you can read them in my primary reference for the next section, Ars Technica’s Pentium M article (http://arstechnica.com/articles/paedia/cpu/pentium-m.ars).
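For the code-minded, here is a rough Python sketch of why lower voltage and clock gating pay off. It uses the textbook CMOS switching-power estimate (activity x capacitance x voltage squared x frequency) plus a flat leakage term. Every name and number in it (the block list, the capacitances, the leakage figure) is a made-up assumption for illustration, not an Intel figure.

```python
# Toy model: switching power ~ activity * C * V^2 * f, plus a flat leakage term.
# All numbers are hypothetical and chosen only to make the arithmetic easy.

def dynamic_power(activity, capacitance_f, voltage_v, freq_hz):
    """Textbook CMOS switching-power estimate, in watts."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz


def chip_power(blocks, voltage_v, freq_hz, leakage_w):
    """Sum switching power over all blocks that still receive a clock.

    A gated block gets no clock signal, so in this model it contributes
    no switching power at all. Leakage is added as a fixed amount.
    """
    switching = sum(
        dynamic_power(b["activity"], b["cap_f"], voltage_v, freq_hz)
        for b in blocks
        if not b["gated"]
    )
    return switching + leakage_w


# Hypothetical chip with two equally sized blocks.
blocks = [
    {"name": "integer core", "activity": 0.2, "cap_f": 10e-9, "gated": False},
    {"name": "fpu/sse", "activity": 0.2, "cap_f": 10e-9, "gated": False},
]
```

With these made-up numbers at 1.0 V and 2 GHz with 1 W of leakage, chip_power(blocks, 1.0, 2e9, 1.0) comes out at about 9 W, and gating the idle FPU/SSE block drops it to about 5 W. Because voltage is squared, running the same block at 1.4 V instead of 1.0 V costs nearly twice the switching power.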

Killer IPC – Sometimes A Bit of Tweaking Makes All The Difference

Do yourself a favor and read Ars’ article, if you want to know more about what makes the P-M unique (warning, extremely technical!). If you’d rather just have a general idea of how it accomplishes more with less, please keep reading.

We know Banias/Dothan have SSE2 units, and have around 12 to 14 pipeline stages, but details on the SSE2 units and the exact pipeline depth remain a mystery. (Intel has chosen not to reveal that information - remember, Centrino is the press-maker, not the processor behind it.) However, that information is not vital to understanding the chip’s inner workings, at least from a performance standpoint. (Whoa d00d, it’s snowing at 4AM! ADD under control :shakes: ) Intel made two key changes to the Pentium 3 core – improved branch prediction, which builds on the P4’s strengths, and micro-ops fusion. I won’t cover the stack engine: though it does allow the CPU to do more useful work, it primarily adds power savings, in my ever-so-humble :devil: opinion.

In microprocessor terms, a “branch” is a point at which a program could take two or more execution paths. Because CPUs keep working along a guessed path until the direction of the branch is actually determined, accurate branch prediction is crucial in keeping the processor doing useful work – a branch misprediction requires the CPU to flush the pipeline and start over, greatly delaying execution of the correct path. Dynamic branch prediction is nothing new (x86 chips have had it since the original Pentium, and the P4 and Athlon 64 both pushed it further), but the Pentium M takes it further still, making it possibly the most advanced branch prediction unit in wide use today (the PowerPC 970 or G5 is comparable, but that’s not x86 so who gives a :rotf: :ROTF: :lol: :hehe: )
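To put rough numbers on what a misprediction costs, here is a minimal Python sketch of a classic 2-bit saturating counter predictor, the generic textbook scheme and not the Pentium M's actual predictor. The 12-cycle flush penalty is my own assumption, picked loosely from the 12 to 14 stage pipeline mentioned above, and the one-cycle-per-instruction cost model is deliberately naive.

```python
def simulate_two_bit(outcomes, start_state=2):
    """Run a 2-bit saturating counter over a list of branch outcomes.

    outcomes: iterable of bools, True meaning the branch was taken.
    States 0 and 1 predict "not taken", states 2 and 3 predict "taken".
    Returns the number of mispredictions.
    """
    state = start_state
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Nudge the counter toward the real outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


def cycles(n_instructions, misses, flush_penalty):
    """Naive cost model: one cycle per instruction plus a fixed penalty per flush."""
    return n_instructions + misses * flush_penalty


# A loop branch that is taken 9 times and then falls through, run 10 times over.
loop_pattern = ([True] * 9 + [False]) * 10
loop_misses = simulate_two_bit(loop_pattern)
```

On that loop pattern the counter guesses wrong once per pass, at every loop exit, for 10 misses in total. With the assumed 12-cycle penalty, cycles(100, loop_misses, 12) works out to 220, so even a 90% hit rate more than doubles the idealised run time.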

Entirely new is the “loop detector”, which gives the branch prediction unit better forewarning of when execution is about to exit a loop. Again, if the prediction unit guesses wrong, a pipeline flush is required. As with all things Netburst, the P4 has a more brute-force method of guessing the loop exit point, by keeping a history of previous loop exits in the branch history table (BHT) – P-M essentially makes more efficient use of its BHT. Banias/Dothan are also equipped with the “indirect branch predictor”. Direct branches are those where the program has already spelled out where to go next if the branch is taken, much like a pre-rendered cutscene; indirect branches must have their next path determined at runtime, much like realtime cutscenes require calculation. Instead of merely using the BHT as a history for the branch predictor’s reference, the P-M actually stores the targets that indirect branches like to take, giving the prediction unit more relevant information to use.
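Here is how I picture the loop detector idea in a few lines of Python. This is a toy of my own devising, not Intel's circuit: the LoopPredictor class and count_misses helper are hypothetical names, and the only thing the model learns is how many times the loop branch was taken before the previous exit.

```python
class LoopPredictor:
    """Toy loop detector for a single loop branch.

    It remembers how many times the branch was taken before the last
    exit and predicts an exit when the current pass reaches that count.
    """

    def __init__(self):
        self.learned_trip = None  # taken-count observed before the last exit
        self.count = 0            # taken-count so far in the current pass

    def predict(self):
        """Return True for "taken" (keep looping), False for "exit now"."""
        if self.learned_trip is not None and self.count == self.learned_trip:
            return False
        return True

    def update(self, taken):
        """Feed the real outcome back in after the branch resolves."""
        if taken:
            self.count += 1
        else:
            self.learned_trip = self.count
            self.count = 0


def count_misses(predictor, outcomes):
    """Count mispredictions over a sequence of outcomes (True = taken)."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# Same workload shape as a 10-iteration loop run 10 times over.
loop_pattern = ([True] * 9 + [False]) * 10
loop_misses = count_misses(LoopPredictor(), loop_pattern)
```

On this pattern the toy detector misses only the very first exit and then calls every later one correctly, where a plain 2-bit saturating counter would miss once per pass. It falls over as soon as the trip count changes between passes, which is one reason a real design would not rely on this alone.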

The Pentium M’s micro-ops fusion, on the other hand, is not a P4 hand-me-down; it’s new to this core. In modern x86 CPUs, x86 instructions from software programs are broken down into smaller “micro-ops” for the processor to handle internally. To do this, processing units called “x86 decoders” are used to break original x86 instructions into one or more micro-ops – the P4, for example, has a single such unit feeding its trace cache, while the P3 has three. However, only one of the P3’s decoders can handle instructions that break down into two or more micro-ops. Two commonly used kinds of x86 instruction, which could only be handled by one decoder in the P3, can be handled by all three decoders in the P-M, so up to three times as many of these instructions can be decoded in parallel, and certain memory-heavy operations get three decoder targets instead of just one.

How does this work? In the P-M, the two so-called “simple” x86 decoders were not granted the capability to handle multiple micro-ops, as doing so would require too much power. However, in the two previously mentioned common x86 instructions, the output micro-ops are “fused” or handled as a single unit, allowing the simple decoders to handle these instructions where the P3’s simple decoders could not. In my yet-again-humble :devil: opinion, micro-ops fusion is the single most effective improvement in the P-M – you can see the possibilities when three operations can be handled at once!
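If you want to see the decoder bottleneck in action, here is a rough Python sketch of a three-decoder front end: one complex decoder that takes anything, and two simple decoders that only take single-micro-op instructions. Treating a fused pair as one micro-op is my simplifying assumption for illustration, and the decode_cycles function is a hypothetical toy, not a cycle-accurate model of either the P3 or the P-M.

```python
def decode_cycles(uop_counts, fused):
    """Count cycles needed to decode an instruction stream in program order.

    uop_counts: micro-ops per instruction, in program order.
    fused: if True, every 2-micro-op instruction is treated as a single
           fused micro-op, so the simple decoders can accept it.
    """
    if fused:
        uop_counts = [1 if n == 2 else n for n in uop_counts]
    cycles = 0
    i = 0
    while i < len(uop_counts):
        cycles += 1
        i += 1  # the complex decoder always takes the next instruction
        for _ in range(2):  # two simple decoders, single micro-op only
            if i < len(uop_counts) and uop_counts[i] == 1:
                i += 1
            else:
                break  # in-order decode: stall until the next cycle
    return cycles


# Six back-to-back instructions that each break into two micro-ops.
heavy_stream = [2] * 6
unfused_cycles = decode_cycles(heavy_stream, fused=False)
fused_cycles = decode_cycles(heavy_stream, fused=True)
```

For that worst-case stream the unfused front end needs 6 cycles, since only the complex decoder can help, while the fused one needs 2. A more realistic mix such as [1, 2, 1, 1, 2, 1] goes from 3 cycles to 2, so the real-world gain depends heavily on how often those two-micro-op instructions show up.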

But Mommy: Where’s the Hyper-Threading?

Whew – almost done folks.

What I find interesting is that in another Ars article, the question is raised – why no hyper-threading for the Pentium M? I don’t believe the chip could use it, or rather, because of the chip’s extremely high IPC, it’s less likely that execution units are sitting idle to handle work from other threads. The P4’s obvious low IPC allows one to imagine more execution units sitting idle in a given clock cycle, making HT a valuable asset.

As for dual core on Dothan, it will happen eventually. A little bird whispered of a likely scenario: when 65nm comes knocking, Dothan chips will be granted a second core, with the second core only in use when the system is plugged into the wall. Given one of Centrino’s highest priorities is low power consumption, I’d take this to be the most probable use of dual core, with desktop versions being able to utilize both cores 24/7 at the cost of power dissipation.

That is, assuming Intel allows desktop versions . . . wait, I’m talking to the wrong audience, Intel doesn’t have to allow desktop varieties for us to use them as such :cool:

What I mean is, don’t hold your breath for Intel to take this mainstream.

It’s late, I’m tired, and (hopefully! Goodness . . .) there’s nothing more to say. Here’s your friendly neighborhood computer engineer signing off . . .

*snore* . . . Kirsten . . . *snore*

-- Matt

kristos
01-23-2005, 06:43 AM
well I didn't encounter any, not that I paid much attention to it.


what do you mean by this? :

"As for dual core on Dothan, it will happen eventually. A little bird whispered of a likely scenario: when 65nm comes knocking, Dothan chips will be granted a second core, with the second core only in use when the system is plugged into the wall. Given one of Centrino’s highest priorities is low power consumption, I’d take this to be the most probable use of dual core, with desktop versions being able to utilize both cores 24/7 at the cost of power dissipation."

second core only in use when it's plugged into the wall? I'm probably taking this too literally but it doesn't make sense to me

2 cores on one chip, power on, both will work, power off, both won't work no?

or do you mean the dual cores will have one core disabled on laptops and both cores enabled in desktops?

:confused:

enzoR
01-23-2005, 08:55 AM
he means when the laptop is running on battery power only 1 core will work. the other one will be disabled. when plugged into the wall BAMM supercharged laptop :D

Very nice article again. i enjoyed reading it more than the first. i'm sure the next one will be even better :) :toast: can't wait for that one.

"Further, though it’s not mentioned in anything I read, I would suspect Dothan makes aggressive use of clock gating, a simple but relatively new technique – quite simply, the clock signal is “gated” or turned-off to portions of the chip that are not in use."
your suspicion is correct. Dothan does do that but only to the Cache. it turns parts of its cache off when not needed. :D

charlie
01-23-2005, 09:05 AM
:thumbsup:

Northwood
01-23-2005, 10:03 AM
Read This (http://www.xtremesystems.org/forums/showthread.php?p=670078#post670078) thread that i started, the 2nd part of the post in particular.

the successor to Dothan/Sonoma will indeed be 65nm dual-core, and be called "Yonah".

"Centrino II" perhaps?

matt9669
01-23-2005, 01:34 PM
"the successor to Dothan/Sonoma will indeed be 65nm dual-core, and be called "Yonah"."

I told you, a little birdie came by and told me! ;)

i found nemo
01-23-2005, 03:34 PM
does this processor adopt a lot from the p3? cuz it sounds like it...which makes my puter proud lol...good job on keepin' us updated

matt9669
01-23-2005, 03:40 PM
It's built on the P3 core - it's more like a P3 that adopts from the P4, so your comp is more closely related than what most of us are running these days! :banana4:

i found nemo
01-24-2005, 11:18 AM
yay...now you all should thank me lol j/p

matt9669
01-25-2005, 05:57 PM
Hey crew - looking for my next possibility to do an article/commentary on - let me know what you're thinking!

BTW, thanks for the great comments guys! :YIPPIE:

matt9669
01-25-2005, 09:51 PM
No thoughts? Come on guys, I know you want to hear my witty elucidation of something relevant to the XS community ;)

p0rl1n
01-27-2005, 09:17 AM
my vote for the next article is smithfield.