Intel Details Nehalem uArch Improvements - 256KB L2, 8MB L3 Confirmed

**KTE** · 03-19-2008, 05:48 AM

K10h is a 12-stage pipeline, 65nm, 283mm², 463M transistor, 23.x FO4 delay design. It was not made for high clocks in any way: AMD intended, as presented at one of the global IEEE 2006 conferences, to reach 2-2.8GHz with Barcelona at its rated supply Vdd. Intel Core 2 is a 21 FO4 depth design AFAIK and Penryn is at ~18 FO4; it is supposed to have been reduced substantially since HKMG integration.

The IBM Power6 is neither the lowest nor the only architecture with a 13 FO4 inversion delay, it just happens to be very well tuned for absolute speed and performance. P3 had a 15 FO4 depth, Willamette P4 8-10 FO4, Alpha 21264 has 15 FO4, and so on. None of those could achieve what IBM did.
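To put the FO4 figures above in perspective, here is a rough sketch of how logic depth per cycle relates to clock rate, using cycle time = FO4 depth × FO4 delay. The function names are my own, and the big assumption (labeled in the code) is that every chip gets the same FO4 delay as the one implied by Power6 at 4.7GHz and 13 FO4. Real processes differ, so treat this as illustration only, not a prediction.

```python
# Illustrative sketch: cycle_time = fo4_depth * t_fo4, so f = 1 / (fo4_depth * t_fo4).
# ASSUMPTION: one shared FO4 delay for all chips, back-derived from the
# Power6 figures quoted in this post (13 FO4, 4.7 GHz). Real processes differ.

def implied_fo4_delay_ps(freq_ghz: float, fo4_depth: float) -> float:
    """FO4 delay in picoseconds implied by a clock rate and an FO4 depth."""
    cycle_ps = 1000.0 / freq_ghz  # 1 GHz corresponds to a 1000 ps cycle
    return cycle_ps / fo4_depth


def clock_ghz(fo4_depth: float, t_fo4_ps: float) -> float:
    """Clock rate in GHz for a given FO4 depth and FO4 delay."""
    return 1000.0 / (fo4_depth * t_fo4_ps)


t_fo4 = implied_fo4_delay_ps(4.7, 13)
print(f"implied FO4 delay: {t_fo4:.2f} ps")

# FO4 depths as quoted in the post (K10h "23.x" taken as 23).
for name, depth in [("K10h", 23), ("Core 2", 21), ("Penryn", 18)]:
    print(f"{name}: {clock_ghz(depth, t_fo4):.2f} GHz at the same FO4 delay")
```

Under that one assumption, the 23 FO4 design lands inside the 2-2.8GHz range AMD targeted, which is the point: the FO4 depth sets the ceiling long before the process does.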

However, I don't think the IBM Power6 bears any relevance to desktop computers. It is a major success for its HPC market and trumped anything any competitor had to offer in 2007, including Harpertown and Itanium 2 Montecito. It's the only CPU to hold all 4 major industry records in one go: transactions, Java, throughput and floating point. It beat Harpertown 3.16GHz, 8 core vs 8 core, in Int too. It was best in SAP, TPC-C OLTP, OASO, SPECjbb2005, Linpack HPC and so on, last I checked in late 2007. For instance in TPC-C:

Bull Escala PL1660R, 16-core IBM Power6 4.7GHz: 1,616,162 tpmC
NEC Express5800/1320Xf, 32-core Intel Dual-Core Itanium 2 9050 1.6GHz: 1,245,516 tpmC
Bull Escala PL1660R, 4-core IBM Power6 4.7GHz: 404,462 tpmC
HP ProLiant ML370G5 X5460 QC, 8-core Intel X5460 3.16GHz: 275,149 tpmC

As you can see, it trumps anything for what it was designed to do.
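A quick way to read those TPC-C numbers is per core. This is a minimal sketch using only the core counts and tpmC scores listed above; the variable and function names are mine, and per-core division ignores system differences (memory, disks, software stack) that matter a lot in TPC-C.

```python
# Per-core TPC-C throughput from the four results quoted above.
# Tuples are (system label, core count, tpmC). Labels are shortened by me.
results = [
    ("Power6 4.7GHz, 16-core", 16, 1_616_162),
    ("Itanium 2 9050 1.6GHz, 32-core", 32, 1_245_516),
    ("Power6 4.7GHz, 4-core", 4, 404_462),
    ("Xeon X5460 3.16GHz, 8-core", 8, 275_149),
]


def tpmc_per_core(cores: int, tpmc: int) -> float:
    """Naive per-core throughput: total tpmC divided by core count."""
    return tpmc / cores


for label, cores, tpmc in results:
    print(f"{label}: {tpmc_per_core(cores, tpmc):,.0f} tpmC per core")
```

On these figures both Power6 systems come out at roughly 101,000 tpmC per core, against under 40,000 for the Itanium 2 and X5460 entries.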

What it does show is that those who usually guess SOI is the only clocking restriction are wrong: if you look at IBM's technical documentation, the IBM Power6 scales to 6.1GHz with low LpolySi tuning on air using SOI at 1.3V Vdd supply. That is far more than anything else out there, including HKMG 45nm CPUs. The official 3.2GHz IBM Power6 is rated for less than 100W TDP at 65nm SOI, a big achievement. The 4.7GHz part is rated for a maximum of 160W TDP with a massive 790M transistors inside a big 341mm² package, a wowzer achievement, especially with the same pipeline, instructions per cycle and latch cycle overhead carried over from 90nm. No other chip from AMD/Intel at 65nm or 45nm can do sub-200W TDP at those specs or temperatures (sub 60C air, with 105C limit). To compare, the Power5+ 1.9GHz is a 200W TDP CPU at 389mm² and 276M transistors. The Intel "Montecito" Itanium 9000 running at 1.6GHz, with a hefty 104W TDP, is less than half as powerful as the IBM Power6, which just shows how brilliant the engineering on Power6 really is. And don't forget the +25W minimum TDP of the memory controller that Power6 carries and Itanium 2 doesn't. Comparing the 65nm Kentsfield 2.67GHz MCM (130W TDP, 286mm², 582M transistor package) to the IBM Power6 4.7GHz at 160W gives us:

Clock: 2.67GHz vs 4.7GHz
TDP: 130W (+35W NB) vs 160W
Transistors: 582M vs 790M
Clock per watt: 20.5MHz/W vs 29.38MHz/W
Power density: 0.455W/mm² vs 0.469W/mm²
Power per million transistors: 0.22W/MilT vs 0.20W/MilT
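For anyone who wants to check the ratios above, here is a small sketch that recomputes them from the clock, TDP, die area and transistor counts given in this post. The `Chip` class and its method names are my own, and the +35W northbridge figure is left out so the numbers match the ones listed.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    """Figures as quoted in the post; TDP excludes the +35W NB for Kentsfield."""
    name: str
    clock_mhz: float
    tdp_w: float
    die_mm2: float
    transistors_m: float  # millions of transistors

    def mhz_per_watt(self) -> float:
        return self.clock_mhz / self.tdp_w

    def watts_per_mm2(self) -> float:
        return self.tdp_w / self.die_mm2

    def watts_per_million_transistors(self) -> float:
        return self.tdp_w / self.transistors_m


kentsfield = Chip("Kentsfield 2.67GHz", 2670, 130, 286, 582)
power6 = Chip("Power6 4.7GHz", 4700, 160, 341, 790)

for chip in (kentsfield, power6):
    print(
        f"{chip.name}: {chip.mhz_per_watt():.2f} MHz/W, "
        f"{chip.watts_per_mm2():.3f} W/mm², "
        f"{chip.watts_per_million_transistors():.2f} W/MilT"
    )
```

Adding the 35W northbridge to the Kentsfield side (165W total) would drop its figure to about 16.2MHz/W, which widens the gap further.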

In all respects it is a far better CPU at 65nm, but it isn't a chip intended for the desktop market, hence comparisons with our market shouldn't be made to judge absolute numerical performance, although electrically, you can.

Originally Posted by IBM

However, on a CPW per watt basis, the Power6-based server is 67 percent more power efficient than the Power5 box.

IBM also provided a comparison with the earlier iSeries Model 870 machine, equipped with 16 of IBM's Power4 cores running at 1.3 GHz, which consumed 6,000 kilowatt-hours for the base CEC. A four-core Power6-based System i 570 does essentially the same amount of work, but burns less than 1/4 of the juice as the Power4-based machine at the CEC level.

Getting the speeds on any architecture is not just about one timing: jitter, skew, latch and wiring delays all add up and can increase delays and lower your clocks and performance per cycle. IBM mainly had to employ very high-speed, low-delay wiring and, above all, Dual Stress Liners within the transistors. Look at the low Vdd it needed for high frequencies: Fmax at 0.9V is 3GHz.

Anyway, as for L3 cache, many have had it, but not more than one Intel SKU in the desktop market before the K10h architectural lineup AFAIR (?). Alpha EV5 was the first that I can recall, with others such as Power4 and later, UltraSPARC IV+, Madison/9M, etc. It mainly helps only when your L2 and L1 are saturated, for high memory access or large matrix array applications such as databases. It was always mainly a server design bonus, hence not featured much on the desktop, but now that seems to be changing and it's led by AMD quite obviously with their MPU+IMC design. Not that Intel didn't know this before AMD, they just couldn't produce a chip below 45nm with it.

Someone also mentioned that the AMD K10h L3 cache is different to the upcoming "Nehalem" (it's not called Nehalem) in that one is inclusive and the other exclusive. This isn't entirely true either: AMD's design is not specifically inclusive or exclusive, but a bit of both:

The L3 cache is specifically designed with data sharing in mind. This entails three particular changes from AMD’s traditional cache hierarchy. First, it is mostly exclusive, but not entirely so. When a line is sent from the L3 cache to an L1D cache, if the cache line is shared, or is likely to be shared, then it will remain in the L3 – leading to duplication which would never happen in a totally exclusive hierarchy

Also the L3 is 20% of the die in K10h.
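To make the inclusive vs exclusive point concrete, here is a rough sketch of how much unique data the L2+L3 levels can hold under each policy. It uses the 256KB L2 and 8MB L3 from the thread title and assumes 4 cores; the function names are mine, L1 is ignored, and the two policies are modeled as strict extremes. A "mostly exclusive" L3 like the one quoted above would land somewhere in between, depending on how many lines are shared.

```python
# Unique L2+L3 capacity under strictly inclusive vs strictly exclusive L3.
# ASSUMPTIONS: 4 cores, private 256 KB L2 per core, shared 8 MB L3
# (sizes from the thread title); L1 ignored; policies idealized.

def unique_capacity_inclusive_kb(l3_kb: int, l2_kb: int, cores: int) -> int:
    """Inclusive L3 duplicates every L2 line, so unique data is bounded by L3."""
    return l3_kb


def unique_capacity_exclusive_kb(l3_kb: int, l2_kb: int, cores: int) -> int:
    """Exclusive L3 never duplicates L2 lines, so the capacities add up."""
    return l3_kb + cores * l2_kb


L3_KB, L2_KB, CORES = 8 * 1024, 256, 4

inclusive = unique_capacity_inclusive_kb(L3_KB, L2_KB, CORES)
exclusive = unique_capacity_exclusive_kb(L3_KB, L2_KB, CORES)
duplicated = exclusive - inclusive

print(f"inclusive: {inclusive} KB unique")
print(f"exclusive: {exclusive} KB unique")
print(f"L3 space spent on L2 copies when inclusive: {duplicated} KB")
```

With a small L2 and a large L3 the duplication cost of inclusion is modest (1MB out of 8MB here), which is one reason an inclusive policy gets more attractive as the L3 grows relative to the L2.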

Well, the additions for Nehalem are good on paper, but fingers crossed, as Native+IMC is too difficult to have running as a design without problems, esp. at your first go. 45nm HKMG helps a lot but not as much as 32nm would. I'm fearing the prices on these, as clocks are far harder to get, yields much lower and defect rates very high, and hence price is where it'll bite us in the hind, unless AMD has something Intel fears by 13th October '08.

*Cache arrangement is nearly exactly the same as AMD K10h, no doubt, though Intel chose to keep it mainly inclusive. The L3 cache, by nature of redundant access, is slower than L2 and L1 but far quicker than RAM access. That large L3 will only help with large matrix applications, mainly in video/imaging/large gaming/server apps, but beyond 8 or so MB of data access they might have major performance scaling issues: when all caches are full with the same replicated data, the latency will build. That's the problem with keeping it inclusive, they need speed and very low latency for it.
*A built-in IMC limits the current delta between IMC and core and holds back overclocking/speeds. Each MB PWM design now has to provide separate power for the IMC and not just the separate cores. Same for VMods.
*IMC also increases power/TDP a lot, especially with triple channel memory support. You can add 30-60W of minimum to maximum power here at just 2.0-2.8GHz clocks, maybe even more with SMT and QPI support being internal. Maximum theoretical AC and DC power consumption becomes much higher through individual latch testing.
*Triple channel memory is essentially needed, and I reckon it's a clever move, because a Native+IMC design suffers from low real bandwidth, and worse so for write/copy bandwidth than read. Individual DRAM access by each core is the best way to go; it should improve, or at least keep level, write/copy bandwidth and improve read bandwidth over current Penryn. Just having IMC+3 controllers doesn't guarantee this at all though.
*4 vs 3 instructions executed at a time means it will obviously be quicker than Penryn per clock, unless something down the line holds it back, latencies being my fear. Hoping there will be major improvements here, especially with SSE 4.2 instruction updates once they're supported in software.
*I don't like the sound of keeping a small L2 and large L3, this is more a server segment design win. L2 will be far quicker than L3, but slower than L1; the Native+IMC approach requires a large L2 for speed in smaller apps and a large L3 for speed in larger apps, but suffers from too little L2 in the smaller desktop apps. Apps like SuperPi will probably see a big hit with this, not just 1M, even 32M, although not as big a hit as the exclusive L3 on AMD K10h causes.
*Unfortunately, Native+IMC also means lower bins, lower clocks, lower overclocks, higher defect rates, higher TDPs and a good chance of cold bugs and low clocks held back by the IMC, especially if they are in sync. You have to realize that binning and chip sorting is more than three times as difficult with 4 cores+IMC in one package. I hope they make the IMC run at a separate clock and PLL to the cores, fully adjustable, or IMC/MEM oc will also be poorer and very hard compared to modern Core 2 oc, which is easy. They have only quantified DDR3 800-1333 support, which gives me the shivers about these clock restrictions: since Nehalem is set for production in Q4 '08, I would've expected them to add DDR3 1600 support by that time, unless something is holding things back here.
*IMC clocking depends on the delta between IMC and MEM gate currents and volts, so this could be a very tricky area to get working with DDR3 1.5V unless the IMC gate voltages are at the required deltas.
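On the triple channel and DDR3 800-1333 points above, here is a rough sketch of theoretical peak bandwidth. It assumes the standard 64-bit (8 byte) DDR3 channel and takes the DDR3 speed grade as mega-transfers per second; the function name is mine. These are paper peaks only, and real read/write/copy bandwidth will be lower, as noted above.

```python
# Theoretical peak DRAM bandwidth = transfers/s * bytes per transfer * channels.
# ASSUMPTIONS: 64-bit (8-byte) DDR3 channels; speed grade taken as MT/s.
BYTES_PER_TRANSFER = 8


def peak_bandwidth_gb_s(mt_per_s: int, channels: int) -> float:
    """Peak bandwidth in GB/s (1 GB = 10^9 bytes) for a DDR3 speed grade."""
    return mt_per_s * BYTES_PER_TRANSFER * channels / 1000.0


for grade in (800, 1066, 1333, 1600):
    dual = peak_bandwidth_gb_s(grade, 2)
    triple = peak_bandwidth_gb_s(grade, 3)
    print(f"DDR3-{grade}: dual {dual:.1f} GB/s, triple {triple:.1f} GB/s")
```

On paper, triple channel DDR3-1333 (about 32GB/s) already exceeds dual channel DDR3-1600 (25.6GB/s), so the third channel buys more peak bandwidth than the missing speed grade would.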

Can't wait to see it in action, it is a revolutionary design for Intel, completely different to their previous CPU designs: a new architecture very clearly. They've chosen the same desktop architectural design as AMD now, to compete. It's the right way to go, but bringing SMT aka HyperThreading back again is not a good idea unless the single threaded Front End performance is weak or the clocks are lower than Penryn. That isn't a good sign for multi-threaded performance and clock speeds. We know software developers at Intel have been ringing developer ears since before Penryn about how poor multi-core parallelism is everywhere on the desktop and home market (videos on their site showing major lost perf. scaling going from 4 core to 6 core), but let's hope they've improved this through the cache data fetch and eviction algorithms; the larger TLB and BTB can do this. This is mainly where SMT will help most.

As for people fighting cases of this firm vs that, you're all wrong, as all electrical designs and knowledge of anything and everything is mostly copied and passed on. It's not called copying though, it's called sharing. How you teach your child, how you know anything about computers, is mostly through reading or being told, which again is sharing by some male/female somewhere. If I draw a picture by watching a video of someone drawing it, that doesn't mean I copied, or that it isn't an achievement if I made it good or better. And FWIW, K10h featured many improvements which were exactly what Core 2 received from Pentium M.

I hope they do launch sub-120W TDP 2.8GHz quad-core Nehalem CPUs by October. Would be a major achievement to pull off with plus 1.1x Penryn performance per clock, not sure how many recognize ardently how difficult it is to produce especially at the same fabrication node as your current SKU lineups. Monstrously difficult job, go visit a fab and you'll realize much better. Just had a little time to sit down today.
