I want one! Goodbye AMD, forever.
Crunch with us, the XS WCG team
The XS WCG team needs your support.
A good project with good goals.
Come join us, get that warm fuzzy feeling that you've done something good for mankind.
The L3 still holds 8MB of data for quick access, which gets pulled into L2, then L1 when needed. The L3 is only there to provide a quick access point for the L2 to grab data from, and the L2 has access to a pool of 8MB worth of data - when the L2 cache uses data from the L3, that's a successful hit.
Regardless of the fact that data may be stored simultaneously in the L2 and L3, each L2 has access to 8MB of L3 (and if the data isn't thread-dependent you could have all cores using the same data).
My point was about the data updates in the caches and coherent state snooping, but yes -- 8MB is 8MB, however you look at it. I just stressed the inclusive nature in that case and what it contributes to the threading overall.
A simple pointer-chasing graph would tell us enough about the whole picture, anyway!
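If anyone wants to draw that graph themselves, here's a rough pointer-chasing sketch (my own code, not from this thread): it walks a randomly permuted ring of pointers of a given footprint, so every load depends on the previous one and the ns/load figure tracks whichever cache level the footprint fits in. Build with gcc -O2 -std=gnu99; the 256KB and 8MB breakpoints you'd expect to see are just the Nehalem figures discussed above.

```c
/* Hypothetical sketch: measure average dependent-load latency vs. footprint. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define HOPS (1L << 24)   /* number of dependent loads to time per footprint */

static double chase(size_t bytes)
{
    size_t n = bytes / sizeof(void *);
    void **ring = malloc(n * sizeof(void *));
    size_t *perm = malloc(n * sizeof(size_t));

    for (size_t i = 0; i < n; i++)
        perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)                /* link the cells into one cycle */
        ring[perm[i]] = &ring[perm[(i + 1) % n]];

    void **p = &ring[perm[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < HOPS; i++)
        p = (void **)*p;                          /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    free(perm);
    free(ring);
    return p ? ns / HOPS : 0.0;                   /* use p so the loop isn't optimized out */
}

int main(void)
{
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 2)
        printf("%6zu KB footprint: %.2f ns/load\n", kb, chase(kb * 1024));
    return 0;
}
```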
But increasing the L2 size wouldn't negatively affect the total L3 - even if you had more data duplicated, you'd still have the same total L3 capacity. The L2 wasn't restricted to 256KB just to reduce duplication (in such a large design another 1MB of cache isn't that much), but rather to keep the L2 as fast as possible.
Nehalem's L3 will be about as fast as current L2 caches, and the L2 will be faster than current L2 caches.
Intel could have gone for 512KB of L2 per core, but this would have meant a slower L2, which would have more than negated the increased capacity. If Intel can maintain the speed and increase the capacity of the L2, I'm willing to bet they will on subsequent generations.
Desktop boards will probably have 4 DIMM slots to maintain the basic ATX design - 3 of them interleaved and 1 just extra. Since you no longer have to balance the northbridge's trace lengths to all 3 other components on the mobo (CPU/mem/SB), you're left with a bit more freedom on layout, so the more creative with PCB real estate will probably be able to cram 6 slots onto a board so they can fill 2 DIMMs per channel.
Main-- i7-980x @ 4.5GHZ | Asus P6X58D-E | HD5850 @ 950core 1250mem | 2x160GB intel x25-m G2's |
Wife-- i7-860 @ 3.5GHz | Gigabyte P55M-UD4 | HD5770 | 80GB Intel x25-m |
HTPC1-- Q9450 | Asus P5E-VM | HD3450 | 1TB storage
HTPC2-- QX9750 | Asus P5E-VM | 1TB storage |
Car-- T7400 | Kontron mini-ITX board | 80GB Intel x25-m | Azunetech X-meridian for sound |
Then why does Intel bother to propagate the inclusive design here?
From your statements there is no consequential difference between an inclusive and an exclusive relationship, or is it just me not getting your point?
By the way, Intel has long had SRAM cells fast enough for much bigger arrays than this skinny 256K one.
Itanic is a long-instruction-word architecture, so the caching organization is subordinate to the rather weird specifics of its EPIC design.
Anyway, I think the L2 in Nehalem is there - by design - to counter the shared L3, not as a decisive performance-critical part of the architecture...
And by all means the L3 here, being so closely coupled, should be way (relatively) faster than AMD's K10 implementation and the Dunnington one, too.
Originally Posted by Movieman
Posted by duploxxx
I am sure JF is relaxed and smiling these days with their intended launch schedule. SNB Xeon servers on the other hand....
Posted by gallag
there you go bringing intel into an amd thread again lol, if that was someone dropping a dig at amd you would be crying like a girl.
qft!
Not only latency - as was shown in their slides, Intel says its L2 is smarter as well. If it is smarter, less size is needed, right?
It also goes back to something Intel learned from the small, very fast L1 and L2 used with the P4. They didn't want to repeat the "Prescott" mistake that added 17% more L2 latency. No matter what's said in this forum, not everything about NetBurst sucked.
Speaking of L3, Shintai,
Just teasing you, man!
------------------------------------------------------
Here are some interesting comments from knowledgeable people (mainly on the Faster Synchronization Primitives, which look like a nice feature):
http://realworldtech.com/forums/inde...88380&roomid=2
>If using the lock prefix is a legacy operation what are
>the modern ones?
Linus Torvalds:
I don't think there are any - I think they just meant that
they made the old legacy instructions run faster, instead
of trying to introduce anything new.
Which I really look forward to testing. The serialization
overhead of Core 2 is better than many other processors,
but everything else is so good that it still stands out
like a sore thumb. We have lots of kernel loads where one
of the biggest costs is just locking (even without any
nasty contention and cacheline ping-pong), because of how
it serializes the pipeline.
Now that people are trying to push more and more multi-
threaded programming paradigms, the locking is finally
getting some real exposure. It's always been a big issue
in kernels, but now all the fast user-level locking is
making it show up in "normal" loads too.
--------------------------------------
That's something I'm also looking forward to. Even without contention, acquiring locks is *painful*. Unless the data/code you're protecting takes a significant amount of time to process/execute, you'll be bitten by the sheer cost of the lock/unlock pairs, so there's room for *lots* of improvement there.
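Just to put a rough number on that, here's a minimal sketch (mine, not from the thread) that times an uncontended pthread_mutex lock/unlock pair around a counter bump versus the bare increment. Build with gcc -O2 -pthread; the absolute numbers are machine-dependent, but the gap is the point.

```c
/* Hypothetical sketch: cost of an uncontended lock/unlock pair vs. a bare increment. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 50000000L

static volatile long counter;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

static double run(int locked)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++) {
        if (locked) pthread_mutex_lock(&mtx);     /* only this thread ever takes it */
        counter++;
        if (locked) pthread_mutex_unlock(&mtx);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / ITERS;
}

int main(void)
{
    printf("plain increment  : %.2f ns/iter\n", run(0));
    printf("locked increment : %.2f ns/iter\n", run(1));
    return 0;
}
```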
----------------------------------------
+1
It's not uncommon for Java workloads to waste 10% or more of the time processing uncontended locks, and I've seen up to ~30% in real-world apps(1).
The underlying reason is that many critical parts of the core Java library are synchronized (StringBuffer, Hashtable, many I/O functions). While there are newer APIs that avoid this (StringBuilder, HashMap, etc.), there is lots of legacy code that uses the old APIs directly or indirectly.
JVMs use optimization tricks to avoid this (lock removal, lock elision, lazy unlocking, etc.), but that only serves to alleviate the problem and doesn't entirely resolve it.
-- Henrik
(1) Measured as the increase in throughput when locks are forcefully disabled in JRockit (using -XXlazyunlocking or just hacking the JVM to not issue CAS instructions). The 30% number comes from a JSP-heavy app I ran into some time back. SPECjbb2005 gains ~10% by the use of -XXlazyunlocking.
Faster Synchronization Primitives: As multi-threaded software becomes more prevalent, the need to synchronize threads is also becoming more common. Next generation Intel microarchitecture (Nehalem) speeds up the common legacy synchronization primitives (such as instructions with a LOCK prefix or the XCHG instruction) so that existing threaded software will see a performance boost.
That's actually the part that I like the most. Better overall IPC is a very nice thing, but lowering the cost of the synchronization primitives is much more interesting. It enables parallelization of 'harder' workloads which aren't really suited to parallelization and reap lower benefits because of the synchronization overhead.
http://aceshardware.freeforums.org/n...ting-t423.html
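As a reference point, here's a tiny sketch (my own, not from Intel's text) of the kind of "legacy" primitives the quote is talking about: on x86, GCC's __sync builtins compile down to LOCK-prefixed instructions and to XCHG (which is implicitly locked) - exactly the operations Nehalem is claimed to speed up.

```c
/* Hypothetical sketch: the legacy atomic primitives (LOCK-prefixed ops and XCHG)
 * as emitted by GCC's __sync builtins on x86. */
#include <stdio.h>

int main(void)
{
    int value = 0;

    /* lock xadd: atomic fetch-and-add */
    int old = __sync_fetch_and_add(&value, 1);

    /* lock cmpxchg: compare-and-swap, the core of most lock implementations */
    int swapped = __sync_bool_compare_and_swap(&value, 1, 42);

    /* xchg (implicitly locked): unconditional exchange, typical spinlock acquire */
    int prev = __sync_lock_test_and_set(&value, 7);

    printf("old=%d swapped=%d prev=%d value=%d\n", old, swapped, prev, value);
    return 0;
}
```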
Interesting.. Let's hope all these <on paper> enhancements and buzz turn out to be real.. If the claim that Nehalem > Core 2 by more than Core 2 > P4 holds true, then it's going to be really insane.
Faceman
Looks like Intel is ripping off AMD to me!
SuperMicro X8SAX
Xeon 5620
12GB - Crucial ECC DDR3 1333
Intel 520 180GB Cherryville
Areca 1231ML ~ 2~ 250GB Seagate ES.2 ~ Raid 0 ~ 4~ Hitachi 5K3000 2TB ~ Raid 6 ~
Yes, but they dumped the idea. You do realize that the original Nehalem was NetBurst on steroids, right? That's why some of the features, like Hyper-Threading, are coming back with Nehalem, as the same division made both.
If it wasn't for the K8 being so successful, there would have been no Conroe, just a beefier NetBurst. On top of that, Intel has admitted they like the K10 design, but that it's near impossible to produce it properly on a 65nm process. Now I'm not saying AMD hasn't done the same - look at their original products, they were just Intel parts with their name on them - but that doesn't mean Intel didn't use some of the K10 design in Nehalem.
Picks self up from the floor from laughing so hard. Or did you mean that as a joke?
Intel® Pentium® 4 processor Extreme Edition 3.20 GHz supporting Hyper-Threading Technology, with an additional 2 Megabytes of L3 cache.
So if Intel uses it, stops using it, AMD copies Intel and then Intel returns to their original idea, it is Intel copying AMD, LOL!
Originally Posted by xlink
Sorry for being so slow to respond.
The cache doesn't look to be universal - each 256KB is dedicated to one core. The slide even says "per core." What's the orange in between the L2 and the L1-Data? Is that what you've called the L1.5?
Also - I'm assuming it starts off as a quad, so the *8 is only accurate for servers.
I can't see Nehalem having more cache to play with than Penryn for single-threaded apps, depending on how the L3 is used.
But that's a two-way street. If there wasn't a P3 replacing the P2s there would have been an Athlon. Each company pushes all of its competitors to get better or die. It's way too soon to write off AMD, but to pretend they're not getting pimp slapped right now is worse. There's nothing in K10 Intel wanted to copy.
Regarding the P4 (NetBurst) - in those times the L2 cache was an essential factor for the performance of that architecture because of one simple fact: the P4 doesn't actually have an L1 cache for instructions (macro-ops in Intel's parlance), but the notorious trace cache, storing the already decoded µOps (which added eight stages to the already long pipeline). That meant loading the macro-op cache lines directly from the... yes, you guessed it - the L2.
Hehe, I am a bit surprised. But I think the L2s are more like an L1.5 - extremely fast, faster than we've ever seen before with L2s. And an L3 with the speed of Core 2's L2s.
I guess the L2 will be around some 5-6 cycles, and the L3 under 15 cycles.
But it very much mimics Itanium's cache design. And maybe it's an underlying requirement for effective SMT.
Crunching for Comrades and the Common good of the People.