Intel Details Nehalem uArch Improvements - 256KB L2, 8MB L3 Confirmed

**Donnie27** · 03-18-2008, 10:23 AM

Originally Posted by DeathReborn

I do believe Intel got the L3 idea from the DEC Alpha EV-5 21164. It may well have been first used before that but not by Intel/AMD.

I do too, and I agree.

I've been jumped on here for saying Intel and AMD borrow heavily from DEC/Alpha. Intel had IMC before Alpha and AMD. There'd be no Athlons without Alpha's EV6. Hell, even Timna was the forerunner of Fusion, but tell that to some folks here?

Slipped and Skipped was the line about RDIMM or Rambus, hehehehe! The return of the, "Aw, he didn't say that" LOL!

**savantu** · 03-18-2008, 12:20 PM

Originally Posted by GoThr3k

where do you get this from?
if true, that really would be impressive

The 12MB L3 on an Itanium 2 Montecito has 14 cycles latency. The L2 has 5 for Int and 7 for FP.

Core 2 has 14 cycles for L2; K8 has 12 for L2; K10 L2 is 15, L3 is 30 to 45.

**Shadowmage** · 03-18-2008, 01:54 PM

Originally Posted by xlink

if anything they'd go UP.

Core CPU
64k fast cache
6mb medium cache

k8 CPU
128k fast cache
2mb medium speed cache

k10 cpu
k8 CPU
128k fast cache
1mb medium speed cache
3mb SLOW cache

nehalem
64k fast cache
2mb somewhat fast cache
8mb medium cache

What you said is completely incorrect.

Nehalem's L2 and L3 speeds are comparable to Barcelona.

Also, K10 has the following:

128KB L1 (fast)
512KB L2 (medium)
3MB L3 (slow)

Nehalem has:

64KB L1 (fast)
256KB L2 (medium)
8MB L3 (slow)

Also, I think Shintai owes everyone a lot of money from his 100 euro bet :p

**Shadowmage** · 03-18-2008, 01:56 PM

Originally Posted by Shintai

Hehe, I am a bit surprised. But I think the L2s are more like an L1.5, extremely fast and faster than we have ever seen before with L2s. And an L3 with the speed of Core 2 L2s.

I guess the L2 will be around some 5-6 cycles. And the L3 under 15 cycles.

But it very much mimics Itanium's cache design. And maybe an underlying requirement for effective SMT.

That's not possible. The L2's latency is going to be 10-15, just like Barcelona. L3 will be 20-40, just like Barcelona. If you have an L2 latency of around 5-6 cycles, then there is no point in having an L1.

Also, I think that Shintai owes everyone an apology for calling people ugly names when they ended up being correct :p Now what happened to my sig...

**Bail_w** · 03-18-2008, 02:01 PM

OMG, this is 100% pure geek :banana::banana::banana::banana:...

**LowRun** · 03-18-2008, 02:10 PM

Originally Posted by Shintai

Hehe, I am a bit surprised. But I think the L2s are more like an L1.5, extremely fast and faster than we have ever seen before with L2s. And an L3 with the speed of Core 2 L2s.

I guess the L2 will be around some 5-6 cycles. And the L3 under 15 cycles.

But it very much mimics Itanium's cache design. And maybe an underlying requirement for effective SMT.

This is what Franck (CPU-Z author) has to say about Nehalem L3 cache speed. I'll take his word.

Originally Posted by cpuz

Hey guys,

Concerning caches on Nehalem: the L3 is now shared between 4 physical cores, meaning that it offers 4 access ports. The more access ports a cache has, the slower it is. Consequently, it is not surprising that Intel added four small, fast, dedicated (and unified) L2s between the L1s and the L3. These caches keep using an inclusive relationship, so of course this means that the useful size of these L2s is only 128KB. However, those caches are not designed for high hit rates but for speed.
CPU-Z is wrong on L1 Data size however, they should be 4x32 KB and not 4x16KB. And I don't know about FSB.

**Shadowmage** · 03-18-2008, 02:25 PM

Originally Posted by LowRun

This is what Franck (CPU-Z author) has to say about Nehalem L3 cache speed. I'll take his word.

Sure, it'll be faster than the current Penryn cache, but not by that much. There's no way it'll be under 10. My guess is a 12 cycle latency.

**LowRun** · 03-18-2008, 03:33 PM

Originally Posted by Shadowmage

Sure, it'll be faster than the current Penryn cache, but not by that much. There's no way it'll be under 10. My guess is a 12 cycle latency.

I have no idea but from what Franck said the 4 access ports shouldn't help to make it faster.

**Hornet331** · 03-18-2008, 03:49 PM

Originally Posted by Shadowmage

Sure, it'll be faster than the current Penryn cache, but not by that much. There's no way it'll be under 10. My guess is a 12 cycle latency.

well, if you don't call 12 cycles fast for a 3rd level cache then I don't know what you define as fast.

**BrowncoatGR** · 03-18-2008, 09:53 PM

Franck was implying that the L3 will be slower than the current L2. I think it's unlikely to be below 16 cycles, possibly even higher.

**hollo** · 03-18-2008, 10:08 PM

Originally Posted by Hornet331

Originally Posted by Shadowmage

Sure, it'll be faster than the current Penryn cache, but not by that much. There's no way it'll be under 10. My guess is a 12 cycle latency.

well, if you don't call 12 cycles fast for a 3rd level cache then I don't know what you define as fast.

i'd guess from :

Originally Posted by Shadowmage

That's not possible. The L2's latency is going to be 10-15, just like Barcelona. L3 will be 20-40, just like Barcelona. If you have an L2 latency of around 5-6 cycles, then there is no point in having an L1.

that he was talking about l2

**JumpingJack** · 03-18-2008, 10:10 PM

Originally Posted by Brother Esau

Looks like Intel is ripping off AMD to me

How do you figure?

**JumpingJack** · 03-18-2008, 10:23 PM

Originally Posted by AliG

I wouldn't say that. AMD can't get their IMC to clock well, but look at IBM's POWER6 monster: it is manufactured on a 65nm SOI process with an IMC and yet supposedly scales to 4.5GHz+ on air (though I haven't heard anything on the temps). As for hyperthreading, that failed previously because of the poor NetBurst design; the concept of it is quite good. Once multithreaded software becomes more common, you'll see the benefits, not to mention that the much shorter pipeline to transfer data will help hyperthreading's usefulness.

IBM also leaks like a sieve... the 4.5 GHz on their 65 nm process was achieved by manipulating both architecture and process conditions to produce high clocks. This is because IBM moved away from an OoO engine to a more in-order one, and simplified the engine to minimize the deepest FO4 delay.

http://www.research.ibm.com/journal/rd51-6.html (everything you want to know)

The most important article is this one:
http://www.research.ibm.com/journal/rd/516/curran.pdf

Various frequency/cycle-time targets were evaluated
during an exploratory phase. A cycle time corresponding
to 13-FO4 inverter delays was selected based on the
fastest known techniques to achieve back-to-back
execution of 64-byte dependent, fixed-point instructions.

IBM restricted themselves to a cycle time of only 13 FO4 delays for the fixed point latency, which is pretty short all things considered... but it also means your circuits must be very, very simple (transistor lean). Table 1 shows the FO4 delay reduction from POWER5 to POWER6, for both simple fixed point and fused multiply-add. IBM went in with the goal of achieving high clock speeds, and achieved it through this and process:

The POWER6 processor chip is fabricated using the IBM
high-performance 65-nm partially depleted SOI process
with 40-nm gate length n-FETs, 35-nm gate length
p-FETs, and 1.05-nm gate oxides

This gate oxide is about 0.2-0.25 nm thinner than either AMD's or Intel's at 65 nm (their reported thicknesses were 1.3 nm and 1.25 nm respectively, as I recall). Translation: IBM's POWER6 is a power sucker.

http://www.research.ibm.com/journal/rd/516/berridge.pdf

Figure 12 shows their leakage curve, following an exponential you would expect for tunneling current in such a thin gate. At nominal operating conditions for a 4.5 GHz processor which is about 8.5 ps, their leakage just through the gate is about 80 Watts.

This is doable for the market that Power6 is designed for, which are high class enterprise systems where cooling solutions can be specifically designed and, if throughput is high enough, the higher power can be justified.

IBM's design and the process tweaks they made to get there are a very special application, and completely inappropriate for the markets AMD or Intel serve... extrapolating or implying that AMD could do 4.5 GHz because IBM can do 4.5 GHz is simply incorrect, and anyone counting on that should not hold their breath... it just ain't gonna happen.
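
For a sense of scale, the 13 FO4 budget from the quoted paper can be turned into rough numbers. This is only back-of-envelope arithmetic, assuming the whole clock period is spent in FO4-equivalent delay (it ignores latch overhead and clock skew); the variable names are just for illustration.

```python
# Back-of-envelope check of a 13 FO4 cycle budget at 4.5 GHz.
# Assumption: the entire clock period counts as FO4-equivalent delay.
FREQ_GHZ = 4.5        # clock speed quoted in the thread
FO4_PER_CYCLE = 13    # cycle-time target from the quoted IBM paper

cycle_ps = 1000.0 / FREQ_GHZ        # one clock period in picoseconds
fo4_ps = cycle_ps / FO4_PER_CYCLE   # implied delay of a single FO4 stage

print(f"cycle time at {FREQ_GHZ} GHz: {cycle_ps:.1f} ps")
print(f"implied delay per FO4: {fo4_ps:.1f} ps")
```

Roughly 222 ps per cycle split 13 ways leaves about 17 ps per FO4-equivalent stage, which is why the logic in each pipeline stage has to be so lean.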

Jack

**JumpingJack** · 03-18-2008, 10:44 PM

Originally Posted by Shadowmage

What you said is completely incorrect.

Nehalem's L2 and L3 speeds are comparable to Barcelona.

Also, K10 has the following:

128KB L1 (fast)
512KB L2 (medium)
3MB L3 (slow)

Nehalem has:

64KB L1 (fast)
256KB L2 (medium)
8MB L3 (slow)

Also, I think Shintai owes everyone a lot of money from his 100 euro bet :p

We don't know this detail yet... if Intel does not put the L3 on its own clock domain, Intel's L3 cache will clock at core frequency, significantly better than AMD's 1800/2000 MHz L3. In other words, it could be faster or it could be slower... we won't know until we have the CPUs in the wild and people actually measure it.
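
To illustrate why the clock domain matters as much as the cycle count, here is a minimal sketch that converts cycles to nanoseconds. The cycle counts and the 3.0 GHz core clock are made-up illustrative values, not measurements of any real chip; only the 2.0 GHz figure echoes the AMD L3 clock mentioned above.

```python
def latency_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds at a given clock."""
    return cycles / clock_ghz

# Hypothetical cases: the cycle counts and the 3.0 GHz clock are guesses.
cases = {
    "L3 at 3.0 GHz core clock, 40 cycles": latency_ns(40, 3.0),
    "L3 at 2.0 GHz separate clock, 30 cycles": latency_ns(30, 2.0),
}

for name, ns in cases.items():
    print(f"{name}: {ns:.2f} ns")
```

With these assumed numbers the cache that needs more cycles is still quicker in wall-clock time, which is why cycle counts alone do not settle the argument.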

**gojirasan** · 03-18-2008, 11:07 PM

i might even go with the extreme processor to start as well!!!

If you plan to overclock you may have no choice. Bye bye FSB. Hello multiplier locking. I predict this is going to add a huge amount of value to their 'Extreme' chips. Intel tends to leave a lot of overclocking headroom in their chips. Makes me happy that I own my single share of Intel stock.

Since they may be finally closing off the FSB overclocking loophole I just wish they would include two versions of the Extreme. One at the highest bin for whatever they want to charge. $1099 or something like that. And then a prosumer version with a lower bin but still with an unlocked multiplier for $699 or so. The bleeding edge enthusiasts with deep pockets would still get the high end chip, but the overclockers without so much money might be willing to spend a bit more than usual for the ability to overclock this monster of a chip. I know I would depending on how high it clocked over stock. But no matter how high it clocked it would be difficult to justify spending over $1000 on a cpu.

**LordEC911** · 03-18-2008, 11:12 PM

Originally Posted by Shadowmage

Also, I think Shintai owes everyone a lot of money from his 100 euro bet :p

Jeez... everyone kept saying how dumb I am... but yet I turn out to be right...
Hmmmm.... makes you wonder.

Edit- Sadly, it seems like Enjoy was the only one that took the bet, though it sorta seemed like you wanted to as well.
I personally feel like we should hold him to it.

**JumpingJack** · 03-18-2008, 11:12 PM

Originally Posted by gojirasan

If you plan to overclock you may have no choice. Bye bye FSB. Hello multiplier locking. I predict this is going to add a huge amount of value to their 'Extreme' chips. Intel tends to leave a lot of overclocking headroom in their chips. Makes me happy that I own my single share of Intel stock.

AMD CPUs don't have a FSB, how do those get overclocked?

**nosboost300** · 03-18-2008, 11:21 PM

through HTT

**Cronos** · 03-18-2008, 11:50 PM

To be more precise, through HTT base frequency, from which all other frequencies are derived.

The main limiting factor in Nehalem overclocking may be the same as its main advantage over the Core 2 arch: the triple channel IMC. The increased number of wires required for a 3-channel IMC may negatively affect overclocking headroom.

Let's wait and see. This should be the most exciting upgrade for me since I built my dual dual-core Opteron machine 3 years ago (I was not really impressed by my 8-core Intel Xeon build).

**savantu** · 03-19-2008, 12:02 AM

Originally Posted by Shadowmage

What you said is completely incorrect.

Nehalem's L2 and L3 speeds are comparable to Barcelona.

And you know that how? Intel always had faster caches than AMD. Did they lose all that know-how overnight?

**xlink** · 03-19-2008, 01:19 AM

Originally Posted by Shadowmage

What you said is completely incorrect.

Nehalem's L2 and L3 speeds are comparable to Barcelona.

Also, K10 has the following:

128KB L1 (fast)
512KB L2 (medium)
3MB L3 (slow)

Nehalem has:

64KB L1 (fast)
256KB L2 (medium)
8MB L3 (slow)

Also, I think Shintai owes everyone a lot of money from his 100 euro bet :p

here's the thing: more likely than not it will not have any slow L3 cache. The L3 cache will be comparable to today's L2 cache in terms of speed.

and don't say that's impossible, because they've got dang low latency cache on Itanium despite it having A TON of cache, so the manufacturing tech is obviously there.

watch as the L2 is around 6-15 cycles and the L3 is around 10-20 cycles, AND the core is really high clocking, making overall latency even lower than today's Core uArch.

this is NetBurst on steroids. It's wider, more accurate, and emphasizes width over length, unlike the original. It's the return of the 20 stage pipeline, and I think that this time the world is ready for it, unlike the last. Just based on the fact that it's using DDR3 on a 192 bit bus (three 64 bit channels), something tells me it'll be very bandwidth hungry. VERY. Why the heck do you think they redesigned the cache architecture they'd been using for the past decade?
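
As a rough idea of what three DDR3 channels could deliver, here is a small sketch of theoretical peak bandwidth. Three standard 64-bit channels add up to 192 bits. The speed grades below are assumptions picked for illustration (the thread does not say which DDR3 speed Nehalem will ship with), and real sustained bandwidth is always lower than this peak.

```python
CHANNELS = 3            # triple channel IMC discussed in the thread
BITS_PER_CHANNEL = 64   # standard DDR3 channel width

def peak_bandwidth_gbs(mega_transfers_per_s: float,
                       channels: int = CHANNELS,
                       bits: int = BITS_PER_CHANNEL) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = channels * bits // 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9

# Assumed speed grades, in nominal mega-transfers per second.
for grade in (1066, 1333):
    print(f"DDR3-{grade} x{CHANNELS}: {peak_bandwidth_gbs(grade):.1f} GB/s")
```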

**duploxxx** · 03-19-2008, 01:48 AM

Originally Posted by JumpingJack

We don't know this detail yet... if Intel does not put the L3 on its own clock domain, Intel's L3 cache will clock at core frequency, significantly better than AMD's 1800/2000 MHz L3. In other words, it could be faster or it could be slower... we won't know until we have the CPUs in the wild and people actually measure it.

that was the original Barcelona L3 speed. By the time you see Nehalem you will also see Shanghai with several redesigns, including this L3 speed, and you just have to check the AMD forum to see what performance difference this makes on Phenom CPUs...

**cpuz** · 03-19-2008, 01:53 AM

Originally Posted by LowRun

I have no idea but from what Franck said the 4 access ports shouldn't help to make it faster.

Hey guys,
meanwhile I learned more about the Nehalem caches.
Unlike what I previously stated, the L3 does not offer 4 access ports, but only one. This is also what explains the presence of these L2s. Here is what I understood:

When several cores share a cache level, this cache has to answer them as fast as possible, so that the cores do not spend too much time waiting. Two methods exist to reduce latencies:

- increase the number of access ports. This was my first thought, since this is the best solution on paper. However, in practice, this drastically increases the complexity, and going from 1 to 4 ports can multiply the cache area by 2 or 3. So this is not possible at the moment.

- use a banked access method, a little bit like what is done for DRAMs. This allows the cache to be accessed by different threads at the same time (under certain conditions, exactly like DRAM technology), however bank accesses result in a large performance drop. Considering that an 8 MB L3 is already slow due to its size, this is not a good solution either.

So, Intel chose to reduce the number of accesses to this shared cache. This is what the small L2s are aimed at. These L2s are small, and due to the inclusive relationship with the L1s, the effective size can be as low as 192 KB (256 KB minus the 64 KB of L1, and not 128 KB as I previously said). With such a size, the hit rate cannot be very high (see the Celeron), but this is not very important. Let's say the hit rate is only 50% (that is a pessimistic assumption); that means that half of the core requests are handled by the L2. So, in the worst case of 4 requests at the same time, only two reach the L3. Exactly the same as what currently happens on the Core 2 Duo.
Moreover, the 50% of the requests handled by the L2 are served much faster than if they were handled by the L3. So, the overall cache hierarchy efficiency is even better.

There are some drawbacks however:
- SMT results in 8 possible simultaneous accesses, and not 4.
- power dissipation is increased. Adding 1 MB (4x256 KB) to the 8 MB L3 results in a 1/8 = 12.5% dissipation increase. For that reason, it is possible that the L3 uses different voltage/clock planes, but I have not had that confirmed yet.
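
The arithmetic in this post can be written out as a small model. This is only a sketch of the reasoning above, not a description of how Nehalem actually behaves: the 50% hit rate is the pessimistic guess from the post, the latency figures are placeholders, the 64 KB L1 assumes 32 KB instruction plus 32 KB data per core, and the function names are hypothetical.

```python
L1_KB = 64      # assumed: 32 KB instruction + 32 KB data per core
L2_KB = 256     # per-core L2
L3_KB = 8192    # shared 8 MB L3
CORES = 4

def effective_l2_kb(l2_kb: int, l1_kb: int) -> int:
    """Worst-case L2 capacity not duplicating L1 contents (inclusive caches)."""
    return l2_kb - l1_kb

def requests_reaching_l3(simultaneous: int, l2_hit_rate: float) -> float:
    """Expected number of simultaneous requests that miss L2 and go to L3."""
    return simultaneous * (1.0 - l2_hit_rate)

def average_latency(l2_hit_rate: float, l2_cycles: float, l3_cycles: float) -> float:
    """Mean latency of requests that missed L1, assuming every L2 miss hits L3."""
    return l2_hit_rate * l2_cycles + (1.0 - l2_hit_rate) * l3_cycles

print(effective_l2_kb(L2_KB, L1_KB), "KB effective L2")
print(requests_reaching_l3(CORES, 0.5), "requests reach L3 (4 cores)")
print(requests_reaching_l3(2 * CORES, 0.5), "requests reach L3 (with SMT)")
print(CORES * L2_KB / L3_KB, "added cache as a fraction of the L3")
# Placeholder latencies: 10 cycles for L2 and 40 for L3 are guesses.
print(average_latency(0.5, 10, 40), "cycles on average")
```

Raising the assumed hit rate shrinks both the L3 traffic and the average latency, which is the whole point of putting a small private L2 in front of a single-ported shared L3.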

**savantu** · 03-19-2008, 02:07 AM

Originally Posted by cpuz

...

So, Intel chose to reduce the number of accesses to this shared cache. This is what the small L2s are aimed at. These L2s are small, and due to the inclusive relationship with the L1s, the effective size can be as low as 192 KB (256 KB minus the 64 KB of L1, and not 128 KB as I previously said). With such a size, the hit rate cannot be very high (see the Celeron), but this is not very important. Let's say the hit rate is only 50% (that is a pessimistic assumption); that means that half of the core requests are handled by the L2. So, in the worst case of 4 requests at the same time, only two reach the L3. Exactly the same as what currently happens on the Core 2 Duo.
Moreover, the 50% of the requests handled by the L2 are served much faster than if they were handled by the L3. So, the overall cache hierarchy efficiency is even better.

There are some drawbacks however:
- SMT results in 8 possible simultaneous accesses, and not 4.
- power dissipation is increased. Adding 1 MB (4x256 KB) to the 8 MB L3 results in a 1/8 = 12.5% dissipation increase. For that reason, it is possible that the L3 uses different voltage/clock planes, but I have not had that confirmed yet.

By simply looking at the die picture you see that things are much more complicated. Look at the L3 controllers (the write buffers), they're freaking huge!

The new 2nd level TLB also implies really complex sharing and arbitration mechanisms. All of the above, coupled with Intel's second to none expertise in fast caches, makes me believe we're all going to be surprised by the performance of Nehalem's cache subsystem.

http://chip-architect.com/news/Shanghai_Nehalem.jpg

**Movieman** · 03-19-2008, 02:17 AM

Hi guys:
I read thru here and you folks know much more on the technical end of this than I do.
I work with the dual socket boards so that's what I tend to look for information on.
Now we know that the Harpertowns (Penryn) get an approximate 10% increase clock for clock over the Clovertowns (Kentsfield, C2D), and what I'm hearing is that Nehalem will be 20-30% better clock for clock than the Harpertowns.
That's on pretty good authority.
Not scientific but lets just say this guy knows what he's talking about and no, not someone from this forum.
I also wouldn't stick my neck out and say this if I wasn't pretty damned sure this was true.
