Intel Details Nehalem uArch Improvements - 256KB L2, 8MB L3 Confirmed

**cpuz** · 03-19-2008, 01:53 AM

Originally Posted by LowRun

I have no idea but from what Franck said the 4 access ports shouldn't help to make it faster.

Hey guys,
in the meantime I have learned more about the Nehalem caches.
Contrary to what I said earlier, the L3 does not have 4 access ports, only one. This also explains why these L2s are there. Here is what I understood:

When several cores share a cache level, that cache has to answer all of them as fast as possible so that the cores do not spend too much time waiting. There are two ways to reduce latency:

- Increase the number of access ports. This was my first thought, since it is the best solution on paper. In practice, however, it drastically increases complexity: going from 1 to 4 ports can multiply the cache area by 2 or 3. So this is not feasible at the moment.

- Use a banked access method, a bit like what is done for DRAM. This lets different threads access the cache at the same time (under certain conditions, exactly as with DRAM), but banked access comes with a significant performance penalty. Since an 8 MB L3 is already slow because of its size, this is not a good solution either.
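To make the "under certain conditions" part concrete, here is a minimal sketch of a banked cache. It assumes 64-byte lines, banks interleaved on the line index, and one request per bank per cycle. These are illustrative assumptions on my side, not Intel's actual design, and the names `bank_of` and `cycles_to_serve` are made up for the example.

```python
from collections import Counter

LINE_SIZE = 64  # bytes per cache line (assumed for illustration)

def bank_of(address, num_banks, line_size=LINE_SIZE):
    """Map an address to a bank by interleaving consecutive cache lines."""
    return (address // line_size) % num_banks

def cycles_to_serve(addresses, num_banks):
    """Cycles needed if each bank serves at most one request per cycle."""
    if not addresses:
        return 0
    per_bank = Counter(bank_of(a, num_banks) for a in addresses)
    # The busiest bank decides how long the whole batch takes.
    return max(per_bank.values())

spread = [0, 64, 128, 192]   # four requests landing in four different banks
clash = [0, 256, 512, 768]   # four requests that all map to bank 0
print(cycles_to_serve(spread, 4))  # 1
print(cycles_to_serve(clash, 4))   # 4
```

When the requests fall in different banks they are served in parallel; when they collide on one bank they are serialized, which is where the performance loss comes from.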

So Intel chose to reduce the number of accesses to this shared cache instead. This is what the small L2s are for. These L2s are small, and because of the inclusive relationship with the L1s, the effective size can be as low as 192 KB (256 KB minus the 64 KB of L1, and not 128 KB as I said earlier). With such a size the hit rate cannot be very high (see the Celeron), but this is not very important. Let's say the hit rate is only 50% (a pessimistic assumption): that means half of the core requests are handled by the L2. So in the worst case of 4 simultaneous requests, only two reach the L3. That is exactly what currently happens on the Core 2 Duo.
Moreover, the 50% of requests handled by the L2 are served much faster than if they were handled by the L3. So the overall efficiency of the cache hierarchy is even better.
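The arithmetic above can be sketched in a few lines. The 50% hit rate and the 4 simultaneous requests come from the post; the latencies of 10 and 40 cycles are placeholders I picked for the example, not measured Nehalem figures, and the function names are hypothetical.

```python
def l3_requests(simultaneous_requests, l2_hit_rate):
    """Expected number of requests that miss the L2 and go on to the L3."""
    return simultaneous_requests * (1.0 - l2_hit_rate)

def average_latency(l2_hit_rate, l2_cycles, l3_cycles):
    """Average latency when L2 hits cost l2_cycles and the rest cost l3_cycles."""
    return l2_hit_rate * l2_cycles + (1.0 - l2_hit_rate) * l3_cycles

# 4 cores each issuing one request, 50% L2 hit rate (figures from the post).
print(l3_requests(4, 0.5))            # 2.0

# Placeholder latencies: 10 cycles for the L2, 40 cycles for the L3.
print(average_latency(0.5, 10, 40))   # 25.0
print(average_latency(0.0, 10, 40))   # 40.0, i.e. no L2 at all
```

Note that the two requests reaching the L3 are an expected value: with a 50% hit rate, any given burst of 4 requests can send more or fewer than two to the L3.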

There are some drawbacks, however:
- SMT means 8 possible simultaneous accesses, not 4.
- Power dissipation increases. Adding 1 MB of L2 (4 x 256 KB) on top of the 8 MB L3 gives roughly 1/8 = 12.5% more dissipation. For that reason the L3 may use separate voltage/clock planes, but I have not had that confirmed yet.
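For reference, the two numbers in the drawbacks list work out as below. This is only a back-of-the-envelope sketch: it assumes 2 threads per core with SMT, and that dissipation scales linearly with cache capacity, which is a simplification. The function names are made up for the example.

```python
CORES = 4
THREADS_PER_CORE = 2      # with SMT enabled (assumed 2-way)
L2_KB_PER_CORE = 256
L3_KB = 8 * 1024

def max_simultaneous_accesses(cores=CORES, threads=THREADS_PER_CORE):
    """Upper bound on requests issued at the same time, one per hardware thread."""
    return cores * threads

def added_cache_fraction(cores=CORES, l2_kb=L2_KB_PER_CORE, l3_kb=L3_KB):
    """Extra cache capacity from the L2s, relative to the L3 alone."""
    return (cores * l2_kb) / l3_kb

print(max_simultaneous_accesses())   # 8
print(added_cache_fraction())        # 0.125, i.e. 12.5%
```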
