Originally Posted by
cpuz
...
So, Intel chose to reduce the number of accesses to this shared cache. This is what the small L2s are for. These L2s are small, and due to the inclusive relationship with the L1s, the effective size can be as low as 192 KB (256 KB minus the 64 KB of L1, and not 128 KB as I previously said). With such a size, the hit rate cannot be very high (see the Celeron), but this is not very important. Let's say the hit rate is only 50% (a pessimistic assumption); that means half of the core's requests are handled by the L2. So, in the worst case of 4 simultaneous requests, only two reach the L3. Exactly the same as what currently happens on the Core 2 Duo.
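The arithmetic above can be sketched quickly; this is just back-of-the-envelope, using the Nehalem per-core sizes (32 KB L1D + 32 KB L1I, 256 KB L2) and the pessimistic 50% hit rate assumed in the post, not measured figures:

```python
# Back-of-envelope: how the private, inclusive L2 filters requests to the shared L3.
L1_KB = 32 + 32                    # L1 data + L1 instruction caches per core
L2_KB = 256                        # private L2, inclusive of the L1s
effective_L2_KB = L2_KB - L1_KB    # unique data the L2 holds beyond the L1s
print(effective_L2_KB)             # 192

l2_hit_rate = 0.5                  # pessimistic assumption from the post
cores = 4
requests_to_L3 = cores * (1 - l2_hit_rate)
print(requests_to_L3)              # 2.0 simultaneous requests reaching the L3
```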
Moreover, the 50% of requests handled by the L2 are served much faster than if they were handled by the L3. So, the overall efficiency of the cache hierarchy is even better.
There are some drawbacks however :
- SMT results in 8 possible simultaneous accesses, not 4.
- power dissipation is increased. Adding 1 MB (4 x 256 KB) results in a 1/8 = 12.5% dissipation increase. For that reason, it is possible that the L3 uses separate voltage/clock planes, but I have not had that confirmed yet.
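The 12.5% figure works out as follows; note the 8 MB baseline is only implied by the 1/8 ratio in the post, it is an assumption here:

```python
# Dissipation estimate: 4 cores x 256 KB of added L2, relative to an
# assumed 8 MB baseline (implied by the 1/8 figure, not stated outright).
added_MB = 4 * 256 / 1024          # 1.0 MB of new L2 cache in total
baseline_MB = 8                    # assumed baseline cache size
increase = added_MB / baseline_MB
print(f"{increase:.1%}")           # 12.5%
```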