No, it's not. Barcelona/Shanghai's L3 is pseudo-exclusive (if you will): cache lines that are shared or contain code may end up duplicated, because they are kept in L3 on a fetch instead of being evicted, which breaks strict exclusivity. Nehalem's L3 is inclusive; read Ronak's presentation about it.
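To make the difference between the three policies concrete, here is a toy sketch of what happens to the L3 copy when a core fetches a line that hits in L3. It is only an illustration: the function `l3_hit`, the `shared` flag, and the use of plain sets are my own assumptions, and it ignores capacity, coherence traffic, and whatever heuristic the real hardware uses to decide that a line is shared.

```python
def l3_hit(line, l2, l3, policy, shared=False):
    """Serve a line that hits in L3; return True if L3 keeps its copy.

    `l2` and `l3` are plain sets of line numbers (hypothetical toy model).
    `shared` stands in for "used by several cores or holds code".
    """
    assert line in l3, "this sketch only models L3 hits"
    l2.add(line)  # the requesting core gets the line in its private cache

    if policy == "inclusive":
        keep = True        # L3 always retains a copy of what the cores hold
    elif policy == "exclusive":
        keep = False       # the line moves up, L3 gives up its copy
    elif policy == "pseudo-exclusive":
        keep = shared      # only shared/code lines stay duplicated in L3
    else:
        raise ValueError(f"unknown policy: {policy}")

    if not keep:
        l3.discard(line)
    return keep


if __name__ == "__main__":
    for policy, shared in [("exclusive", False),
                           ("pseudo-exclusive", False),
                           ("pseudo-exclusive", True),
                           ("inclusive", False)]:
        l2, l3 = set(), {42}
        kept = l3_hit(42, l2, l3, policy, shared)
        print(f"{policy:17s} shared={shared!s:5s} -> L3 keeps line: {kept}")
```

The only case that differs from a textbook exclusive cache is the shared line under the pseudo-exclusive policy, which stays in L3 after the fetch.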
Higher associativity means longer snoop time (more ways to check on every lookup), thus higher latency. There's no such thing as a free lunch in chip design. The prefetcher has merit mostly for desktop workloads, where datasets are small and good prefetching could, in theory, let you avoid fetching from RAM (which is still lol-slow even with IMCs). For large datasets it may be detrimental, since it's more likely to increase cache thrashing. Barcelona/Shanghai also include prefetchers, albeit less capable ones than Intel's.
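Here's a minimal sketch of how a prefetcher can both help and hurt. It assumes a tiny fully associative LRU cache and a naive next-line prefetcher that fires only on a demand miss. Real prefetchers are far smarter than this, and the traces are hand-picked to make the effect visible, so treat the names (`count_misses`, `sequential`, `strided`) and the numbers as illustration only.

```python
from collections import OrderedDict


def count_misses(trace, capacity, prefetch):
    """Count demand misses in a fully associative LRU cache of `capacity` lines."""
    cache = OrderedDict()  # keys are line numbers, ordered from LRU to MRU

    def touch(line):
        # Insert or refresh a line, evicting the least recently used if full.
        if line in cache:
            cache.move_to_end(line)
            return
        if len(cache) >= capacity:
            cache.popitem(last=False)
        cache[line] = True

    misses = 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)   # demand hit
        else:
            misses += 1               # demand miss
            touch(line)
            if prefetch:
                touch(line + 1)       # naive next-line prefetch
    return misses


# Streaming walk over consecutive lines: the prefetcher guesses right.
sequential = list(range(16))
# Loop over four lines with a stride of two: the working set fits in the
# cache on its own, but every prefetched line is useless and evicts a useful one.
strided = [0, 2, 4, 6] * 4

if __name__ == "__main__":
    for name, trace in [("sequential", sequential), ("strided", strided)]:
        off = count_misses(trace, capacity=4, prefetch=False)
        on = count_misses(trace, capacity=4, prefetch=True)
        print(f"{name:10s} misses without prefetch: {off:2d}, with prefetch: {on:2d}")
```

With these traces and a 4-line cache, the sequential walk drops from 16 misses to 8 with prefetching, while the strided loop goes from 4 misses to 16, because the prefetched lines keep pushing out the ones that are about to be reused. That's thrashing in miniature.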