AMD has a pseudo-exclusive cache relationship (data cannot be found in two cache levels at the same time, although “pseudo” means that there are a few exceptions). Things get problematic when you have an L3 miss: you need to check what the other cores have in their caches.
With an inclusive cache, an L3 miss guarantees that the data is not in the other cores' caches either, so a memory request is sent right away.
Nehalem's inclusive relationship, together with the flag system it uses to maintain coherency, gives Intel little to no headache over coherency traffic.
AMD's caches burn more BW and latency on this problem, that's all.
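To make the difference concrete, here is a rough toy model of what has to be probed on an L3 lookup in each scheme. It is only a sketch under simplifying assumptions: the class names, the `cores_to_probe` method and the per-line set of core ids (standing in for the flags mentioned above) are all made up for illustration, and the "pseudo" exceptions of the exclusive scheme are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class InclusiveL3:
    """Toy inclusive L3: every line held in a private cache is also in L3,
    tagged with the set of cores that may hold it (hypothetical flag model)."""
    num_cores: int
    lines: dict = field(default_factory=dict)  # address -> set of core ids

    def cores_to_probe(self, addr, requester):
        if addr not in self.lines:
            # Inclusive: an L3 miss proves no private cache has the line,
            # so the request can go straight to memory.
            return []
        # L3 hit: only the cores flagged for this line need a probe.
        return sorted(self.lines[addr] - {requester})


@dataclass
class ExclusiveL3:
    """Toy exclusive L3: a line lives in L1/L2 or in L3, not in both
    (assumption: the 'pseudo' exceptions are ignored here)."""
    num_cores: int
    lines: set = field(default_factory=set)  # addresses currently in L3

    def cores_to_probe(self, addr, requester):
        if addr in self.lines:
            return []
        # An L3 miss says nothing about the private caches,
        # so every other core has to be asked.
        return [c for c in range(self.num_cores) if c != requester]


inclusive = InclusiveL3(num_cores=4)
inclusive.lines[0x40] = {2}
exclusive = ExclusiveL3(num_cores=4)

print(inclusive.cores_to_probe(0x80, requester=0))  # L3 miss, nobody to probe
print(inclusive.cores_to_probe(0x40, requester=0))  # L3 hit, probe flagged core only
print(exclusive.cores_to_probe(0x80, requester=0))  # L3 miss, probe everyone else
```

In this model the inclusive miss needs zero probes while the exclusive miss needs one per other core, which is the extra BW and latency being argued about.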
That is simply not true. AMD isn't bottlenecked by interconnects; it is precisely the coherency traffic which kills it. Huge amounts of BW are wasted on maintaining coherency.
In a Nehalem multi-CPU system, you only need to maintain the coherency of the L3s. Furthermore, Intel implemented a directory-based coherence protocol, which is point to point instead of broadcast.
Not so with the Opteron, because data in L1/L2 is more or less guaranteed not to be in the L3. Also, they use a snoop-based protocol, in which the caches listen in on the transport of variables to any of the CPUs and update their own copies of these variables if they have them. Snooping logic in the processor broadcasts a message over the bus each time a word in its cache has been modified. The snooping logic also snoops on the bus looking for such messages from other processors.
Since K8/K10 use 64-byte lines, can you imagine the traffic in a 4-socket system to maintain coherency? What about 8 sockets? Yeah, HT 3.0 will help, but it is a band-aid curing the symptoms by brute force (more BW) and not the disease (which calls for a better cache coherency protocol).
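A back-of-the-envelope way to see the scaling problem: with broadcast snooping every miss probes all the other sockets, while a directory only contacts the sockets it has on record for that line. The sketch below is a crude model, not a measurement. The function names, the 1000-miss workload and the `avg_sharers=1` figure are assumptions picked purely for illustration, and responses, directory lookups and the data transfers themselves are not counted.

```python
def broadcast_probes(sockets, misses):
    """Broadcast snooping: each miss sends a probe to every other socket."""
    return misses * (sockets - 1)


def directory_probes(sockets, misses, avg_sharers=1):
    """Directory: each miss probes only the sockets listed as holding the line.
    avg_sharers is an assumed average, capped at the number of other sockets."""
    return misses * min(avg_sharers, sockets - 1)


for sockets in (2, 4, 8):
    print(sockets, broadcast_probes(sockets, 1000), directory_probes(sockets, 1000))
```

Under these assumptions the broadcast probe count grows linearly with the socket count (3000 probes at 4 sockets, 7000 at 8 for the same 1000 misses), while the directory count stays flat at 1000.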
Why do you think Newisys tried to build Horus, a directory-based chipset, to overcome this?