Quote: Originally Posted by xsecret
WCC is a joke in the current BD implementation and is not able to make up for the massive loss that comes from the L1D. The entire caching system is lowering the performance of the µarch. The L3 is a non-inclusive victim cache (L2 data are evicted to the L3), with data transferred from L3 to the L1D of the expected core without being copied to the L2. That means high snoop traffic in order to keep the coherency correct. And snoop traffic is something really unwanted from a bandwidth/performance POV. There is a paradox here: the L1 is in Write-through, but you can't be sure that data not in L2 is not in the L1D of another core.
Interesting reads here, but 3 questions:

1.) Why would AMD do such a thing? (You wrote it was for better clock scaling; does that mean higher clocks or better scaling?)

2.) Can it be optimized further?

3.)
Write-through means that every write to the cache causes a synchronous write to the backing store. Because L2 is slower than L1, L1 must wait for L2 to write out data.
this somewhat contradicts
The L3 is a non-inclusive victim cache (L2 data are evicted to the L3), with data transferred from L3 to the L1D of the expected core without being copied to the L2.
Or do you mean this can be due to the WCC? I am not a professional, but I would guess the WCC doesn't take that long to preserve coherence... If anything, I'd guess the L1D waiting for L2 can be a problem, but if stuff is written to L1D fast, and then some cycles later the WCC writes to L2, I guess there will only be rare cases where
The L1 is in Write-through, but you can't be sure that data not in L2 is not in the L1D of another core.
really matters...
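
To make my guess a bit more concrete, here is a toy sketch in Python of what a write-coalescing buffer between a write-through L1 and the L2 would buy you. This is only my own simplified model, not how the real WCC in BD works: the 64-byte line size, the 4-line buffer, the LRU eviction and the function names are all assumptions I made up for illustration, and it only counts writes reaching L2, not latency or coherency traffic.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed for this toy model)


def l2_writes_plain_write_through(addresses):
    """Plain write-through: every store is forwarded to L2 on its own."""
    return len(addresses)


def l2_writes_with_coalescing(addresses, buffer_lines=4):
    """Write-through with a small coalescing buffer (hypothetical model).

    Stores to a line already in the buffer are merged. L2 only sees
    one write when a line is evicted (LRU) or when the buffer drains.
    """
    buffer = OrderedDict()  # line number -> dirty marker, oldest first
    l2_writes = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        if line in buffer:
            buffer.move_to_end(line)  # merged store, no L2 write
        else:
            if len(buffer) == buffer_lines:
                buffer.popitem(last=False)  # evict oldest line to L2
                l2_writes += 1
            buffer[line] = True
    return l2_writes + len(buffer)  # drain what is left at the end


sequential = list(range(0, 256, 8))        # 32 stores covering 4 lines
scattered = [i * LINE_SIZE for i in range(32)]  # 32 stores, 32 lines
```

In this model the 32 sequential stores reach L2 as 4 writes instead of 32, while the 32 scattered stores still cost 32 writes, so whether the buffer helps depends entirely on the access pattern.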

I guess
data transferred from L3 to the L1D of the expected core without being copied to the L2.
will first speed up things, then
snoop traffic in order to keep the coherency correct
will need some time. As I see it, it all depends on how this stuff is used: it can be faster in one case and slower in another... Hmm, we came to that conclusion somewhere before.

Maybe the problem is because of
The L1 is in Write-through, but you can't be sure that data not in L2 is not in the L1D of another core.
, one core might need to wait for WCC to complete to ensure coherency?

Ahhhh btw... CONGRATZ FOR NEW WR