I agree, but the BD architecture makes fewer tradeoffs than NetBurst. I expect at least the same per-module performance as K10, not 40-50% lower. Look at the horrific Chinese wPrime results: 65% slower than a Thuban core, per core and per clock. Something is wrong here. I still can't believe it is true.
Average IPC in most workloads isn't much more than 1 on a Thuban core. A 4-wide front end is more than enough to feed two threads.
No, that doesn't mean WT. And keep in mind the horribly slow L1 write-through policy, probably added to remove a bottleneck in frequency scaling. Write-through means you're writing from the front end to the L2 "through" the L1.
Write-through means that every write to the cache causes a synchronous write to the backing store. Because the L2 is slower than the L1, the L1 must wait for the L2 to write out the data. But there is a WCC (Write Coalescing Cache) to hold data for later write-out. I can't see why a WT L1 is such an issue for the BD core. The ratio between loads and stores is around 2:1: for every two loads, we have one store.
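A toy sketch of how a write-coalescing buffer can hide the WT penalty. The entry count, line size, and FIFO eviction here are my own assumptions for illustration, not AMD's documented design:

```python
# Toy model of a write-through L1 backed by a write-coalescing cache (WCC).
# Sizes and the FIFO eviction policy are illustrative assumptions.

class WriteCoalescingCache:
    def __init__(self, entries=4, line_size=64):
        self.entries = entries      # number of WCC slots (assumed)
        self.line_size = line_size  # bytes per cache line
        self.buffer = {}            # line base address -> set of dirty byte offsets
        self.l2_writes = 0          # write transactions actually sent to L2

    def write(self, addr, size=8):
        line = addr // self.line_size * self.line_size
        if line not in self.buffer:
            if len(self.buffer) == self.entries:
                # evict the oldest entry: its coalesced bytes leave as ONE L2 write
                self.buffer.pop(next(iter(self.buffer)))
                self.l2_writes += 1
            self.buffer[line] = set()
        self.buffer[line].update(range(addr - line, addr - line + size))

    def flush(self):
        self.l2_writes += len(self.buffer)
        self.buffer.clear()

# Eight sequential 8-byte stores to one cache line coalesce into a single
# L2 write, so the front end never pays the L2 latency per store.
wcc = WriteCoalescingCache()
for off in range(0, 64, 8):
    wcc.write(0x1000 + off)
wcc.flush()
print(wcc.l2_writes)  # -> 1
```

So for well-behaved streaming stores the WCC should absorb most of the WT traffic, which is why I doubt the policy alone explains the results.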
Not quite, because of the WCC. So, seen from the front end, the L1 write bandwidth is as "slow" as the L2 write bandwidth.
Again, I don't think the problem is WT. The L1D is WT, the WCC is a write buffer, and the L2 is probably WB. Because of the WCC there can be issues with multiple write-out streams. Also, we don't know how the WCC behaves when two integer cores are writing data; there is probably WCC cache thrashing. The last µarch to use that horrible trick was NetBurst, designed with high frequencies in mind. Bulldozer comes with a WT L1 too, and that point alone could explain many disappointments from a performance point of view.
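To illustrate the thrashing worry: a purely speculative toy model where two cores' strided store streams share one small coalescing buffer. Entry count, FIFO eviction, and the shared-buffer arbitration are all my assumptions, since the real WCC behaviour is undocumented:

```python
# Speculative toy model of WCC thrashing when two integer cores share
# one small write-coalescing buffer. All parameters are assumptions.

def simulate(stream, entries=4, line_size=64):
    """Count L2 write transactions for a sequence of 8-byte store addresses."""
    buf = []             # FIFO of cache-line numbers currently coalescing
    l2_writes = 0
    for addr in stream:
        line = addr // line_size
        if line not in buf:
            if len(buf) == entries:
                buf.pop(0)           # evict a line, possibly half-coalesced
                l2_writes += 1
            buf.append(line)
    return l2_writes + len(buf)      # final flush of whatever remains

def strided(base, lines=4, line_size=64):
    # one core cycling 8-byte stores across `lines` cache lines
    return [base + l * line_size + off
            for off in range(0, line_size, 8)
            for l in range(lines)]

core0 = strided(0x10000)
core1 = strided(0x80000)
interleaved = [a for pair in zip(core0, core1) for a in pair]

print(simulate(core0))                    # -> 4: each line coalesces fully
print(simulate(core0) + simulate(core1))  # -> 8: two cores, private buffers
print(simulate(interleaved))              # -> 64: shared buffer, every store misses
```

With eight active lines cycling through a four-entry buffer, every store evicts a partially coalesced line, so L2 write traffic goes from 8 transactions to 64. If anything like that happens when both cores of a module are storing, it would hurt far more than the WT policy by itself.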



