I agree, but the BD architecture makes fewer tradeoffs than NetBurst. I expect at least the same per-module performance as K10, not 40-50% lower. Look at the horrific Chinese wPrime results: 65% slower than a Thuban core, per core and per clock. Something is wrong here. I still can't believe it is true.
Average IPC in most workloads isn't much more than 1 on a Thuban core. A 4-wide front end is more than enough to feed two threads.
No, that doesn't mean WT. And keep in mind the horribly slow L1 write-through policy, probably added to remove a bottleneck in frequency scaling. Write-through means you're writing from the front end to the L2 "through" the L1.
Write-through means that every write to the cache causes a synchronous write to the backing store. Because the L2 is slower than the L1, the L1 must wait for the L2 to write out the data. But there is a WCC (Write Coalescing Cache) to hold data for later write-out. I can't see why a WT L1 is such an issue for the BD core. The ratio between loads and stores is around 2:1: for every two loads, we have one store.
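A toy sketch of how a write-coalescing buffer can hide the WT penalty. The entry count, line size, and FIFO eviction here are my own assumptions for illustration, not AMD's documented design:

```python
# Toy model of a write-through L1 backed by a write-coalescing cache (WCC).
# Sizes and the FIFO eviction policy are illustrative assumptions.

class WriteCoalescingCache:
    def __init__(self, entries=4, line_size=64):
        self.entries = entries      # number of WCC slots (assumed)
        self.line_size = line_size  # bytes per cache line
        self.buffer = {}            # line base address -> set of dirty byte offsets
        self.l2_writes = 0          # write transactions actually sent to L2

    def write(self, addr, size=8):
        line = addr // self.line_size * self.line_size
        if line not in self.buffer:
            if len(self.buffer) == self.entries:
                # evict the oldest entry: its coalesced bytes leave as ONE L2 write
                self.buffer.pop(next(iter(self.buffer)))
                self.l2_writes += 1
            self.buffer[line] = set()
        self.buffer[line].update(range(addr - line, addr - line + size))

    def flush(self):
        self.l2_writes += len(self.buffer)
        self.buffer.clear()

# Eight sequential 8-byte stores to one cache line coalesce into a single
# L2 write, so the front end never pays the L2 latency per store.
wcc = WriteCoalescingCache()
for off in range(0, 64, 8):
    wcc.write(0x1000 + off)
wcc.flush()
print(wcc.l2_writes)  # -> 1
```

So for well-behaved streaming stores the WCC should absorb most of the WT traffic, which is why I doubt the policy alone explains the results.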
Not quite, because of the WCC. So, seen from the front end, the L1 write bandwidth is as "slow" as the L2 write bandwidth.
Again, I don't think the problem is WT. The L1D is WT, the WCC is a write buffer, and the L2 is probably WB. Because of the WCC there can be issues with multiple write-out streams. Also, we don't know how the WCC behaves when two integer cores are writing data; there is probably WCC cache thrashing. The last µarch to use that horrible trick was NetBurst, designed with high frequencies in mind. Bulldozer comes with a WT L1 too, and that point alone could explain many disappointments from a performance point of view.
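To illustrate the thrashing worry: a purely speculative toy model where two cores' strided store streams share one small coalescing buffer. Entry count, FIFO eviction, and the shared-buffer arbitration are all my assumptions, since the real WCC behaviour is undocumented:

```python
# Speculative toy model of WCC thrashing when two integer cores share
# one small write-coalescing buffer. All parameters are assumptions.

def simulate(stream, entries=4, line_size=64):
    """Count L2 write transactions for a sequence of 8-byte store addresses."""
    buf = []             # FIFO of cache-line numbers currently coalescing
    l2_writes = 0
    for addr in stream:
        line = addr // line_size
        if line not in buf:
            if len(buf) == entries:
                buf.pop(0)           # evict a line, possibly half-coalesced
                l2_writes += 1
            buf.append(line)
    return l2_writes + len(buf)      # final flush of whatever remains

def strided(base, lines=4, line_size=64):
    # one core cycling 8-byte stores across `lines` cache lines
    return [base + l * line_size + off
            for off in range(0, line_size, 8)
            for l in range(lines)]

core0 = strided(0x10000)
core1 = strided(0x80000)
interleaved = [a for pair in zip(core0, core1) for a in pair]

print(simulate(core0))                    # -> 4: each line coalesces fully
print(simulate(core0) + simulate(core1))  # -> 8: two cores, private buffers
print(simulate(interleaved))              # -> 64: shared buffer, every store misses
```

With eight active lines cycling through a four-entry buffer, every store evicts a partially coalesced line, so L2 write traffic goes from 8 transactions to 64. If anything like that happens when both cores of a module are storing, it would hurt far more than the WT policy by itself.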



