I hope the situations where prefetchnt instructions are critical are rare. Maybe they could be critical when the CPU works with high data parallelism and SIMD instructions. However, they are critical when we have data stream conflicts. When we use a single stream, writing out data is comparable to 10h.
The lower performance of prefetchnt instructions is probably caused by the WT (write-through) data cache policy. Because L1D is WT, every write to the cache causes a synchronous write to the backing store.
To avoid a performance drop, AMD's designers included a WCC (Write Coalescing Cache) for WT stores for both integer cores.
In general, the PREFETCHNTA instruction hints to the processor that the data should be fetched non-temporally (i.e. the data will be used only once and not again). E.g. if you're copying data from one location to another, you can use this instruction. The PREFETCHTn instructions hint to the processor that the data will be needed repeatedly. E.g. you're doing calculations on the same data.
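
To make the difference concrete, here is a minimal sketch in C, assuming GCC or Clang. It uses the compiler builtin __builtin_prefetch rather than the instructions directly; on x86 these compilers typically emit PREFETCHNTA for locality 0 and PREFETCHT0 for locality 3, but that mapping is up to the compiler. The function names and the PREFETCH_AHEAD distance are made up for illustration and are not tuned for any particular CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance in elements; a real value needs measuring. */
#define PREFETCH_AHEAD 64

/* One-pass copy: each source element is read exactly once, so we hint
 * non-temporal (locality 0, typically PREFETCHNTA on x86). */
void copy_once(float *dst, const float *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&src[i + PREFETCH_AHEAD], 0, 0);
        dst[i] = src[i];
    }
}

/* Repeated passes over the same data: the data is reused, so we hint
 * high temporal locality (locality 3, typically PREFETCHT0 on x86). */
float sum_passes(const float *data, size_t n, int passes)
{
    assert(data != NULL);
    float total = 0.0f;
    for (int p = 0; p < passes; p++) {
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_AHEAD < n)
                __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 3);
            total += data[i];
        }
    }
    return total;
}
```

For simplicity this issues a prefetch for every element; in real code one prefetch per cache line is enough. Prefetches are only hints, so the results are the same with or without them; only the timing can change.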