I assume you are referring to Johan's cache ping-pong test; one of your later posts confirmed this. I followed the development of this test on the original Aces Hardware forums a while ago, and many ideas were discussed back then. You can find the full discussion and an early version of the code here:
http://web.archive.org/web/200505281...0681&forumid=2
First I have to say that this particular test covers one special variant of core-to-core communication. And here I think K10 takes a performance hit in this benchmark due to its write buffering and maybe also the L3 cache (which BTW adds ~20 ns to memory latency in case of a miss). This benchmark doesn't tell us anything about how fast a core can access data in another core's cache when that data was written not immediately before the access but at least tens of cycles earlier. Except for semaphores and the like, such an access pattern would simply indicate a bad multithreaded coding style.
SSE(2) instructions are mostly double decoded on K8; on K7, SSE was vector decoded. Since the two separate ops for the register halves on K8 finish one cycle apart, the result is a nice 4-cycle latency for standard ops (add, sub, mul).
But as pointed out in the past (google for "k8 sse bottleneck"), there was strange behaviour with SSE loads, as the tests here show again. Maybe due to the double decoding, such a decoded instruction had to occupy a single FP unit sequentially. While x87 or MMX loads could fetch two 64-bit values per cycle, aligned 128-bit loads could not, resulting in 0.5 SSE loads/cycle. K10 solves this (maybe simply by avoiding the double decoding), quadrupling SSE load performance compared to K8.