I assume you are referring to Johan's cache ping-pong test; one of your later posts confirmed this. I followed the development of this test on the original Aces Hardware forums a while ago, and many ideas were discussed back then. You can find the full discussion and an early version of the code here:
http://web.archive.org/web/200505281...0681&forumid=2
First I have to say that this particular test covers one special variant of core-to-core communication. And here I think K10 takes a performance hit in this benchmark due to its write buffering and maybe also the L3 cache (which BTW adds ~20 ns to memory latency in case of a miss). This benchmark doesn't tell us anything about how fast a core can access data in another core's cache when that data was written not immediately before the access but at least tens of cycles earlier. Except for semaphores and the like, such an access pattern would simply indicate a bad multithreaded coding style.
SSE(2) instructions are mostly double decoded on K8; on K7, SSE was vector decoded. Since the two separate ops for the register halves on K8 finish one cycle apart, the result is a nice 4-cycle latency for standard ops (add, sub, mul).
But as pointed out in the past (google for "k8 sse bottleneck"), there was strange behaviour with SSE loads, as the tests here show again. Maybe due to the double decoding, such a decoded instruction had to occupy a single FP unit sequentially. While x87 or MMX loads could fetch two 64-bit values per cycle, aligned 128-bit loads could not, resulting in 0.5 SSE loads/cycle. K10 solves this (maybe simply by avoiding the double decoding), quadrupling SSE load performance compared to K8.