Thread: Here's a little teaser....

  1. #1
    Xtreme Addict
    Join Date
    Jun 2007
    Location
    Thessaloniki, Greece
    Posts
    1,307
    I posted this in the news thread as well.
    Here are some quick K8 results @ 2GHz: specint base=9.77, rate=10.8;
    specfp base=10.4, rate=10.9.
    This is way too close. Either something is wrong with the K10 IBM system, or K10 is going to be a huge disappointment for everybody except the HPC crowd. I would wait for at least a couple more results to be published. Just look at the problems stephen has been having with the various BIOSes.
    http://www.spec.org/cpu2006/results/...828-01902.html
    http://www.spec.org/cpu2006/results/...828-01900.html
    Seems we made our greatest error when we named it at the start
    for though we called it "Human Nature" - it was cancer of the heart
    CPU: AMD X3 720BE@ 3,4Ghz
    Cooler: Xigmatek S1283(Terrible mounting system for AM2/3)
    Motherboard: Gigabyte 790FXT-UD5P(F4) RAM: 2x 2GB OCZ DDR3 1600Mhz Gold 8-8-8-24
    GPU:HD5850 1GB
    PSU: Seasonic M12D 750W Case: Coolermaster HAF932(aka Dusty )

  2. #2
    Registered User
    Join Date
    Sep 2007
    Posts
    43

    barcelona

    Quote Originally Posted by BrowncoatGR View Post
    I posted this in the news thread as well.
    Um, why are we looking at single-threaded benches on a native quad core? I'm confused. The power of the platform is that all the cores can talk to each other very quickly. The only thing I don't understand is that SSE should be twice as fast as on K8, even on a single thread.

    One thing I know for sure: on my Quad FX, setting the memory speed to less than 800 slows it down, even with lower-latency timings, and setting the BIOS to interleave all (disabling NUMA) kills it. It seemed S7 has that problem. Comparing against an Intel with slower memory timings proves nothing, because the platforms are totally different. In my tests nothing uses a dual-socket AMD system as well as SUSE Linux. Windows XP x64 sometimes takes 20 minutes to figure out it's not using local memory and then move the thread so that it does. Vista wasn't much better. Off-node hits kill the AMD 2P.

    The desktop quad AMD will be using HT3 and have a lot more memory bandwidth than Barcelona. I also heard that at a certain clock speed it has some kind of power band that kicks in, and that it's amazing; since the person who said that was one of the first to provide benchmarks, I'm going to guess that's right. Barcelona had to plug into an old socket and run, and it does, but not as fast as it will in a new socket.

    I know this: Barcelona scales better than linear, which has never before happened in the history of processors. For example, 1 processor scores 50 and 2 processors score 110, where history says 2 processors are about 30% faster than 1, or 65.

  3. #3
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by MR_SmartAss View Post
    Depends on what is "very quickly". The cores on a K10 @ 2.4GHz or less are communicating slower than the cores of the different dies of the Core2 Quad MCM at same frequency.
    I assumed that you were referring to Johan's cache ping-pong test. One of your later postings confirmed this. Well, I followed the development of this test on the original aceshardware forums a while ago, and many ideas were discussed back then. You can find the full discussion and an early version of the code here:
    http://web.archive.org/web/200505281...0681&forumid=2

    First I have to say that this particular test covers one specific variant of core-to-core communication. Here I think that K10 took a performance hit in this benchmark due to its write buffering and maybe even its L3 cache (which, BTW, adds ~20ns to memory latency in case of a miss). This benchmark doesn't tell us anything about how fast a core can access data in another core's cache when that data was written not right before the access but at least tens of cycles earlier. Except for semaphores and the like, such access behaviour would just indicate a bad multithreaded coding style.

    Quote Originally Posted by MR_SmartAss View Post
    Depends on what kind of SSE code. For some code it is true, for some it isn't. For example, during the decode phase the 128-bit SSE instructions on the K8 are split (vector path code) into two 64-bit ops and executed in 2 cycles. K10 doesn't split the 128-bit SSE instructions and executes them in 1 cycle.
    SSE(2) instructions are mostly double decoded on K8; SSE was vector decoded on K7. Since the two separate ops for the register halves on K8 finish one cycle apart, this led to a nice 4-cycle latency for standard ops (add, sub, mul).

    But as pointed out in the past (google for "k8 sse bottleneck"), there was a strange behaviour regarding SSE loads, as you can see again in the tests here. Maybe due to the double decode, it was necessary for such a decoded instruction to use a single FP unit sequentially. While x87 or MMX loads could load two 64-bit values per cycle, this was not true for aligned 128-bit loads, resulting in 0.5 SSE loads/cycle. This has been solved (maybe simply by avoiding the double decoding), leading to quadrupled SSE load performance compared to K8.
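
    The load-rate arithmetic above can be put into a short sketch. The K8 figures are the ones given in the post (two 64-bit x87/MMX loads per cycle, 0.5 aligned 128-bit SSE loads per cycle). The K10 rate of 2 loads per cycle is not stated directly; it is only inferred here from the "quadrupled" figure, so treat it as an assumption. The function and variable names are made up for illustration.

```python
def bytes_per_cycle(loads_per_cycle, load_width_bits):
    """Peak load bandwidth in bytes per clock cycle."""
    return loads_per_cycle * load_width_bits / 8


# K8 figures as given in the post.
k8_mmx = bytes_per_cycle(2, 64)      # two 64-bit x87/MMX loads per cycle
k8_sse = bytes_per_cycle(0.5, 128)   # one aligned 128-bit load every 2 cycles

# Assumption: "quadrupled" SSE load performance means 4 x 0.5 = 2 loads/cycle.
K10_SSE_LOADS_PER_CYCLE = 4 * 0.5
k10_sse = bytes_per_cycle(K10_SSE_LOADS_PER_CYCLE, 128)

print(f"K8  x87/MMX loads: {k8_mmx:.0f} bytes/cycle")
print(f"K8  SSE loads:     {k8_sse:.0f} bytes/cycle")
print(f"K10 SSE loads:     {k10_sse:.0f} bytes/cycle (assumed rate)")
```

    Under these numbers the K8 moved only half as many bytes per cycle with 128-bit SSE loads as with 64-bit loads, which is why it got called a bottleneck.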

  4. #4
    Xtreme Addict
    Join Date
    May 2004
    Posts
    1,756
    Quote Originally Posted by Dresdenboy View Post
    I assumed that you were referring to Johan's cache ping-pong test. One of your later postings confirmed this. Well, I followed the development of this test on the original aceshardware forums a while ago, and many ideas were discussed back then. You can find the full discussion and an early version of the code here:
    http://web.archive.org/web/200505281...0681&forumid=2

    First I have to say that this particular test covers one specific variant of core-to-core communication. Here I think that K10 took a performance hit in this benchmark due to its write buffering and maybe even its L3 cache (which, BTW, adds ~20ns to memory latency in case of a miss). This benchmark doesn't tell us anything about how fast a core can access data in another core's cache when that data was written not right before the access but at least tens of cycles earlier. Except for semaphores and the like, such access behaviour would just indicate a bad multithreaded coding style.


    SSE(2) instructions are mostly double decoded on K8; SSE was vector decoded on K7. Since the two separate ops for the register halves on K8 finish one cycle apart, this led to a nice 4-cycle latency for standard ops (add, sub, mul).

    But as pointed out in the past (google for "k8 sse bottleneck"), there was a strange behaviour regarding SSE loads, as you can see again in the tests here. Maybe due to the double decode, it was necessary for such a decoded instruction to use a single FP unit sequentially. While x87 or MMX loads could load two 64-bit values per cycle, this was not true for aligned 128-bit loads, resulting in 0.5 SSE loads/cycle. This has been solved (maybe simply by avoiding the double decoding), leading to quadrupled SSE load performance compared to K8.
    Really enjoy your posts here, they're very informative and always a nice addition to the forums

  5. #5
    Registered User
    Join Date
    Sep 2007
    Posts
    43

    Lol

    Quote Originally Posted by MR_SmartAss View Post
    Because we care about desktop performance. Most of the desktop software today is single or dual threaded. Four-threaded software is a rarity, and we haven't seen any application that can fully utilize all 4 cores yet.
    As for the "native" epithet, that's only marketing, which means nothing.

    Depends on what is "very quickly". The cores on a K10 @ 2.4GHz or less are communicating slower than the cores of the different dies of the Core2 Quad MCM at same frequency.

    Depends on what kind of SSE code. For some code it is true, for some it isn't. For example, during the decode phase the 128-bit SSE instructions on the K8 are split (vector path code) into two 64-bit ops and executed in 2 cycles. K10 doesn't split the 128-bit SSE instructions and executes them in 1 cycle.

    You would notice the same on every system, but it is more noticeable on a K8 (regardless of the number of CPUs).
    Sometimes yes, but sometimes it performs faster without NUMA. It depends on the OS and the code being processed.

    HT3 is useless on the desktop and won't offer any performance benefit over HT2 or HT1.
    The bandwidth on the AMD platforms scales with the number of sockets, so a single desktop CPU won't have more bandwidth than Barcelona (2 or more CPUs, ccNUMA). It will only have RAM with lower latency, which would boost its performance for sure. But by how much, we can only speculate. Having two IMCs and a large L3 as a medium between the cores and the RAM leads me to conclude that it won't bring any dramatic performance improvement. 5% would be impressive.

    I don't know who that person is, but I know that at higher frequencies K10 (like every CPU made to date) doesn't scale better in performance. At certain frequencies (such as 1.6GHz, 2.4GHz and 3.2GHz) it will run the RAM at a slightly higher (5% to 10%) frequency, but that won't make any noticeable difference in performance. The same happens with the K8, but we don't see a 2.4GHz K8 offering any noticeable IPC advantage over a 2.3GHz one.

    This is nonsense. Barcelona doesn't scale better than linear, nor does it scale linearly.

    http://www.anandtech.com/cpuchipsets...spx?i=3092&p=6
    Note that this comparison is between the "bugged?" B1 and the new B2, so if you compared B2 to B2, the scaling would be even lower.
    Also, there was a guy (I don't remember who) from AMD's server division who officially said that K10 @ 2.5GHz would be around 15% faster than a 2GHz K10.
    I'm not going to respond to this, except for the following.

    Anandtech's Linpack test clearly showed that the processor scaled better than 100% going from one socket to two: it was more than twice as fast on 2 as on one (better than linear)!

    As for the rest of what you said, you don't own the hardware and are repeating what others have said.

    Then you can explain why Quad FX performs better at 5-5-5-18 800 than at 4-4-4-12 800. We know an Intel wouldn't react the same way, because the platforms are way different.
    This is a server processor, not a desktop one. A single socket will get more memory bandwidth on the desktop platform (single socket); obviously more processors on a NUMA system provide more bandwidth, that's why AMD made it.

    Last point: years ago my friend did Seti. He had an 866 Coppermine Intel, and over in the corner was an old Xeon P2 400. The P2 400 slaughtered the 866 on Seti at half the clock speed, so he decided to try games on it, and games weren't even playable.
    Moral of the story: trying to guess desktop performance based on server chips doesn't always work out the way you think (pointless).
    Spec clearly showed AMD smashes Core on FP with proper code, it's that simple, by over 100% on some types of code. I could be wrong, but I believe there are 19 FP tests, and AMD @ 2GHz beat Intel @ 3GHz on 17 of them. It also clearly showed that Intel wins on povray, which everyone loves to run. See, what happens is this: a new processor gets run on Spec; Intel, being a huge beast and media hound, runs over to Spec.org to see where they beat AMD; then they use the media and such to make the tests they win into standard benchmarks; then Intel throws in their shady compiler and bam, they have a winner. Anandtech also clearly showed that on code not compiled with the Intel compiler, AMD wins again.
    http://aceshardware.freeforums.org/v...r=asc&start=60

    Hey, if you can't win fairly, cheat LOL

  6. #6
    Xtreme Enthusiast
    Join Date
    Oct 2006
    Posts
    896
    Anandtech's Linpack test clearly showed that the processor scaled better than 100% going from one socket to two: it was more than twice as fast on 2 as on one (better than linear)!


    20-25 to 35-45, that's 100%+ alright. OTOH, 25-30 to 45-55, seems Intel scales better here.
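
    A rough way to see what those ranges can and cannot show, as a sketch only. It assumes each "a-b" is an eyeballed low/high chart reading for one socket and for two sockets, and that the first pair is the K10 system and the second the Intel one; the helper name is made up for illustration.

```python
def speedup_bounds(one_socket, two_socket):
    """Lowest and highest speedup consistent with two (low, high) ranges."""
    one_lo, one_hi = one_socket
    two_lo, two_hi = two_socket
    # Worst case: slowest 2-socket reading over fastest 1-socket reading.
    # Best case: fastest 2-socket reading over slowest 1-socket reading.
    return two_lo / one_hi, two_hi / one_lo


first_pair = speedup_bounds((20, 25), (35, 45))   # "20-25 to 35-45"
intel_pair = speedup_bounds((25, 30), (45, 55))   # "25-30 to 45-55"

for label, (lo, hi) in (("first pair", first_pair), ("Intel pair", intel_pair)):
    print(f"{label}: between {lo:.2f}x and {hi:.2f}x")
```

    Both intervals straddle 2.0x, so ranges this wide cannot settle the "better than linear" question on their own; exact per-matrix-size readings are needed for that.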

    Last point: years ago my friend did Seti. He had an 866 Coppermine Intel, and over in the corner was an old Xeon P2 400. The P2 400 slaughtered the 866 on Seti at half the clock speed, so he decided to try games on it, and games weren't even playable.
    Moral of the story: trying to guess desktop performance based on server chips doesn't always work out the way you think (pointless).


    I don't see the comparison. You equate PIII vs. PII, totally different archs, with Barcelona vs. Phenom, which share the same core.

    Spec clearly showed AMD smashes core on FP with proper code


    SpecFP and SpecFP_rate aren't the same. SpecFP measures pure FP performance; SpecFP_rate runs multiple instances (not threads) of the same bench, with exorbitant bandwidth requirements. http://www.realworldtech.com/forums/...83478&roomid=2

    See, what happens is this: a new processor gets run on Spec; Intel, being a huge beast and media hound, runs over to Spec.org to see where they beat AMD; then they use the media and such to make the tests they win into standard benchmarks; then Intel throws in their shady compiler and bam, they have a winner.




    Anandtech also clearly showed that on code not compiled with the Intel compiler, AMD wins again.


    I don't recall Anandtech disclosing which apps were compiled with what besides Linpack.

  7. #7
    Banned
    Join Date
    Aug 2007
    Posts
    1,014
    Quote Originally Posted by MR_SmartAss View Post
    If you don't understand the charts, or if you don't know what "better than linear" means, don't spread false misinformations. Here is the article from anand.

    As it is obvious to anyone who can read the numbers and the charts:
    Matrix size = 5000: 34 / 21 = 1.62 or 62% scaling
    Matrix size = 30000: 44 / 23.5 = 1.87 or 87% scaling
    There is no matrix size at which K10 scales 100% going from 1 to 2 sockets.
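
    The same arithmetic as a short sketch, using only the four chart readings quoted above (their unit is not given in the post, so none is assumed here). The names are made up for illustration.

```python
# (one-socket reading, two-socket reading) per Linpack matrix size,
# taken from the figures quoted in the post.
LINPACK_READINGS = {
    5000: (21.0, 34.0),
    30000: (23.5, 44.0),
}


def scaling_percent(one_socket, two_sockets):
    """Performance gained by adding the second socket, in percent."""
    return (two_sockets / one_socket - 1.0) * 100.0


results = {size: scaling_percent(*pair) for size, pair in LINPACK_READINGS.items()}

for size, gain in results.items():
    print(f"matrix size {size}: {gain:.0f}% scaling")
```

    100% would mean exactly twice the one-socket result, i.e. linear scaling; both quoted sizes fall short of that.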


    Some of us have already been playing with K10, both before and after it was released. Do you own a K10?

    err.. Can you support this nonsense with anything?

    Again, can you support this, or are you just guessing?

    What does the bandwidth have to do with the performance scaling?

    Neither P2 nor P3 has anything in common with K10, nor does Seti have anything in common with the test from anand. Also, instead of trying to teach us with your noob knowledge, educate yourself. P3 performs so well in games because it supports SSE, while P2 doesn't. The reason why Core2 kicks K8's ass in gaming is Core2's SSE performance. Because K10 has a better SSE engine than K8, K10 performs better than K8 in games.
    You don't have to be a genius to work out the performance scaling of an architecture similar to K8 on the same platform. QuadFX performs like a dual Opteron, with the exception of a few synthetic RAM bandwidth benchmarks. Why should anyone expect this to be different with K10 on the same platform?

    Again you are spreading FUD. Spec clearly showed that Core2 smashes K10 (as well as K8) on both INT and FP. Also, it seems that you don't understand: "the proper" code is the code made by SPEC, and that's the only code executed in these tests.

    Again you are guessing something, and you are guessing wrong. Before you spread FUD, do a little research.

    Are you copy-pasting this from AMDZone, Scientia's or Sharikou's blog?
    Quick, someone argue with this guy! I can't, he's making perfect sense....

  8. #8
    Xtreme Addict
    Join Date
    May 2004
    Posts
    1,756
    false misinformations = true informations

  9. #9
    Xtreme Enthusiast
    Join Date
    May 2007
    Location
    There's no place like 127.0.0.1, Brazil
    Posts
    888
    Quote Originally Posted by LowRun View Post
    false misinformations = true informations
    LOL

  10. #10
    Xtreme Recruit
    Join Date
    Nov 2004
    Location
    Spring Grove, PA
    Posts
    94
    Quote Originally Posted by LowRun View Post
    false misinformations = true informations
    heh, that's right up there with my favourite phrases "i didn't say nothing," and "you don't know nothing."
    opteron 170 @ 2.6ghz 1.45v | koolance pc2
    asus a8n32 sli deluxe | vista ultimate 32
    7900gt ko | 2x1gb mushkin 1:1 2t
    perc4 w/128mb | 3 x 15k u320 raid 0

  11. #11
    Xtreme Mentor
    Join Date
    Feb 2004
    Location
    The Netherlands
    Posts
    2,984
    Quote Originally Posted by LowRun View Post
    false misinformations = true informations
    mon pointe exactly

    Ryzen 9 3900X w/ NH-U14s on MSI X570 Unify
    32 GB Patriot Viper Steel 3733 CL14 (1.51v)
    RX 5700 XT w/ 2x 120mm fan mod (2 GHz)
    Tons of NVMe & SATA SSDs
    LG 27GL850 + Asus MG279Q
    Meshify C white
