Results 1 to 25 of 111

Thread: The Book of Bulldozer - Revelations: Episode 2 (SuperPI / x87)

Threaded View

  1. #1
    Xtreme Legend
    Join Date
    Nov 2003
    Location
    Helsinki, Finland
    Posts
    1,692

    The Book of Bulldozer - Revelations: Episode 2 (SuperPI / x87)

    Exactly two year ago, when I tested a Bulldozer based Zambesi CPU for the first I was shocked.
    The early sample units were even hotter and slower than the final silicon revision CPUs, which finally were released four months later.
    One of the largest single let-down came from the way back: SuperPI.

    SuperPI mainly uses legacy x87 instructions which have been almost completely superceded.
    SuperPI doesn't show any indication what so ever about SMP performance as it can only utilize a single thread. On top of that it has no real world use or purpose as there are newer programs which can calculate PI almost 100 times faster.

    Still, SuperPI can almost be considered as a industry standard.
    Nowdays it is generally a VERY poor indicator of real world performance, yet it is so addictive for any old school overclocker. It scales very well along with the CPU/NB/DRAM/IO performance and tweaking it is a big challenge. An overclocker who hasn't ever benched SuperPI simply doesn't exist.

    SuperPI has a special place in my heart simply because it was one of the first benchmarks I ever ran... almost 14 years ago...

    So, why are all of the 15h (Bulldozer) based CPU/APU/NPUs performing so bad in SuperPI?
    Some people say it is because 15h family has 50% less FPs per core than the preceeding 10h family.
    In 15h family a compute unit (two cores) share a FP when the 10/12h family had a dedicated FP for each of the cores.

    If this would be the only reason, the issue would be solved when the "slave" core of the CU is disabled, leaving a "private" FP for the "master" (BSC) core. However this is not the case and it even shouldn't be as SuperPI is single threaded, remember?

    The caches on 15h family have higher latency than 10h family for example, and SuperPI happens to love large & low latency caches.
    15h family was initially designed for high frequencies. Just like the F1 engines, they produce no power at low revs. And unfortunately it currently doesn't seem to be possible to build an engine capable reving high enough. We might discuss more about the caches in "Episode 3"... If possible.

    Agner Fog from Copenhagen University College of Engineering has made an excellent document about the instruction latencies of the modern CPUs.

    Values for 10h family start from page 26, while 15h family values are located at page 36.

    Anyway...

    Few days when I was doing some low level testing for other purposes, I found something that didn't make any sense to me.
    Now I roughly know what it is and what it does, but still some questions remain: Why does this "feature" exist in the first place and why it is activated on all 15h family parts.

    I would normally assume it is a workaround for some errata, however no bulletin exists for this one either.
    Also this feature does not exist in any documentation, or it does but only AMD has access to the required level.

    I find it hard to believe that it would be a design issue as the affected instructions work fine (but slowly) and it existed since early Zambesi revisions and, currently is still present in Richland and probably beyond (within family 15h)...

    I'd say it is either a errata fix or a errata fix gone wrong.
    If it is a programming mistake which has gone un-noticed during the last two years...
    That would make me just sad

    Parts affected: AMD Barracuda (Zambesi, Vishera), AMD Comal (Trinity, Richland), AMD Virgo (Trinity, Richland)

    Effect: A massive performance hit in application heavily utilizing x87 instructions.

    Negative effects: TBD, none found yet. The performance in non x87 applications remains the same or improves very slightly. No instability, increased power consumption, reduced overclockability or anything else abnormal has been observed. However the final conclusion requires far more extended testing than I am able to do myself.

    After the fix has been applied SuperPI shows 18-30% improvement in performance.
    Bigger the calculation, bigger the improvement.

    Since this kind of fix is quite unheard of, I knew that I would be crucified if I would make such claims without any providing evidence.

    I generally hate to do videos however this time it was mandatory.

    I apologize the quality, 1080p is available but the quality is quite grainy due poor lightning.
    It was a cloudy day in Helsinki today.

    The video shows few important things:

    - In the video the fix is called as "The Plow of Bulldozer"
    - SuperPI 1.5 XS Mod validated by online MD5 checksum
    - CPU-Z 1.64.3 x32 validated by online MD5 checksum (can be found from Stasio's CPU-Z thread)
    - The clocks are being shown during the calculation (look for the affinity and CPU-Z core selection)
    - An external clock reference is provided (to prove there is no tampering with the timers, i.e. "Lab Burst" by MSI)
    - The air cooled setup is shown and so are the CPU temperatures (HWMonitor)





    For the 32M SuperPI run (time) you might want to look a reference from HWBot Piledriver 5G challenge thread.
    http://hwbot.org/submission/2386335_...n_14sec_718ms/

    39 seconds better time with stock CPU clocks (4.1GHz, NB 2500, MEMCLK DDR-2400) than on 5GHz Trinity with 2777MHz NB and DDR-2666 memory clocks.

    Since I 'happened' to have some LN2 in my disposal, I decided to do some high clock SuperPI runs on Richland.





    AMD 32nm SuperPI 32M record taken easily.
    Tomorrow when I throw in a Vishera, the reign of 10h should be finally over

    All of the runs are either completely or partially on video.
    Will upload them once I have time to edit them. I've been filming around 28GB worth of video during the last 48h hours.

    FAQ (as I would assume):

    Q: Does it help on my grandma's Palomino Athlon or on my Phenom / II or 12h Llano?
    A: NO

    Q: Does it work on my xxx branded motherboard?
    A: It has nothing to do with the motherboard, or the OS or any other component. If the CPU says AMD on it and it is Bulldozer based then it will work.

    Q: When will the fix come available to public?
    A: Once I have started and finished the programming. The program is quite simple really, only some minor hardware detection features required.

    And now...
    I will go and have atleast a proper meal and sleep.
    During the last 72 hours I have worked 45 hours, currently I probably look and smell like a pirate.

    Please don't send me pm regarding the software.
    It will be available to everyone at the same time.
    If you have something to ask, please do. I'll answer when I have time.

    "The OP will surely deliver. Let's just wait."
    Just kidding

    Soon.

    Update on 06/21/2013: Available for download. Please check post #42 for the link.

    [B]Update on 06/26/2013: Version R1.02B available Post #75
    Last edited by The Stilt; 06-26-2013 at 03:18 AM.

Bookmarks

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •