
Thread: New Multi-Threaded Pi Program - Faster than SuperPi and PiFast


  1. #1
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by CRFX View Post
    The sad reality of all AMD users, everyone optimizes for Intel.
    TBH, I wrote the FMA and XOP code-paths long before Bulldozer was released. Both would be available in 256-bit, including both FMA3 and FMA4. I had no idea that 256-bit would actually be worse than 128-bit.

    When people started leaking benchmarks on the ES Bulldozers, I was extremely surprised that 256-bit AVX was worse than 128-bit SSE3. But the code had already been written (and tested via emulation). I figured AMD would eventually have native 256-bit execution units. So I don't exactly want to waste any effort "backwards optimizing".

    We'll see. It's been a long time since I've looked at the relevant code. So I can't say exactly how hard it would be to branch the 256-bit XOP code-path and modify it to 128-bit.

    That said, making and maintaining a lot of codepaths is a lot of work. So I am trying to strike a balance between optimizing for everything, and maintainability.
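
    For the curious, here's a rough sketch of what the FMA3 vs. FMA4 split looks like at the source level. This is not code from the program itself; the function names and the simple loop are purely illustrative, and n is assumed to be a multiple of 4.

    Code:
    // Same multiply-add kernel, once with FMA3 intrinsics (Intel Haswell+,
    // compile with -mfma) and once with FMA4 (AMD Bulldozer, -mfma4).
    // out[i] = a[i]*b[i] + c[i], processing 4 doubles per 256-bit iteration.
    #include <immintrin.h>
    #if defined(__FMA4__)
    #include <x86intrin.h>
    #endif
    #include <cstddef>

    #if defined(__FMA__)
    void madd_fma3(double* out, const double* a, const double* b,
                   const double* c, std::size_t n) {
        for (std::size_t i = 0; i < n; i += 4) {
            __m256d va = _mm256_loadu_pd(a + i);
            __m256d vb = _mm256_loadu_pd(b + i);
            __m256d vc = _mm256_loadu_pd(c + i);
            _mm256_storeu_pd(out + i, _mm256_fmadd_pd(va, vb, vc));  // 3-operand FMA3
        }
    }
    #endif

    #if defined(__FMA4__)
    void madd_fma4(double* out, const double* a, const double* b,
                   const double* c, std::size_t n) {
        for (std::size_t i = 0; i < n; i += 4) {
            __m256d va = _mm256_loadu_pd(a + i);
            __m256d vb = _mm256_loadu_pd(b + i);
            __m256d vc = _mm256_loadu_pd(c + i);
            _mm256_storeu_pd(out + i, _mm256_macc_pd(va, vb, vc));   // 4-operand FMA4
        }
    }
    #endif

    On Bulldozer the 256-bit instructions get cracked into two 128-bit operations internally, which is why a 128-bit version of the same loop can end up faster there.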

    Right now, the code-paths I have for v0.6.1 are:
    • x86 (I will most likely be disabling this and removing support for it completely.)
    • x86 SSE3
    • x64 SSE3 ~ Kasumi (AMD K10)
    • x64 SSE4.1 ~ Nagisa (Intel Core 2)
    • x64 SSE4.1 ~ Ushio (Intel Nehalem)
    • x64 AVX ~ Hina (Intel Sandy Bridge)
    • x64 XOP ~ ??? (AMD Bulldozer)
    • x64 AVX2 ~ ??? (Intel Haswell)


    *AVX2 implies FMA3. Both are 256-bit and share the same code-path.
    *XOP implies FMA4. These are also 256-bit and also share a single code-path.
    *The AVX2 code-path doesn't actually use any AVX2 yet (just FMA3). But I'm labeling it AVX2 so that I can add AVX2 instructions later without needing to "upgrade" it from FMA3.

    In any case, it's a lot of code-paths that I'm maintaining. x86 will most likely be dropped for technical reasons. I would have dropped x64 SSE4.1 ~ Nagisa (Core 2) a long time ago if it weren't for my X5482 machine.
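
    To give an idea of how a program picks between code-paths like these at startup, here's a simplified dispatch sketch. It is not the real dispatcher; the enum names are illustrative, it relies on the GCC/Clang __builtin_cpu_supports helper rather than raw CPUID, and it skips the OS-level AVX state check for brevity.

    Code:
    // Pick the best available code-path once at startup (GCC/Clang only).
    #include <cstdio>

    enum class Path { x64_SSE3, x64_SSE41, x64_AVX, x64_XOP, x64_AVX2_FMA3 };

    Path select_path() {
        __builtin_cpu_init();                       // initialize feature detection
        if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
            return Path::x64_AVX2_FMA3;             // Intel Haswell and later
        if (__builtin_cpu_supports("xop"))
            return Path::x64_XOP;                   // AMD Bulldozer (XOP implies FMA4)
        if (__builtin_cpu_supports("avx"))
            return Path::x64_AVX;                   // Intel Sandy Bridge
        if (__builtin_cpu_supports("sse4.1"))
            return Path::x64_SSE41;                 // Core 2 / Nehalem
        return Path::x64_SSE3;                      // baseline (e.g. AMD K10)
    }

    int main() {
        switch (select_path()) {
            case Path::x64_AVX2_FMA3: std::puts("dispatch: AVX2/FMA3"); break;
            case Path::x64_XOP:       std::puts("dispatch: XOP/FMA4");  break;
            case Path::x64_AVX:       std::puts("dispatch: AVX");       break;
            case Path::x64_SSE41:     std::puts("dispatch: SSE4.1");    break;
            default:                  std::puts("dispatch: SSE3");      break;
        }
    }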

    As I had mentioned, the FMA3/AVX2 and FMA4/XOP code-paths will come when:
    1. v0.6.1 is ready.
    2. I get my hands on the needed hardware.


    v0.6.1 (with everything up to AVX) will be released when it's ready regardless of whether I get the hardware for AVX2 and XOP.


    Quote Originally Posted by st0ned View Post
    Hi I don't know if you happen to drop by or if you did get my email, anyhow I'm happy to see you around.

    You answered almost every one of the questions I posed in the email, except for the part where I asked if you intend to start releasing beta versions to the public, or to particular beta testers. If I can do something to help with the code for Windows 8, I have it running on two machines and I might be of use.

    best regards !
    Your email was mainly what prompted me to check on this thread.

    Yes, I'll be doing a public beta once all the important features are in. Right now, v0.6.1 is barely even functional. It has just enough functionality to test the core math-library.

    Once swap mode and validation are done, I'll be releasing a public beta. Only then will I start searching for Haswell and Bulldozer machines to test/tune their code-paths.
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  2. #2
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Just a quick update. I've updated the charts on my website. But I haven't updated them here on XS. I'll get to that later.

    Let me know if I missed your submission. It's been a long time since I've updated the charts (I've been busy), so I easily could have missed someone.

    In any case, there's someone floating around with a dual-sandy, 256 GB, and a bunch of SSDs...
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  3. #3
    Registered User
    Join Date
    Aug 2012
    Posts
    70
    Thanks Alex for updating your lists.

    As I am the one "floating" around :-), I thought there might be some interest in the xtremesystems community if I shared a few words about the 2 systems I built for a personal study project over the last 4 months, and which I used here.

    Traditionally, and especially in the last few years, PCs have grown their computational capabilities faster than any other part of the overall system. Main memory bandwidth grew more slowly than MIPS and MFLOP/s. I/O latency and bandwidth advanced significantly more slowly than sheer compute power, etc. So, from an I/O perspective, systems became more and more "unbalanced" for data-intensive workloads. Yet the amount of data to be processed grew at least as fast as CPU capabilities. Another imbalance developed in hard disks: disk capacity grew faster than sustainable bandwidth, and even faster than the reduction in latency.

    When I started the project to build an I/O-balanced workstation, I took advantage of - from my POV - a few significant developments in recent months, which in combination allow much improved data handling. I'd like to list just 4 components which contribute to the improvements:
    • The new I/O architecture of the latest generation Sandy Bridge CPUs, allowing a massive increase in I/O capabilities
    • The latest generation of low-cost SAS/SATA host bus adapters, which do not impede the performance of parallel SSDs
    • The performance characteristics of individual SSDs are well known, but there is much less experience with parallel configurations in PCs at I/O speeds of over 20 GB/sec
    • The final availability of Windows Server 2012 with a much improved I/O and networking subsystem


    This is not the place to go deeper into I/O, but I used the PCs to run Alex Yee's excellent y-cruncher application as a background task. (All in-memory runs were on an otherwise idle machine.)
    Not only is y-cruncher well optimized on the computational front, but its I/O subsystem is able to hit peak transfer rates of 12 GB/sec and more.
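
    To illustrate the kind of I/O pattern involved (this is not y-cruncher's actual I/O code, and the drive paths below are placeholders), here is a minimal sketch of striping large sequential writes across several independent drives, one thread per drive, and timing the aggregate rate. Note that OS write caching will inflate the result unless the files are much larger than RAM or unbuffered I/O is used.

    Code:
    // Minimal striped-write throughput sketch: one writer thread per drive.
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <fstream>
    #include <string>
    #include <thread>
    #include <vector>

    int main() {
        const std::vector<std::string> paths = {        // placeholder: one file per physical drive
            "D:/stripe.bin", "E:/stripe.bin", "F:/stripe.bin", "G:/stripe.bin"
        };
        const std::size_t chunk = 64ull << 20;          // 64 MiB per write call
        const std::size_t per_drive = 4ull << 30;       // 4 GiB written to each drive
        const std::vector<char> buffer(chunk, 0x5a);    // shared read-only pattern

        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        for (const auto& path : paths) {
            workers.emplace_back([&, path] {
                std::ofstream out(path, std::ios::binary);
                for (std::size_t done = 0; done < per_drive; done += chunk)
                    out.write(buffer.data(), buffer.size());
            });
        }
        for (auto& t : workers) t.join();

        double secs = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();
        double gb = paths.size() * per_drive / 1e9;
        std::printf("wrote %.1f GB in %.1f s -> %.2f GB/sec\n", gb, secs, gb / secs);
    }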

    I am currently writing a paper (mostly on weekends) to describe the systems and the performance characteristics of the HW and SW environment in more detail, but here are a few words about the 2 systems, which are in a kind of constant reconfiguration state:

    1) The single socket PC
    CPU: i7-3960X
    MB: Asus P9X79 WS
    Mem: 8 x 8 GB Kingston DDR3-1600
    Disk controller: 4 x LSI 9207-8i (each with 8 x 6GBit/s SAS/SATA ports)
    Data SSDs: 32 x Samsung 830 (128GB)
    OS: SanDisk SSD 240 GB

    2) The dual socket PC
    CPU: 2 x E5-2687W
    MB: Asus Z9PE-D16 (4 x GBit LAN ports)
    Mem: 16 x 16 GB Kingston ECC DDR3-1600
    Disk controller: 6 x LSI 9207-8i (ea. 8x SATA/SAS ports)
    Data SSD: 48 x Samsung 830 (128GB)
    OS: SanDisk SSD 480GB

    Disk controllers and data SSDs are shared between the 2 PCs, depending on requirements.


    Some comments on the numbers and observations during the runs:
    1. "Small" sizes of Pi (below 100m) achieve better performance when HT is disabled
    2. Overall efficiency expressed as % of peak is more challenging on NUMA machines (vs. single socket machines with one physical memory space)
    3. The 1 trillion pi run generated close to 500 TB of data transfer (avg 725 MB/sec over the total run time and > 12 GB/sec peak)
    4. The Sandy Bridge architecture is an excellent platform for high I/O apps (either dedicated I/O application, or as part of a combined compute/IO application like ycrunch)
    5. The new generation of low cost SAS HBA controllers offer much better scaling than previous generation controllers
    6. As said, the machines are in a constant flux of configurations. The runs were done with I/O system configurations ranging from 0 to 48 SSDs
    7. Long running applications like ycrunch with algorithmic error detection show the value of ECC in RAM
    8. To keep the CPUs safe with this computationally demanding application, I ran them below 60 degree Celcius.
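
    A quick back-of-the-envelope check of item 3, using my own arithmetic and the decimal units quoted above (1 TB = 10^12 bytes, 1 MB = 10^6 bytes):

    Code:
    // Implied run time from ~500 TB transferred at ~725 MB/sec average.
    #include <cstdio>

    int main() {
        double total_bytes = 500e12;              // ~500 TB of data transfer
        double avg_rate    = 725e6;               // ~725 MB/sec average
        double seconds     = total_bytes / avg_rate;
        std::printf("implied run time: %.0f s = %.1f days\n",
                    seconds, seconds / 86400.0);  // ~690,000 s, roughly 8 days
    }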


    I've tried to aggregate the data in the list below as accurately as possible; please let me know of any potential errors.

    With that, thanks to Alex for his great application, and to all community members, enjoy the fascinating world of computing,
    Andy


    PS:
    In the spirit of transparency, and since I mentioned a product of my employer: in my day job, I am currently working as Regional Technology Officer in Microsoft's field organisation in Western Europe.

    (Screenshot: depending on the state of the application, the 16 physical cores (plus HT) were quite busy.)

    (Screenshot: during I/O-intensive periods, the CPU graphs look different.)

    (Screenshot: one snapshot while the application was writing faster than 12 GB/sec.)
    Last edited by Andreas; 11-19-2012 at 11:13 AM.

  4. #4
    Xtreme Enthusiast
    Join Date
    Sep 2007
    Location
    Coimbra - Portugal
    Posts
    699
    Beast system there! Would you care to post just a benchmark of that 48x Samsung array? I just wanted to see the scaling and overall performance out of curiosity.

  5. #5
    Registered User
    Join Date
    Aug 2012
    Posts
    70
    Quote Originally Posted by st0ned View Post
    Beast system there! Would you care to post just a benchmark of that 48x Samsung array? I just wanted to see the scaling and overall performance out of curiosity.
    The peak transfer rates are (measured with IOMeter):
    Read: 20 GB/sec (out of 25 GB/sec theoretical max). CPU load: 2%
    Write: 15 GB/sec (out of 15 GB/sec max)

    IOPS (I/O operations per second):
    2.2 million I/Os with 4 KB transfer size = 8.6 GB/sec transfer rate with random access
    This level of performance is primarily limited by the CPUs, not I/O.
    Next barrier would be the QPI interconnect between the 2 CPUs, then the 6 SATA controllers, then the 48 Samsung drives and lastly the PCIe subsystem.
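
    For reference, the IOPS and transfer-rate figures are tied together by throughput = IOPS x transfer size. A quick check with my own arithmetic, assuming 4 KiB operations and binary gigabytes:

    Code:
    // 2.2 million 4 KiB operations per second expressed as a transfer rate.
    #include <cstdio>

    int main() {
        double iops  = 2.2e6;                       // quoted IOPS (rounded)
        double block = 4096.0;                      // 4 KiB per operation
        double gib_per_sec = iops * block / (1024.0 * 1024.0 * 1024.0);
        // Prints ~8.39 GiB/sec, in the same ballpark as the quoted 8.6 GB/sec
        // once rounding of the IOPS figure is taken into account.
        std::printf("%.2f GiB/sec\n", gib_per_sec);
    }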


    I'll provide more background info and configuration information in the paper.

    kind regards,
    Andy

  6. #6
    Xtreme Enthusiast
    Join Date
    Sep 2007
    Location
    Coimbra - Portugal
    Posts
    699
    Thanks for your time, Andreas. I'm looking forward to reading your paper!

    As I thought, that array's performance is incredible; moreover, achieving 20 GB/sec out of the theoretical 25 is good scaling considering the number of SSDs involved. Do you have any overheating problems with your RAID controllers?
    Another thing I find amusing is that you must already have surpassed, or be very close to surpassing, your RAM's raw read/write speed with your SSDs, although I recognize that RAM access times should be 1/10th of those of the SSDs.


    regards,

    Miguel

  7. #7
    Registered User
    Join Date
    Aug 2012
    Posts
    70
    Miguel,
    there were quite a lot of interesting things I learned through this project, like saturation levels, component selection, etc ...
    To avoid further off-topic deviation in this thread on Pi, I'll open up a new one so we have more space to discuss, and other people interested in this topic of high-performance I/O can join as well.

    regards,
    Andy
