New Multi-Threaded Pi Program - Faster than SuperPi and PiFast

**poke349** · 11-04-2013, 12:21 AM

A bump after a long time...

Here's a screenshot from a binary tuned for AMD Bulldozer using FMA4 and XOP instructions.
AMD FX-8350 @ 4.0 GHz (stock) with 16 GB @ 1333 MHz:

It's not close to done yet as the large size algorithms still need to be re-tuned.
I plan to release this binary in v0.6.4. But it may come earlier (in v0.6.3) if the stuff that's supposed to be in v0.6.3 drags on too long.

If all goes well (which it never does), the ETA for v0.6.3 is late December, with v0.6.4 following in January.

In the meantime, if you have a Bulldozer machine, I highly recommend running the "x64 SSE3 ~ Kasumi" binary instead of what the program auto-selects (which is "x64 AVX ~ Hina"). I've found Bulldozer's 256-bit AVX performance to be pretty crappy.
The author of Prime95 explains in this link: http://www.mersenneforum.org/showthread.php?t=17618
And as such, the FMA4/XOP binary will use 128-bit AVX, FMA4, and XOP instructions.

If the FMA4/XOP binary doesn't make it into v0.6.3, I'll have the version-selector choose "x64 SSE3 ~ Kasumi" instead of "x64 AVX ~ Hina" for AMD Bulldozer-line processors.
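
As a rough illustration of the selection rule described above, here is a minimal sketch. It is not y-cruncher's actual selector: the `CpuInfo` container, the feature-name strings, and the "AMD plus FMA4 plus XOP means Bulldozer-line" heuristic are all assumptions, and every label other than "x64 SSE3 ~ Kasumi" and "x64 AVX ~ Hina" is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuInfo:
    """Hypothetical container; a real selector would fill this from CPUID."""
    vendor: str           # CPUID vendor string, e.g. "AuthenticAMD"
    features: frozenset   # lowercase feature names, e.g. {"sse3", "avx"}


def select_binary(cpu: CpuInfo, have_xop_binary: bool = False) -> str:
    """Pick a binary label for the given CPU (illustrative logic only)."""
    # Assumption: an AMD CPU that reports both FMA4 and XOP is Bulldozer-line.
    bulldozer_line = (cpu.vendor == "AuthenticAMD"
                      and {"fma4", "xop"} <= cpu.features)
    if bulldozer_line:
        # Avoid the 256-bit AVX binary on these chips.
        # "x64 XOP" is a placeholder label for the FMA4/XOP binary.
        return "x64 XOP" if have_xop_binary else "x64 SSE3 ~ Kasumi"
    if "avx" in cpu.features:
        return "x64 AVX ~ Hina"
    if "sse3" in cpu.features:
        return "x64 SSE3 ~ Kasumi"
    return "x86"  # placeholder label for the plain fallback


fx8350 = CpuInfo("AuthenticAMD", frozenset({"sse3", "avx", "fma4", "xop"}))
print(select_binary(fx8350))
print(select_binary(fx8350, have_xop_binary=True))
```

The point of the sketch is only that the AVX feature bit alone is not enough to pick the fastest binary; the microarchitecture has to be taken into account as well.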

Other news: I burned my Sandy Bridge machine last week.

A careless short-circuit took out the motherboard and possibly the CPU as well. So I will no longer be able to do performance tuning for the "x64 AVX ~ Hina" binary. The binary will remain (for a while), but all the tuning parameters can no longer be updated.

**skycrane** · 12-06-2013, 06:13 PM

Hey Poke, nice to see you're still working on this

ive got a question for ya. what do you think of a C6100 8xL5520@2.6 with 96gb of ram

ive got one inbound for a late Dec delivery and wanted to have some fun with it before i dedicate it to 24/7 boinc

**poke349** · 12-08-2013, 12:23 PM

Originally Posted by skycrane

Hey Poke, nice to see you're still working on this

ive got a question for ya. what do you think of a C6100 8xL5520@2.6 with 96gb of ram

ive got one inbound for a late Dec delivery and wanted to have some fun with it before i dedicate it to 24/7 boinc

Wow... Is that an 8-socket I'm seeing?

**skycrane** · 12-08-2013, 01:45 PM

its actually 4 dual socket nodes

that all fit in a 2u rack

**poke349** · 12-10-2013, 10:32 AM

Originally Posted by skycrane

its actually 4 dual socket nodes

that all fit in a 2u rack

That will be interesting. Especially since the NUMA effect will be extreme.

There's one NUMA-friendly algorithm in the program. But it's activated only above 50 billion digits since it is slow. So if you're willing to toy around a bit, I can send you a version with the threshold dropped to, say, 1 billion to see if it does any better than what's available on my website right now.

At some point in the future, I intend to make this threshold adjustable by the user, but it's not that easy... Right now it needs to be hard-coded into the program and recompiled.

**skycrane** · 12-14-2013, 10:38 AM

yea that sounds good it would be fun to try it out. now how all this stuff works im a bit clueless with it. but could we set up a remote login so you can do your magic with the programming? just as long as you dont mess up my boinc workunits

lol

also, i was wondering about the lan connections. i was hoping to get 3 of them and infiniband them all together as a nice lil cluster, but it looks like each node on all 3 racks will need an add-in card. which might be a bit cost prohibitive... lol do you think that a dual Gbit connected to a switch will be fast enough to feed the info between all of the nodes

**poke349** · 12-16-2013, 12:02 PM

Originally Posted by skycrane

yea that sounds good it would be fun to try it out. now how all this stuff works im a bit clueless with it. but could we set up a remote login so you can do your magic with the programming? just as long as you dont mess up my boinc workunits

lol

also, i was wondering about the lan connections. i was hoping to get 3 of them and infiniband them all together as a nice lil cluster, but it looks like each node on all 3 racks will need an add-in card. which might be a bit cost prohibitive... lol do you think that a dual Gbit connected to a switch will be fast enough to feed the info between all of the nodes

It would probably depend on how fast the Infiniband is. For a system of this caliber you're gonna need at least 20 GB/s of sustained bandwidth to have any hope of being able to use it efficiently as shared memory. I'm also unsure of how the high latency is going to play out. Perhaps HyperThreading will be able to cover up most of those delays. I don't know though.

Lemme know when it's ready so I can send you a binary with the high-end algorithm threshold dropped to 1 billion (or even lower). If the performance scaling turns out to be okay on two motherboards, then you can try going higher. That NUMA-friendly algorithm is NUMA friendly because it's heavily optimized to simply not use memory until it's absolutely needed. But it isn't actually "aware" of the NUMA. By comparison, most of the algorithms thrash memory all over the place.

**skycrane** · 12-16-2013, 02:41 PM

damn it needs to be that fast??? so i would need what a dual 10gbt card in each node. then connect all 8 wires to the switch?
did you get my pm? i havent gotten any offers on anything, maybe they are all too expensive?? but its yours until i can sell it. then ill let you do some remote login work for your classes, and to see what magic you can work with the NUMA programming when you have access to this

for testing purposes.. hehe

what i really would love doing with it is some HPC work using boinc to really rack up the stats.... do you think this would work better, or a blade center? ive got all the software i need for it, i just need to know what sort of hardware ive got to get. would you be able to do a lil research and tell me what i need for either the c6100 or this http://www.ebay.com/itm/HP-C7000-BLA...91034888783%26
would be better for what i have in mind?

**poke349** · 12-17-2013, 10:24 AM

Originally Posted by skycrane

damn it needs to be that fast??? so i would need what a dual 10gbt card in each node. then connect all 8 wires to the switch?
did you get my pm? i havent gotten any offers on anything, maybe they are all too expensive?? but its yours until i can sell it. then ill let you do some remote login work for your classes, and to see what magic you can work with the NUMA programming when you have access to this

for testing purposes.. hehe

what i really would love doing with it is some HPC work using boinc to really rack up the stats.... do you think this would work better, or a blade center? ive got all the software i need for it, i just need to know what sort of hardware ive got to get. would you be able to do a lil research and tell me what i need for either the c6100 or this http://www.ebay.com/itm/HP-C7000-BLA...91034888783%26
would be better for what i have in mind?

Yeah it kinda does. At least enough to match the internal bandwidth or the socket <-> socket connection. That's the problem when you try to use distributed memory like shared memory. Latencies can be hidden pretty well with HT and good cache locality, but not bandwidth.
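
To put rough numbers on the gap, here is a small back-of-the-envelope sketch. It compares the raw line rate of the Ethernet options raised in the thread against the 20 GB/s figure mentioned earlier. It ignores protocol overhead and assumes both ports of a dual-port card can be used in parallel, so real throughput would be lower.

```python
BITS_PER_BYTE = 8


def link_gbytes_per_s(gbit_per_s: float, links: int = 1) -> float:
    """Raw aggregate line rate in GB/s, ignoring all protocol overhead."""
    return gbit_per_s * links / BITS_PER_BYTE


TARGET_GB_S = 20.0  # sustained bandwidth figure quoted earlier in the thread

options = [
    ("dual 1 Gbit Ethernet", 1, 2),
    ("dual 10 Gbit Ethernet", 10, 2),
]
for name, gbit, links in options:
    rate = link_gbytes_per_s(gbit, links)
    shortfall = TARGET_GB_S / rate
    print(f"{name}: {rate:.2f} GB/s, about {shortfall:.0f}x short of {TARGET_GB_S:.0f} GB/s")
```

Even the dual 10 Gbit option comes out several times below the target, which is the bandwidth problem described above.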

FWIW, we had 2 GB/s of just disk bandwidth when we did the 10 trillion digit computation of Pi. It was still severely limiting, even though the program is specifically optimized for using disk.
There is somewhat of a fundamental problem though: The FFT algorithm requires very high bisection bandwidth to run efficiently.
Of course this doesn't exist - even on the best-connected supercomputers. So the efficiency is extremely poor on them. (even with specialized distributed implementations)
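
A toy sketch of why an FFT stresses bisection bandwidth. It uses a textbook radix-2 butterfly pattern with the data split into contiguous blocks across nodes, which is an assumption for illustration and not how y-cruncher lays out its transforms. For each butterfly stage it counts how many butterflies need operands from two different nodes.

```python
def crossing_butterflies(n: int, nodes: int) -> dict:
    """Map each butterfly stride to the number of butterflies whose two
    operands live on different nodes, for n points in contiguous blocks."""
    block = n // nodes
    counts = {}
    stride = n // 2
    while stride >= 1:
        crossing = 0
        for i in range(n):
            # i is the lower index of the pair (i, i + stride)
            if i & stride == 0 and i // block != (i + stride) // block:
                crossing += 1
        counts[stride] = crossing
        stride //= 2
    return counts


print(crossing_butterflies(16, 2))
print(crossing_butterflies(16, 4))
```

With two nodes, every butterfly in the widest stage pairs one element from each half, so the whole data set has to cross the link between the halves at least once per transform. Adding nodes adds more such stages.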

That's not to say I can't find a way to do any better. But I have a full-time job now and I don't have as much time as I used to.

but its yours until i can sell it.

I would feel pretty bad taking another machine from you.

I also kind of broke the promise of putting the quad Opteron on WCG. I had it running for a few months, then I realized that I had no way to monitor the health of the machine. (with Summer approaching) So I took it off and used it only for things that needed the NUMA. (to preserve the operational life) So that's how it is right now. It's off most of the time, but every once in a while, I'll boot it up to run some scalability testing.

what i really would love doing with it is some HPC work using boinc to really rack up the stats.... do you think this would work better, or a blade center? ive got all the software i need for it, i just need to know what sort of hardware ive got to get. would you be able to do a lil research and tell me what i need for either the c6100 or this http://www.ebay.com/itm/HP-C7000-BLA...91034888783%26
would be better for what i have in mind?

That looks really cheap for a Sandy Bridge blade. (if I'm reading it right) I would imagine that simple high-end desktops (OCed) would be the cheapest and most power-efficient approach for truly distributed tasks that require little communication. The main reason why you would go with multi-socket boards is to get fast bandwidth between the two chips. But I guess that's not the case here.

**NEOAethyr** · 12-31-2013, 09:14 AM

Hi, I recently lost my array and ran chkdsk midway when recovering it, screwing up a dozen or so programs doing that.
One of them was the copy of y-cruncher I had.

None of the newer ver's work on my config.
I know the exact zip name of the one I need too but it's not online anymore.
y-cruncher v0.5.4.9148 (fix 1).zip

The newer ones just crash, program has stopped working error... (I don't know exactly which ver this started with, I ran 2 other ver's when I 1st got my r4be board a few weeks ago and they both errored out on startup)
And yes I installed both the x86 and x64 packs of vc2010.

I don't have any games installed right now other than some ps2 games, I wanted to test 4.3ghz on my cpu, I think it's stable on stock volts with the pll overvoltage enable setting.
Plan was to run y-cruncher in the bg for 8hrs+ while I watch fma brotherhood, seems I better just reset my pc back to 4.2ghz and wait it out for now lol.

Update:
If I use y-cruncher.exe from the older ver I had, and the files in the binaries folder from the new ver (this is what I was missing), it runs.
It will not run with the newer y-cruncher.exe file though.

However as much as I'd like it to be all good, it's not quite what I want anymore as a stress testing program.
It stresses the cpu a bit too much, I can't game with this while it's in the bg.
Heck I can't do anything while it's running, it lags my mouse so much...
My mouse lags on the intel setup when the cpu is stressed past 90% or so, I noticed this when I 1st got the board but didn't understand what was going on, it was only the other day when messing with avisynth 2.6 mt that I knew for sure what caused the mouse lag.
(Anything past say 90% usage causes it to be slower than say 80%, again noticed this in avisynth, yeah I can't use the last 10% of my cpu without everything slowing down to crap)
10 Threads works out fine but still...

I prefer the algo's used by the older ver I had.
Otherwise I just don't have any use for this program anymore, sorry.
You don't happen to have an old copy of "y-cruncher v0.5.4.9148 (fix 1).zip" lying around do ya?
I found that ver useful...
No offense intended.

Hmm, if I set it to 7gb, 10 threads, and disable all the tests except vst it might be of some use to me, for cpu.
Fft might be useful for mem, I don't know what hnt is though.
Wish it supported a cmd tail though, oh well.
Hmm :\.

**poke349** · 01-02-2014, 12:22 AM

Originally Posted by NEOAethyr

Hi, I recently lost my array and ran chkdsk midway when recovering it, screwing up a dozen or so programs doing that.
One of them was the copy of y-cruncher I had.

None of the newer ver's work on my config.
I know the exact zip name of the one I need too but it's not online anymore.
y-cruncher v0.5.4.9148 (fix 1).zip

You can get the older versions here: http://www.numberworld.org/y-cruncher/versions.html

The newer ones just crash, program has stopped working error... (I don't know exactly which ver this started with, I ran 2 other ver's when I 1st got my r4be board a few weeks ago and they both errored out on startup)
And yes I installed both the x86 and x64 packs of vc2010.

That should not happen. And I haven't received any other reports of this issue. Do you have a screenshot of it or something? It's hard to say what's wrong since I've never seen it before.

I don't have any games installed right now other than some ps2 games, I wanted to test 4.3ghz on my cpu, I think it's stable on stock volts with the pll overvoltage enable setting.
Plan was to run y-cruncher in the bg for 8hrs+ while I watch fma brotherhood, seems I better just reset my pc back to 4.2ghz and wait it out for now lol.

Update:
If I use y-cruncher.exe from the older ver I had, and the files in the binaries folder from the new ver (this is what I was missing), it runs.
It will not run with the newer y-cruncher.exe file though.

That's interesting. How are you running it? Double-click? Command line?

However as much as I'd like it to be all good, it's not quite what I want anymore as a stress testing program.
It stresses the cpu a bit too much, I can't game with this while it's in the bg.
Heck I can't do anything while it's running, it lags my mouse so much...
My mouse lags on the intel setup when the cpu is stressed past 90% or so, I noticed this when I 1st got the board but didn't understand what was going on, it was only the other day when messing with avisynth 2.6 mt that I knew for sure what caused the mouse lag.
(Anything past say 90% usage causes it to be slower than say 80%, again noticed this in avisynth, yeah I can't use the last 10% of my cpu without everything slowing down to crap)
10 Threads works out fine but still...

I prefer the algo's used by the older ver I had.
Otherwise I just don't have any use for this program anymore, sorry.
You don't happen to have an old copy of "y-cruncher v0.5.4.9148 (fix 1).zip" lying around do ya?
I found that ver useful...
No offense intended.

Hmm, if I set it to 7gb, 10 threads, and disable all the tests except vst it might be of some use to me, for cpu.
Fft might be useful for mem, I don't know what hnt is though.
Wish it supported a cmd tail though, oh well.
Hmm :\.

No offense taken.

It's not uncommon for stress-tests to be "too much" for a computer. (especially laptops)

**poke349** · 02-22-2014, 12:51 AM

Some updates on v0.6.4...

Back in November when I "plugged in" my pre-written FMA4 instruction macros, the performance gain on AMD Piledriver was actually negative: some 10 - 20% slower. This was because of the 256-bit AVX that AMD Bulldozer and Piledriver can't handle well.
So I spent some time re-working the FMA4 and XOP code to use 128-bit instead. This got it to a 2% improvement over the SSE3 binary. Pathetic, but at least it's positive.
After rewriting my auto-tuner and running it on my FX-8350, I got the improvement up to 5%.
With more on-and-off tweaking, I've gotten it up to 7 - 8%.

I'm going to leave it at that. I've run out of things to tweak and I'd rather move on to AVX2 for Haswell. 7% is actually quite a large improvement for a new instruction set that doesn't double the vector width.

Here are the benchmarks for v0.6.4 on a stock FX-8350 with 16 GB @ 1333 MHz. (For some reason, this machine resisted all attempts to overclock it. As soon as I take it off of "Auto" settings, the memory goes unstable. And I haven't had the time to really mess with it.)

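Times in seconds:

| Digits | x86 | x86 SSE3 | x64 SSE3 | x64 SSE4.1 | x64 AVX | x64 XOP |
|---|---|---|---|---|---|---|
| 25m | 27.018 | 13.704 | 6.544 | 7.37 | 9.128 | 7.207 |
| 50m | 45.746 | 24.635 | 13.734 | 14.771 | 18.678 | 13.908 |
| 100m | 87.906 | 47.453 | 28.336 | 29.467 | 37.52 | 27.797 |
| 250m | 224.24 | 117.473 | 76.576 | 78.533 | 103.326 | 71.436 |
| 500m | | | 166.859 | 170.067 | 225.879 | 153.344 |
| 1b | | | 376.746 | 382.436 | 503.946 | 338.529 |
| 2.5b | | | 1085.634 | 1132.52 | 1396.444 | 1009.923 |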

Notes:

SSE4.1 is slower than SSE3 because the SSE4.1 binary is specialized for Intel Nehalem. The SSE3 binary is specialized for AMD K10 which Bulldozer/Piledriver seems to like better.
AVX is slower than SSE3/4.1 because Bulldozer/Piledriver can't efficiently handle 256-bit AVX instructions.
The XOP binary doesn't actually get any faster until you pass a certain size. Without going into details, it was what the auto-tuner chose. (for a valid reason) So I stuck with it.
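
The last note can be checked against the table directly. The sketch below copies the x64 SSE3 and x64 XOP columns, treats the entries as run times (lower is better), and prints the relative gain of XOP over SSE3 at each size. The helper name is made up for this example.

```python
# (x64 SSE3 time, x64 XOP time) per size, copied from the table above.
times = {
    "25m": (6.544, 7.207),
    "50m": (13.734, 13.908),
    "100m": (28.336, 27.797),
    "250m": (76.576, 71.436),
    "500m": (166.859, 153.344),
    "1b": (376.746, 338.529),
    "2.5b": (1085.634, 1009.923),
}


def speedup_percent(sse3_time: float, xop_time: float) -> float:
    """Positive when the XOP binary is faster than the SSE3 binary."""
    return (sse3_time / xop_time - 1.0) * 100.0


for size, (sse3, xop) in times.items():
    print(f"{size:>5}: {speedup_percent(sse3, xop):+.1f}%")
```

The gain is negative at 25m and 50m and turns positive from 100m onward, which matches the note that the XOP binary only pulls ahead past a certain size.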

I plan on releasing v0.6.4 before Pi day - provided that I don't find any serious bugs by then.

**Sandon** · 03-12-2014, 08:47 PM

My 25 billion benchmark:

Code:

Constant :  Pi
Algorithm:  Chudnovsky Formula

Decimal Digits    :   25,000,000,000
Hexadecimal Digits:   Disabled

Threads:    32
Mode   :    Ram Only

Start Time: Wed Mar 12 18:44:01 2014

Reserving Working Memory...          117 GB
Constructing Twiddle Tables...      4.38 MB
Allocating I/O Buffers...           0 bytes

Begin Computation:

Summing Series...  1,762,841,738 terms
Time:    7730.888 seconds  ( 2.147 hours )
Division...
Time:    240.249 seconds  ( 0.067 hours )
InvSqrt...
Time:    156.062 seconds  ( 0.043 hours )
Final Multiply...
Time:    104.593 seconds  ( 0.029 hours )

Pi:  8231.793 seconds  ( 2.287 hours )

Base Converting:
Time:    329.628 seconds  ( 0.092 hours )

Writing Decimal Digits:   25,000,000,000  digits written

Verifying Base Conversion...
Time:    154.667 seconds  ( 0.043 hours )

Start Time: Wed Mar 12 18:44:01 2014
End Time:   Wed Mar 12 21:11:31 2014

Total Computation Time:             8561.420 seconds  ( 2.378 hours )
Total Time (with output + verify):  8850.835 seconds  ( 2.459 hours )

CPU Utilization:        1658.35 %
Multi-core Efficiency:  51.8234 %

Last Digits:  Pi
2448547079 5329693979 7145627081 9204187454 9483487803  :  24,999,999,950
1309759846 5364560010 7388984278 8403481193 9913806533  :  25,000,000,000

Version:          0.6.3 Build 9416b (fix 1) (x64 AVX - Linux ~ Hina)
Processor(s):     Genuine Intel(R) CPU @ 2.60GHz
Logical Cores:    32
Physical Memory:  203,221,774,336 (  189 GB )
CPU Frequency:    2,600,380,032 Hz  (frequency may be inaccurate)

Result File: Validation - Pi - 25,000,000,000.txt

Benchmark Successful. The digits appear to be OK.
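
A quick sanity check on the summary figures in the output above. The relationships below are inferred from the printed numbers rather than taken from the program's documentation: the multi-core efficiency looks like CPU utilization divided by the thread count, and the "Pi" time looks like the sum of the four stage times.

```python
threads = 32
cpu_utilization_percent = 1658.35

# Inferred relationship: efficiency = utilization / threads.
efficiency_percent = cpu_utilization_percent / threads
print(f"Multi-core efficiency: {efficiency_percent:.4f} %")

# Stage times in seconds, copied from the log.
stages = {
    "Summing Series": 7730.888,
    "Division": 240.249,
    "InvSqrt": 156.062,
    "Final Multiply": 104.593,
}
pi_seconds = sum(stages.values())
print(f"Sum of stages: {pi_seconds:.3f} s")

# Share of the Pi time spent in the series summation.
series_share = stages["Summing Series"] / pi_seconds * 100.0
print(f"Series summation share: {series_share:.1f} %")
```

The stage sum differs from the printed 8231.793 seconds only in the last digit, which is consistent with each stage time being rounded for display.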

And I started this up:

I don't know if I will actually let it run though:

Code:

Current Settings: (select option # to change setting)

  1     Constant:    Pi
  2     Algorithm:   Chudnovsky Formula

  3     Decimal Digits:        13,300,000,000,000
  4     Hexadecimal Digits:    11,045,410,915,501

  5     Multi-Threading:   32 threads
  6     Write Digits To:   /data/pi
  7     Compress Output:   Yes - Compress digits and split them into multiple
                           files with  100,000,000,000  digits per file.

  8     Computation Mode:  Swap Mode

  9     View Swap Configuration
 10     Change Swap Configuration
 11     Run I/O Benchmark

 12     Min I/O Size:      32.0 MB  per smallest unit. ( 32.0 MB global )

 13     Memory Needed:      179 GB  ( Minimum =  156 MB )
        Disk Needed:       70.5 TB  +  10.1 TB for output

  0     Start Computation!

option: 0



Constant :  Pi
Algorithm:  Chudnovsky Formula

Decimal Digits    :   13,300,000,000,000
Hexadecimal Digits:   11,045,410,915,501

Threads:    32
Mode   :    Swap Mode

Start Time: Wed Mar 12 21:30:19 2014

Reserving Working Memory...          179 GB
Constructing Twiddle Tables...      82.2 MB
Allocating I/O Buffers...           64.0 MB

Begin Computation:

Summing Series...  937,831,802,335 terms
Summing: 0%  ( 32 )  -> ( 4,454,943,452 )

Curious to see how badly performance gets hit when it starts having to hit the disk on Linux. I don't know why CPU usage would have been a problem with O_DIRECT since dd writing even at 1 Gbyte/sec is <20% of a core when using direct I/O.
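
Side note on the settings screen above: the hexadecimal digit count is consistent with scaling the decimal digit count by log(10)/log(16) and rounding up. The sketch below checks that with Python's `decimal` module. The rounding rule is an assumption, and the function name is made up for this example.

```python
from decimal import Decimal, getcontext, ROUND_CEILING

getcontext().prec = 50  # plenty of precision for a 14-digit result


def hex_digits_for(decimal_digits: int) -> int:
    """Hex digits carrying the same information as the given number of
    decimal digits, rounded up (assumed rounding rule)."""
    hex_per_decimal = Decimal(1) / Decimal(16).log10()  # about 0.8305
    exact = Decimal(decimal_digits) * hex_per_decimal
    return int(exact.to_integral_value(rounding=ROUND_CEILING))


print(f"{hex_digits_for(13_300_000_000_000):,}")
```

Using `Decimal` rather than floats avoids any doubt about rounding error at this magnitude.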

**Sandon** · 03-12-2014, 09:21 PM

Stopped and ran the I/O performance analysis thingy:

Code:

I/O Performance Analysis:

Note that this may take a while depending on your hardware configuration.

Working Memory:      179 GB
Swap-file Size:      358 GB
Min I/O Size:       32.0 MB
Computation Threads:    32

Sequential Write:          852 MB/s
Sequential Read:          1.74 GB/s
Threshold Strided Write:   489 MB/s
Threshold Strided Read:    498 MB/s

Overlapped VST-I/O Ratio: 0.5933

Notes:

  - The overall I/O speed is unable to keep up with the CPU(s).
    The I/O throughput is 1.68549x slower than the CPU throughput.
    Large computations will be significantly slowed down by disk access.
    I/O bandwidth can be increased in a number of ways:
      - Add more drives in parallel. This is the obvious way.
        Many machines have 4 or more drives just to run this program!
      - Defragment the drives.
      - Use empty drives. Empty and freshly formatted drives perform best.

  - Your threshold non-sequential I/O bandwidth is very high.
    This may cause sub-optimal algorithm selection for large computations.
    The optimal ratio between sequential/non-sequential I/O is about 3 to 1.
    It is recommended to decrease the "Min I/O Size" setting and re-run
    this benchmark.

  - Your write bandwidth is significantly lower than your read bandwidth.
    It is recommended to examine your storage configuration if you are
    expecting balanced read/write speeds.

Press ENTER to continue . . .

These values don't seem bad?
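
For anyone reading the report above: the "1.68549x slower" figure appears to be simply the reciprocal of the Overlapped VST-I/O Ratio. That is an inference from the printed numbers, not something the program states. The sketch below checks it and also computes the sequential to strided bandwidth ratios that the second note compares against "about 3 to 1", taking 1 GB as 1000 MB (an assumption).

```python
overlapped_ratio = 0.5933  # "Overlapped VST-I/O Ratio" from the report

# Inferred relationship: slowdown = 1 / ratio.
slowdown = 1.0 / overlapped_ratio
print(f"I/O slowdown vs CPU: {slowdown:.5f}x")

# Bandwidths in MB/s, copied from the report.
seq_write, seq_read = 852.0, 1740.0
strided_write, strided_read = 489.0, 498.0

write_ratio = seq_write / strided_write
read_ratio = seq_read / strided_read
print(f"Sequential/strided write: {write_ratio:.2f} to 1")
print(f"Sequential/strided read:  {read_ratio:.2f} to 1")
```

The small mismatch in the last digit of the slowdown is expected, since the ratio itself is printed rounded to four places.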

**NEOAethyr** · 03-16-2014, 11:38 AM

Originally Posted by poke349

You can get the older versions here: http://www.numberworld.org/y-cruncher/versions.html

That should not happen. And I haven't received any other reports of this issue. Do you have a screenshot of it or something? It's hard to say what's wrong since I've never seen it before.

That's interesting. How are you running it? Double-click? Command line?

No offense taken.

It's not uncommon for stress-tests to be "too much" for a computer. (especially laptops)

Sorry I kinda forgot about ya, I didn't realize I still had a screenshot on my drive for ya but never posted it.

Sorry it's not much..

Edit:
If I were eventually able to collect up enough screenshots of my 4930k failing on the edge of stability, usually around 4.5hrs..., would you be able to make it so it runs the same "old" test repeatedly so the error can be found faster?
The newer ver's don't seem to find the error any faster than the older ver that runs perfect (I can launch the newer ones with the old launcher thing).

Ibt and linx don't detect the error at all, tried 8hrs worth and nothing.

Anyways I got one screenshot, I've had it fail twice so far I think it was the same test, the 1st time at 4.5hrs.

**Movieman** · 03-16-2014, 11:51 AM

Yea, after all this time I still own all the records from 25 mil to 5 billion!

Figured by now someone would have stepped in and booted me out!

**poke349** · 03-17-2014, 05:32 PM

Originally Posted by Sandon

Stopped and ran the I/O performance analysis thingy:

Code:

I/O Performance Analysis:

Note that this may take a while depending on your hardware configuration.

Working Memory:      179 GB
Swap-file Size:      358 GB
Min I/O Size:       32.0 MB
Computation Threads:    32

Sequential Write:          852 MB/s
Sequential Read:          1.74 GB/s
Threshold Strided Write:   489 MB/s
Threshold Strided Read:    498 MB/s

Overlapped VST-I/O Ratio: 0.5933

Notes:

  - The overall I/O speed is unable to keep up with the CPU(s).
    The I/O throughput is 1.68549x slower than the CPU throughput.
    Large computations will be significantly slowed down by disk access.
    I/O bandwidth can be increased in a number of ways:
      - Add more drives in parallel. This is the obvious way.
        Many machines have 4 or more drives just to run this program!
      - Defragment the drives.
      - Use empty drives. Empty and freshly formatted drives perform best.

  - Your threshold non-sequential I/O bandwidth is very high.
    This may cause sub-optimal algorithm selection for large computations.
    The optimal ratio between sequential/non-sequential I/O is about 3 to 1.
    It is recommended to decrease the "Min I/O Size" setting and re-run
    this benchmark.

  - Your write bandwidth is significantly lower than your read bandwidth.
    It is recommended to examine your storage configuration if you are
    expecting balanced read/write speeds.

Press ENTER to continue . . .

These values don't seem bad?

I answered most of this in the email reply. But yes, it's an amazing system.

Originally Posted by NEOAethyr

Sorry I kinda forgot about ya, I didn't realize I still had a screenshot on my drive for ya but never posted it.

Sorry it's not much..

In the first case with the illegal instruction, it appears that you don't have proper operating system support to use AVX instructions. But y-cruncher is mistakenly detecting that it does.
The proper behavior of the program is to give you a red warning that your OS doesn't support AVX, then fall back to the SSE4.1 version.

In v0.6.1 - v0.6.4, the AVX binaries use the Microsoft compiler which does no run-time checking for instruction set compatibility. Since my own check is clearly buggy, it proceeded to crash on an AVX instruction.
In v0.5.4 - v0.5.5, the AVX binaries use the Intel Compiler. The Intel compiler does its own compatibility checks and it detects that you don't have proper operating system support. So it refuses to run the AVX binary.

Question: What OS are you running anyway? And service pack? I'd like to know so I can fix the AVX detection.

Edit:
If I were eventually able to collect up enough screenshots of my 4930k failing on the edge of stability, usually around 4.5hrs..., would you be able to make it so it runs the same "old" test repeatedly so the error can be found faster?
The newer ver's don't seem to find the error any faster than the older ver that runs perfect (I can launch the newer ones with the old launcher thing).

Ibt and linx don't detect the error at all, tried 8hrs worth and nothing.

Anyways I got one screenshot, I've had it fail twice so far I think it was the same test, the 1st time at 4.5hrs.

I don't develop older versions of y-cruncher. For that matter, I don't even fix bugs in the latest version unless they are serious. (since I usually have even newer builds*)
So if you plan on sticking with v0.5.5, what you have is it. In v0.6.x, the component stress-tester is fully customizable.

*Hint: My latest developer build has a fully working AVX2 binary...

Originally Posted by Movieman

Yea, after all this time I still own all the records from 25 mil to 5 billion!

Figured by now someone would have stepped in and booted me out!

Shigeru Kondo sent me some benchmarks a while back. I just haven't updated the charts yet, so I don't remember whether they were faster than yours.

**NEOAethyr** · 03-17-2014, 06:45 PM

My os is win7 x64 sp1.
It's just tweaked to heck.

**Movieman** · 03-17-2014, 06:56 PM

Originally Posted by poke349

I answered most of this in the email reply. But yes, it's an amazing system.

In the first case with the illegal instruction, it appears that you don't have proper operating system support to use AVX instructions. But y-cruncher is mistakenly detecting that it does.
The proper behavior of the program is to give you a red warning that your OS doesn't support AVX, then fall back to the SSE4.1 version.

In v0.6.1 - v0.6.4, the AVX binaries use the Microsoft compiler which does no run-time checking for instruction set compatibility. Since my own check is clearly buggy, it proceeded to crash on an AVX instruction.
In v0.5.4 - v0.5.5, the AVX binaries use the Intel Compiler. The Intel compiler does its own compatibility checks and it detects that you don't have proper operating system support. So it refuses to run the AVX binary.

Question: What OS are you running anyway? And service pack? I'd like to know so I can fix the AVX detection.

I don't develop older versions of y-cruncher. For that matter, I don't even fix bugs in the latest version unless they are serious. (since I usually have even newer builds*)
So if you plan on sticking with v0.5.5, what you have is it. In v0.6.x, the component stress-tester is fully customizable.

*Hint: My latest developer build has a fully working AVX2 binary...

Shigeru Kondo sent me some benchmarks a while back. I just haven't updated the charts yet, so I don't remember whether they were faster than yours.

Well I might have some backups laying around here somewhere!

**poke349** · 03-17-2014, 08:10 PM

Originally Posted by NEOAethyr

My os is win7 x64 sp1.
It's just tweaked to heck.

Win7 SP1 supports AVX. So my program is properly detecting it.
But according to this: http://superuser.com/questions/24421...on-my-computer
It looks like AVX can be enabled and disabled in the OS. I haven't tried it, but it's possible your AVX somehow got disabled. (not sure why anyone/anything would want to do that)

Either way, it seems that having a capable OS isn't sufficient. I also need to check that it is enabled.
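
For reference, the usual AVX check has three parts: the CPU reports AVX, the OS has enabled XSAVE (OSXSAVE), and XCR0 shows that the OS saves both XMM and YMM state. The sketch below expresses that logic over raw register values. Python cannot execute CPUID or XGETBV itself, so the function takes the values as arguments. The bit positions follow Intel's documentation; the function name and the sample register values are made up for illustration, and this is not y-cruncher's actual detection code.

```python
CPUID1_ECX_OSXSAVE = 1 << 27  # OS has enabled XSAVE/XGETBV
CPUID1_ECX_AVX = 1 << 28      # CPU supports AVX
XCR0_SSE_STATE = 1 << 1       # OS saves XMM registers
XCR0_AVX_STATE = 1 << 2       # OS saves upper halves of YMM registers


def avx_usable(cpuid1_ecx: int, xcr0: int) -> bool:
    """True only if the CPU supports AVX and the OS has enabled it.

    cpuid1_ecx: ECX returned by CPUID leaf 1.
    xcr0: value returned by XGETBV with ECX=0 (only meaningful when
          the OSXSAVE bit is set).
    """
    if not cpuid1_ecx & CPUID1_ECX_AVX:
        return False
    if not cpuid1_ecx & CPUID1_ECX_OSXSAVE:
        return False  # XGETBV would fault; the OS has not enabled XSAVE
    needed = XCR0_SSE_STATE | XCR0_AVX_STATE
    return (xcr0 & needed) == needed


# Made-up register values for three situations:
cpu_and_os_ok = avx_usable(CPUID1_ECX_AVX | CPUID1_ECX_OSXSAVE, 0b111)
os_disabled = avx_usable(CPUID1_ECX_AVX, 0)
ymm_not_saved = avx_usable(CPUID1_ECX_AVX | CPUID1_ECX_OSXSAVE, 0b011)
print(cpu_and_os_ok, os_disabled, ymm_not_saved)
```

A check that only looks at the AVX bit, or only at the OS version, would pass in the second and third cases and then crash on the first AVX instruction.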

Originally Posted by Movieman

Well I might have some backups laying around here somewhere!

I'll try to update the charts later this week so you can see what kind of competition you have. Although at the moment Shigeru's having some HD troubles...

**Movieman** · 03-17-2014, 08:12 PM

Originally Posted by poke349

Win7 SP1 supports AVX. So my program is properly detecting it.
But according to this: http://superuser.com/questions/24421...on-my-computer
It looks like AVX can be enabled and disabled in the OS. I haven't tried it, but it's possible your AVX somehow got disabled. (not sure why anyone/anything would want to do that)

Either way, it seems that having a capable OS isn't sufficient. I also need to check that it is enabled.

I'll try to update the charts later this week so you can see what kind of competition you have. Although at the moment Shigeru's having some HD troubles...

Will the app support the new 15 core IB xeons?

**poke349** · 03-17-2014, 08:44 PM

Originally Posted by Movieman

Will the app support the new 15 core IB xeons?

No problem.

The app will allow up to 256 threads. But that's an arbitrary limit that I can increase at any time.

**NEOAethyr** · 03-17-2014, 11:29 PM

I didn't even know you could disable avx...
Though I notice aida64 is saying it's disabled as well, but honestly I think it's just a screw up, It's probably working anyways.
There's a lot of odd things here and there that don't work on my systems, past and present.
Cpu load, all sorts of perf counters and so on.
Anyways I'm gonna try this bcdedit mod to see if I can force avx to enable or whatever.

Oh and the reason I asked for support on an older ver, what I meant was that the tests in the new ver's don't detect errors on the cpu any faster than the old ones.
I'm not even sure the new ver's even detect the errors at all.

It's not that I can't get my cpu stable, it's just that when it's so close to the edge of stability, finding a program that gives off an error is pretty hard.
When I know for a fact it's not 100%, yet I can't find any apps that tell me that except for an older ver of y-cruncher, and it taking 1.5 - 4.5hrs to tell me so, well, that's no fun.
That's why when I was saying if I could pool together enough screenshots of the error in the older ver with the exact tests it fails on, if it were possible for those tests to be re-included in the new ver's as a custom test.
Because even the older ver's, I can't pick those tests to run outright, it just doesn't do that it seems.

But then again if you don't wanna mess with it, I'll just keep using the older ver then and wait it out for so many hours to tell me about a cpu vcore error lol.
The prog is great for finding mem errors rather quickly, but finding cpu errors is not so great.
But then again at least it can find them over a long period of time, linx and ibt couldn't find the errors at all, tested both of those overnight a little while back.
They're great for finding cpu errors quickly when the cpu is far from being stable, but when it's only 0.005v -/+ or so off then it's nearly impossible.

Anyways all those errors from that screenshot I posted from the diff major ver's, all stem from my os being a bit too tweaked out.
I need to go back and re-check them all, like for ex. why vc-2008 won't install after tweaking (new ver's work fine...), all sorts of things...
The reason I posted is because the prog should run regardless, I mean, the older ver does lol.

Anyways I'm off to play with bcdedit.
Though I doubt aida64 will be able to tell if it's working either way.

Update:
Ok well, forcing avx to enable doesn't work.
Avx just isn't working on my setup lol, got the os, cpu and etc.
I apparently gutted avx along with float-16 and so on without realizing it, I wouldn't have known with my older amd cpu at the time.
I didn't think there was another say 80% perf boost just waiting for me lol.
I got a fresh os to the side I haven't finished setting up, no tweaks.
I planned on getting around to fixing up my tweaks for x64 win7 but just haven't gotten around to it other then installing windows and calling it quits for the time being, until now anyways.
Sigh I don't even wanna use windows lol, but I don't have the free space right now for linux...

I should try re-stressing my cpu with avx enabled a little later on, I thought I was using it but apparently not...
Linx went from 90 gflops to 160 gflops so apparently not lol.

**xman01** · 03-20-2014, 03:48 PM

re-ran with new hardware

**Sandon** · 03-28-2014, 11:01 PM

Performance test is looking better with a new RAID controller... hopefully it should speed up my 13.3 trillion digit calculation by quite a bit:

Code:

Sequential Write:         1.59 GB/s
Sequential Read:          1.77 GB/s
Threshold Strided Write:   864 MB/s
Threshold Strided Read:    881 MB/s

Overlapped VST-I/O Ratio: 0.779955

Notes:

  - The overall I/O speed is unable to keep up with the CPU(s).
    The I/O throughput is 1.28213x slower than the CPU throughput.
    Large computations will be significantly slowed down by disk access.
    I/O bandwidth can be increased in a number of ways:
      - Add more drives in parallel. This is the obvious way.
        Many machines have 4 or more drives just to run this program!
      - Defragment the drives.
      - Use empty drives. Empty and freshly formatted drives perform best.

  - Your threshold non-sequential I/O bandwidth is very high.
    This may cause sub-optimal algorithm selection for large computations.
    The optimal ratio between sequential/non-sequential I/O is about 3 to 1.
    It is recommended to decrease the "Min I/O Size" setting and re-run
    this benchmark.

I lol'd a bit during the sequential read test: for a while it was > 2 GB/sec and I saw what I think is an easter egg =)

Code:

Sequential Read:          2.01 GB/s  WTF?!?!

The WTF?!?! was in blue.
