Ok, so how many records did I break? :sofa:
First, fix the screenie for your 32M run. :p:
I'd like to see what the 32M SuperPi record of 6 and half minutes on LN2 turns into... :rofl::rofl::rofl:
Otherwise every single run except for the 1M run is a new record. :ROTF:
On that note:
For the SuperPi-sized records on the download page, you should be able to clean-sweep everything from 2M all the way to 1G with 6GB of ram.
The 1M time of .405 seconds is hard to beat without OCing because it's too small of a computation to get much benefit from multi-threading.
This better?:D
http://img505.imageshack.us/img505/2135/ycrunch32mu.jpg
What? No updates yet? It's been 5 mins! :stick:
http://www.numberworld.org/y-cruncher/#Benchmarks
I'll do it tonight. I don't have access to my webserver right now.:rolleyes:
Interesting, your 32M time (9.45s) is slower than the W5580s (9.30s)... even though your memory is probably faster. Seems like that person did some serious tweaking.
You might have to do the same to beat those numbers. But for something larger like 256M, 512M, or 1G, your clock and memory speed advantage should beat any tweak.
For something as small as 32M, there's a lot of thread-creation/destruction overhead. So you might want to disable HT or use the Custom Compute mode to override the thread settings and use fewer threads. At these sizes, the program probably spends a significant amount of time creating and destroying threads... bleh... Then again, I never optimized the program for small computations.
Another possible reason is that since your memory is faster, your timings are more relaxed. I found that Nehalems have more memory bandwidth than the program needs. So tighter timings and slower memory might be better.
@ El Greco
I think we have a winner here! The first person to show up with a non-power-of-two core count! :D:D:D
You didn't unlock the 4th right?
Here's a 25m and 50m
25m on the left, 50m on the right.
http://img8.imageshack.us/img8/3496/...00mhzi7.th.jpg
100m (no cpu-z in this one)
http://img525.imageshack.us/img525/1230/100m.th.jpg
100m is weird. I've run it about 5 times so far, and I've had about 4 failures and 1 pass. Apparently I'm right on the edge of stability for that test.
1m then 32m
http://img299.imageshack.us/img299/8...m4200i7.th.jpg
edit2: and yes, dave is defeated in 1m @ least :p:
take that old man
Updated the list (on the first post of this thread) with some of the SuperPi-sized benchmarks.
I also "think" I've fixed "Sanity Check Error" for the smaller SuperPi-sized benchmarks.
Here's a little something interesting...
I ran some automated benchmarks on my Lan-Box to see how the new version scales with multiple cores.
Here are the results: Main Page
http://www.numberworld.org/y-crunche...raph_small.jpg
The graph shows how many times faster a multi-threaded run is than a single-threaded run.
Obviously, 2 threads cannot do any better than 2x improvement and 4 threads cannot do better than 4x.
When the computation is small, the amount of time spawning threads dominates actual computation time. Therefore it scales very poorly for small computations. The bigger you go, the better it scales.
Notice that below ~5 million digits, Hyper-Threading with 8 threads is slower than 4 threads without HT. This is because the benefits of HT are outweighed by the overhead of spawning double the threads.
So if you have an i7 and you want to get the fastest 1M, 2M, or 4M times, try disabling HT...
This wasn't the case in the older versions. Because of a number of optimizations and bug-fixes, v0.4.1 doesn't scale as well as the older versions for small computations. (But it scales slightly better for large computations.)
Interesting results with SMT on i7. I think it's hilarious your program can calculate 10 trillion digits of pi. Is it stable after using more than 46 gigs?
That's 10 billion. Trillion has another set of zeros... (I lose count around that point too... :D)
Yes, the program is 64-bit. So it has no trouble using as much memory as it wants. As for the computer, it isn't even OC'ed, so yes, it's perfectly stable with 64GB, no problems.
I know of one person who benched this program with 128GB of ram... (namely dual Xeon X5470 with 16 x 8GB DDR2 FB-DIMM)
The program will do a LOT more than just 10 billion.
The current version allows up to 200 billion digits. Though it's only been tested up to 31 billion.
As for what the "true" limit of the program is... I actually have no idea...
I don't see any major "wall" in the implementation until at least 10^16 digits...
But no computer in the world will have the ram or the computational power to test it - not even Road Runner.
My Phenom II wants a spot on the list right ahead of those two Q6600s @ 3.2GHz :)
Code:
Program Version: 0.4.1 Build 7412 (fix 1) (x64 SSE3)
Processor(s): AMD Phenom(tm) II X4 940 Processor
CPU Frequency: 3210978905
Thread(s): 2^2
Digits: 25000000
Total Time: 15.455
Checksum: 006b916ff8c0d1b40b57fc77474331ca
http://i145.photobucket.com/albums/r...y-Cruncher.jpg
My laptop is utterly WEAKSAUCE! I need a Phenom II X4 laptop (If you ask why, you're not xtreme).
For this run I had to clock my memory to 800mhz as I'm running 4-Dimms of mismatched RAM. One set of Corsair Dominators and one set of OCZ Reapers. I haven't figured out timings that allow all four to work at 1066mhz yet. I suppose it's still not a bad score.
http://i145.photobucket.com/albums/r...uncher1Bil.jpg
Yeah, I'm surprised at how evenly Phenom II and Core 2 Quad are matched. The program was written and tuned on two machines: Pentium D and Harpertown (the only two I had at the time). So I'd expect Intels to run faster. But obviously that isn't really the case.
C2Q is still a tiny bit faster though, but not by much. (Your run is a bit faster because of the newer version.)
Mine isn't much better... 1.6 GHz Core Duo...:D I think the 25m time is like 120s or so... I'm waiting for some cheap quad-cores...
Mismatched? What were you running before?
EDIT: Nevermind, I see it in your siggy. I suppose you could try some really conservative timings to see what happens...
And no, it isn't a bad score... The fact that you have enough ram to do 1b automatically makes it a good score... :D since full ram configurations are harder to OC.
EDIT: There isn't really such a thing as a good or bad score... since the range of hardware on that list is massive... (From Atom to Gainestown...) So it's only fair to compare with hardware similar to yours.
That new version is alot faster, same setup as before.
http://i131.photobucket.com/albums/p...mapImage12.jpg
I was thinking the exact opposite. Core 2 has better SSE throughput than K10, but Core 2 is limited by memory bandwidth.
Now for a quick disclaimer on yet-another "sensitive" issue of Intel vs. AMD:
Before anybody yells at me for drawing a conclusion that Intel has faster arithmetic than AMD, this is merely my guesstimate based on the benchmarks. In no way does it indicate that Intel or AMD is better.
Since the vast majority of the program was written and tuned on Pentium and Harpertown (which is a Core 2), I'd expect there to be some favoring towards Intel.
As for the memory bandwidth issue, I've noticed that the program scales pretty poorly on Core 2 Quads... But the only ones I've played with are the Q6600 and Q9400 - both of which have significantly smaller caches than Harpertown.
If we want to throw out the bandwidth factor to determine which (Core 2 or K10) has better arithmetic throughput for this program, we'll need to do a single-threaded benchmark comparison between a Core 2 and a K10 at the same frequency.
My guess is that Core 2 will win (simply because I tuned for it), but I unfortunately don't have access to any K10s to try it.
Anyone have both and care enough to try that? :)
Don't worry about saying one performs better than the other. The truth is both AMD and Intel uarchs have their strengths and weaknesses. If they were both good and bad at the same tasks and performed exactly the same it would make for a pretty boring conversation. Differences are good for this reason.
@Poke and Chumbucket: You two come to different conclusions as to where the performance and bottlenecks are on each platform. I suggest that you may both be correct. AMD has a great HyperTransport platform to work with, and needs to be pushed at full load to shine. Core 2 gets choked up under load because of Chumbucket's explanation and FSB bandwidth anemia.
PS: Dropping my memory from 1066 to 800 only increased my 25M score by less than 0.3 seconds. Not as bad of a hit as I was expecting.
EDIT: Single Threaded Phenom II 25M Test
http://i145.photobucket.com/albums/r...leThreaded.jpg
But then... I'm comparing Core 2 and K10. I've completely left Core i7 out of the equation.
Anyways...
Anyone got a Core 2 @ 3.2 GHz? My workstation is, but it's down and it's not coming back online for a few more weeks.
.3 seconds ~ 2%. That isn't much. Going from triple to dual channel @ 1600 MHz on my friend's i7 only made about 2% difference @ 100m... (the larger you go, the bigger the impact of memory bandwidth)
I know on dual-Harpertown, the difference between 667MHz and 800 MHz is huge - like 10% @ 500m... and it gets bigger and bigger as you scale up the size.
My rig,
Dual Xeon X5482 @ 3.2 GHz + 64 GB (16 x 4GB) @ 800 MHz
beats,
Dual Xeon X5470 @ 3.33 GHz + 128 GB (16 x 8GB) @ 667 MHz
by around 5 - 10% at 250m - 1b.
These aren't your normal desktops, so ignore the sheer quantity of ram.:rofl: It's the speed that matters. :yepp:
There's probably a threshold somewhere where extra bandwidth isn't going to help much. Though I haven't bothered to try to find it.
P.S.
Toradora = WIN!!!!! (sorry for looking at your desktop icons :ROTF:)
Yes, I like me some manga and anime :yepp:. I have no prob with anybody looking at my icons. If I did, I would have edited it out like the rest of you pr0n addicts. Ok jk, I really just wanted everybody to see how leet I am for having old games like Wing Commander Prophecy, Freespace 2, and Lock On which I still play with my Suncom F-15 E Talon+SFS Throttle HOTAS.
Also, I'm JohnnyNismo from www.Houston240sx.com if anyone cares. My drift b***h is once again my daily driver so no fun for me anymore.
Here's a Core 2 Quad at stock. One of the reasons for poor scaling is Core 2's cache: it has no shared L3, so the cores only get 2MB of L2 each.
http://i25.tinypic.com/2rp5oj9.jpg
Those were done using processor affinity right? Because there's no 3-thread mode, and 2-thread mode will use up to 4 threads.
Also I think a big reason is that the program was tuned with 3MB of cache per thread. So massive cache spilling on a bandwidth-limited system will have major penalties.
Here's Q9400 scaling... Also very bad. I also don't know what was causing all of the variation in the benchmarks.
http://www.numberworld.org/y-crunche...raph_small.jpg
Main Page
Here's some very old results with version 0.2.1 on my workstation:
Scaling seems to hit a wall at 7x. I'm almost certain it's the memory bandwidth.
http://www.numberworld.org/y-crunche...raph_small.jpg
Main Page
I need to integrate a bulk-bench option that will generate the data for these graphs without having to do each benchmark by hand.
I already have an automated benchmark add-on (hence how I did all these runs), but it has no interface yet - all options are set in the source code and I have to recompile it every time I change a setting.
I turned off the cores in msconfig so no background tasks would be placed on other cores. I noticed there was no 3-thread mode b/c when I had 3 threads it said I had 4. I will have a speed-up graph soon, but I'm a newb at OpenOffice.
Yes, the program will round the thread count up to the next power of two if it isn't one already.
The reason for limiting it to powers of two is ease of implementation and efficiency of code.
There's a crap-load of binary divide-and-conquering in virtually all the algorithms that are used. Stuff like that just doesn't work well with a non-power-of-two thread count...
I've also found that the penalty of running extra threads is relatively small. Assuming that most computers now (and future) will have either a power-of-two # of cores or a "clean multiple" of one, this restriction is worth the ease of implementation.
Here's a Yorkfield at 3.2GHz, even though it says 3.8 in the program; I guess it calculates MHz using only the stock multi.
http://i131.photobucket.com/albums/p...apImage12m.jpg
Could you do that run single-threaded? We were trying to compare the two in single-threaded mode (no bandwidth bottleneck) to see which (Core 2 or K10) has faster arithmetic for this program.
And I like how you set the FSB and mult to match my workstation. :D
It seems like on Core 2 it uses the stock multiplier. On i7, it uses the actual maximum multiplier (as set in BIOS) before Turbo Boost - but it never uses more than the stock multiplier.
Anyhow... I've got some insane results coming in from someone in Japan with a very well tuned Dual Xeon W5580 rig with 72GB of ram... Benchmark sizes going all the way up to 32G with the help of Swap Mode...
I'll post those later. But they all lack verification checksums.
here you go, affinity set to 1 core
http://i131.photobucket.com/albums/p...pImage15pi.jpg
Awesome :D
So the summary for Core 2 vs. Phenom II. (for y-cruncher)
Single-threaded (arithmetic speed test):
Phenom II @ 3.2 GHz - 55.1057
Core 2 Quad (12MB cache) @ 3.2 GHz - 49.5219
Multi-threaded (arithmetic + bandwidth test):
Phenom II @ 3.2 GHz - 15.455
Core 2 Quad (12MB cache) @ 3.2 GHz - 13.9467
If we did these runs on a Q6600 @ 3.2 GHz, that'll also settle the issue of cache size. :D
The two Q6600s that are already on the list are from v0.3.2.
I'm running a P4 @ 3.15GHz
25M -->152.961s
http://img34.imageshack.us/img34/4516/25mb.th.jpg
50M -->344.768s
http://img199.imageshack.us/img199/7016/50mb.th.jpg
100M -->780.277s
http://img145.imageshack.us/img145/9734/100mb.th.jpg
:up:
I can do a 3.2GHz run on my i7 when I get home. It is 10:33am CST now, I should be able to get it run by 5:30pm.
Poke349: I finally got a new waterblock, the Heatkiller 3.0 CU. I can't believe the thing, 4.4GHz is 100% stable (Linx w/ 8 threads for 24 hours). That block with regular water is better than my old Apogee with ice water, no kidding. 65C full load at 4.2GHz, 1.3v. 75C full load at 4.4GHz, 1.38v. I'll give ice water a shot at some point, I really want to get a 4.6GHz run done. It would be nice if you could include some batch benchmarking.
For example you could set 3 loops then specify a range from X to Y. This way I could run 3 loops of each & save the fastest time and test times of 1m, 2m, 4m, 8m, 16m, etc digits as well as the 25, 50, 100, etc. Also outputting the fastest result to a file would be nice as not to need to copy/paste so much text. What do you think?
I can do them, but my rig is tied up for a few more days. Looks like spdy beat me to it. :D
I completely agree with you. I just need to find the time to polish up my bulk compute add-on and release it.
3 runs of each - Good idea. I'll probably set that as a default with an option to override it. And I'll add a size-limit to looped runs - say 10 min. Otherwise those massive single-threaded 10 and 12b runs on my workstation will take days. :rolleyes:
I can have it output the benchmarks to a separate text file.
Something like 3 categories:
Standard Sizes: 25m, 100m, 250m, etc... all validated - print the best times (with its validation) into a text file.
SuperPi Sizes: 1M, 2M, 4M, etc... all validated, same as above
Multi-core Scaling: 1m, 1.2m, 1.5m, 2m, 2.5m, etc*...
- Manually select threading mode
- No validation
*These are the sizes I used to generate those fancy multi-core scaling graphs.
I'd love to see a multi-core scaling graph from a pair of Gainestowns... :D:D:D But I honestly doubt anyone will be patient enough to sit through single-threaded runs of 1b+.:shrug: For me, I just let it run while I'm at work, run overnight... :rolleyes:
I also need a way to enforce processor affinity. I can't manually force it because I wouldn't know which cores are real and which are virtual from HT.
As for that... Time for some insaneness....
Results from Japan: http://ja0hxv.calico.jp/pai/pietc.html
Google translate it if you can't read Japanese. (I can't either...)
2 x Intel Xeon W5580 Gainestown @ 3.2 GHz
72 GB (18 x 4 GB) DDR3
Windows Server 2008
25m - 6.92
50m - 13.31
100m - 28.14
250m - 76.34
500m - 166.07
1b - 365.20
2.5b - 1,025.05
5b - 2,307.18
10b - 4,961 (1 hour, 22 min, 41 secs)
25b - 19,415 (5 hours, 23 min, 35 secs) - Done using Swap Mode*
1M - 0.37
2M - 0.67
4M - 1.21
8M - 2.31
16M - 4.47
32M - 8.75
64M - 18.02
128M - 38.18
256M - 82.63
512M - 185.41
1G - 398.09
2G - 868.54
4G - 1,928.29
8G - 4,235 (1 hour, 10 min, 35 secs)
16G - 11,892 (3 hours, 18 min, 12 secs) - Done using Swap Mode*
32G - 31,061 (8 hours, 37 min, 41 secs) - Done using Swap Mode*
One thing I have to say... This guy is NUTs...
He gets new workstations like this about once every half a year.
The last few he had are:
2 x Intel Xeon X5470
128 GB (16 x 8 GB) DDR2 FB-DIMM
2 x Intel Xeon X5460
64 GB (16 x 4 GB) DDR2 FB-DIMM
Not only that... He ACTUALLY ran this program for 8+ hours just for a benchmark. That's a pretty good stress test... :rofl::rofl::rofl:
I've done longer runs than that (200+ hours), but that's because they were either tests, or were for size records. Not benchmarks... :shakes:
*Swap Mode requires less memory but is significantly slower.
There's no validation for it, and it's available under the Custom Compute option.
Lastly... Dave, if you're here, you've got some SERIOUS competition.
This guy knows how to tune these things... enough to make his W5580s faster than your W5590s.
i was just on google trends and japan is the #1 country to search core i7.
i7 @ 3.2GHz, 3.6GHz Uncore, Memory @ 1600 7-7-6-16.
Single:
Code:
Benchmark Successful. The digits appear to be OK.
Program Version: 0.4.1 Build 7408 (x64 SSE3)
Processor(s): Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
CPU Frequency: 3,192,005,951 Hz (frequency may be inaccurate)
Thread(s): 1
Digits: 25,000,000
Total Time: 44.5555 seconds
Checksum: 506bd9db81dfe73a07ae66fb5da8af7e
Multi (with HT):
Code:
Benchmark Successful. The digits appear to be OK.
Program Version: 0.4.1 Build 7408 (x64 SSE3)
Processor(s): Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
CPU Frequency: 3,192,005,119 Hz (frequency may be inaccurate)
Thread(s): 8
Digits: 25,000,000
Total Time: 11.4947 seconds
Checksum: d2ec2f25569fffbd04422301296a783b
lolz... :rofl::rofl::rofl:
They always have the newest gadgets for just about everything except for maybe processors... since both Intel and AMD are US-based...
Nice... At some point, I'm gonna need to make a database on my site. But I don't have the time for all that... argh...
And you're hitting 4.6 on plain water? That's just insane... :eek: Because 5GHz is already LN2 territory. Those benches will be interesting and hard to beat. :D
Today, I got a nice look at a 96-core 16 x Dunnington machine with 512 GB ram at a fair... Too bad it was too busy for me to try any benches... :(:(:(:(:(
Not bad considering I'm running SETI atm as well :D The reported CPU frequency is wrong, it's 3.6GHz. I'll be back with some proper results when my PC9200 turns up.
http://img.photobucket.com/albums/v1...pboard02-5.jpg
I got some new fans. :)
A pair of these:
http://www.newegg.com/Product/Produc...82E16835213009
(I didn't get them from newegg though.)
Speed controlled, they are just as quiet as my old ones with slightly more airflow. I run it at this speed normally...
At full power, I can't hear myself talk... :rofl::rofl::rofl:
Now I can safely hit 4.2GHz on air - with more room to spare. :)
This was a stress test more than a benchmark. I intentionally left RealTemp and CPUz on to monitor it.
http://www.numberworld.org/y-crunche...2009_small.jpg
The temps peaked at 84C. They hit 90C when I benched 4GHz with my old fans.
Greetings, poke. I like your benchmark program--it's quite nice. Have you by chance experienced an issue where it doesn't seem to hit all cores very effectively? I've got a 12-core machine where it seems to stay in the 40-60% CPU range for smaller benchmarks (under 32M) and 60-80% for larger ones. It never actually "pegs" so to speak.
Yes, it's a fundamental issue with this type of task. Hence why it's taken a while...
Pi - by its very nature - doesn't parallelize as well as wprime, or any other "artificially made" task.
Why your cores aren't kept busy 100% of the time can be due to several reasons:
- Load imbalance. Most types of scientific computing like this don't split evenly (or at least it's not easy to do so). So some threads will finish before others. When this happens, the threads that are done need to wait for the rest.
- Not every part of the computation is parallelized. Fast operations like additions and subtractions are limited by memory bandwidth, so they will not benefit from multi-threading.
- Thread creation and destruction have a lot of overhead. When the working size for a particular operation is small enough, the overhead of thread creation becomes greater than the benefit of threading. At that point, the program doesn't parallelize it - hence less than 100% cpu.
- Refresh rate of Task Manager. Task Manager and other monitors average cpu usage over a period of time. If the computation is small, there won't be any period of sustained 100% cpu long enough to average out to 100%.
The larger the computation, the smaller the effect of these inefficiencies, and the higher the cpu usage.
With 12 cores, you're probably gonna need to go above 1 billion digits to get cpu usage averaging > 90%.
You WILL need to go up to several billion digits to achieve sustained 100% cpu that can last a few minutes. Most people don't have that kind of ram so it isn't suitable as a stress-test unless you run multiple instances.
CPU usage can be improved if I allow multi-threading to increase memory usage... But that gets prohibitive after a while. This type of computing is already enough of a memory hog as it is. So I prefer the ability to hit larger sizes.
So in some sense, computing Pi is a benchmark that "more closely" resembles real-life scientific computing.
EDIT:
So yes. Have fun with it. :) Tell all those Pi fanatics... they'll need to move away from those C2D's to stay competitive. :rofl::rofl::rofl: jk
Are these guys with 16-thread Xeon systems not hitting 100% CPU usage either?
Is it possible to specify a manual thread count? Inspired by HT people, I'd like to try a run at 24 threads to see if things stay busier.
I definitely wouldn't expect them to... But CPU usage as a whole can be fairly deceiving. Even though it "doesn't look" to be efficient, you're still getting massive speed up.
Here's the 8-core @ 1b screenie on my website:
I consider this a "good" graph. Mostly @ 100% but with dips every 10 - 20 seconds...
http://www.numberworld.org/y-crunche.../cpu_usage.jpg
In most cases it won't be as efficient as this. :(
The only time where I've gotten near sustained 100% cpu is during one of the world size-records I set back in April:
Same computer: 8 cores @ 31 billion digits of a different constant
(click to enlarge)
http://www.numberworld.org/nagisa_ru...2009_small.jpg
This kind of efficiency... is only achievable if you have either a REALLY SLOW computer :rolleyes:, or if you have a completely stupid amount of ram...:rofl:
Good thinking :clap::
The program actually already does that.:D When you run N threads, it will usually run 2N and occasionally 4N threads.
In any case, if you have a non-power-of-2 core count, it rounds up. So on your rig, the program is running in 16-core mode, which uses anywhere from 16 - 64 threads. (There's an option in Task Manager that shows how many threads a process is using.)
You can manually set your settings in the "Custom Compute a Constant" option. But there's no validation.
Just remember that higher % cpu usage doesn't always mean faster time.
Oh wow...I see you're creating and destroying threads during the computation cycle itself. In my own programming I've found it to be a good idea to create x number of threads and then use them all to process pieces of work dispatched from a synchronous controller. That may or may not be practical or applicable to your particular algorithm of course--I won't pretend to be familiar with your project. :) In any case, that does make sense now. I saw as few as 8 threads and as many as 60-some.
I've thought about using thread pools, but I decided against it for a few reasons:
- I couldn't figure out how to use that API. :ROTF:
- The program was written for extremely large computations (for breaking size-records). And on large computations, threading overhead is negligible.
- My intuition told me that a synchronous work-dispatcher might have problems scaling into "many" cores... by many, I mean tens or hundreds...
(Specifically, it would take linear time to dispatch N loads of work for N cores, whereas recursive thread-creation would take only log(N), provided that the memory allocator was efficient.)
- Lastly, I was just plain lazy... :rofl:
If only there were Linux binaries...
If I had more experience with Linux, there would be binaries for it... :(
"Eventually", I'll have Linux binaries... whenever I get the time... :rolleyes:
The entire program has been written to be easily ported to Linux... So I don't expect it to be too hard to do so when the time comes.
Have you tried running it under Wine?
If you compiled for Linux and the Cell, I could run this thing on my PS3. :D Actually, the Cell has been beaten in flops by x86 by now though, and it's a PITA to work with.
I can't "just" compile it. :(
I would need to learn how to use the linux threading libraries first.
To compile for Linux without any code modification, I'd have to disable multi-threading... :mad: which pretty much defeats the purpose of the program.
Judging by the examples I found on how the linux "pthread" library is used, it should be a simple drop-in replacement for the Windows threading library...
But I don't have a machine with linux to try it, nor do I have the time.
As for cell processors... It's a different type of processor so much of the program would have to be rewritten and re-optimized to be efficient.
Have you looked at the boost library?
Thanks for the input.:cool: No, I haven't heard about it.
Seems interesting and it appears to be a "semi-standard" that VS supports...
I didn't know VS actually supported a thread-library other than itself. :eek:
If VS supported pthreads, I would've used pthreads from the start since it's more portable... but no... it just HAS to force you to use the Windows one... :down::down::down:
The thing I don't like about Boost right now is that it's C++. The source code for the program is 99% C - and I kinda want to keep it that way.
It also doesn't look like it can be a drop-in replacement for the WinAPI because of all that object-management stuff... :mad:
Whereas WinAPI and pthreads have almost the same usage format.
Also, v0.4.2 should be out in a week or so...:D I added a batch mode as requested by a number of people. Right now, I'm still testing it...
The only thing is that the batch benchmarks aren't validated. It'd be kind of a mess if I generated a checksum for every single benchmark.
I might add validation for them in the future. But as of right now, I'll leave it out.
Got a test cpu, mobo & ram to goof around with. When I get bored with it I'll throw in a Q9550 or similar and give it to the wife. :)
Pentium 4 661 in Windows XP 64 bit running the SSE3 x64 executable @ 4.7GHz with and without HT:
25m no HT
Code:
Benchmark Successful. The digits appear to be OK.
Program Version: 0.4.1 Build 7412 (fix 1) (x64 SSE3)
Processor(s): Intel(R) Pentium(R) 4 CPU 3.60GHz
CPU Frequency: 4,717,549,367 Hz (frequency may be inaccurate)
Thread(s): 1
Digits: 25,000,000
Total Time: 73.1281 seconds
Checksum: 65cb672d996bc3240db0fda7909acc3b
25m+HT
Code:
Benchmark Successful. The digits appear to be OK.
Program Version: 0.4.1 Build 7412 (fix 1) (x64 SSE3)
Processor(s): Intel(R) Pentium(R) 4 CPU 3.60GHz
CPU Frequency: 4,717,524,888 Hz (frequency may be inaccurate)
Thread(s): 2
Digits: 25,000,000
Total Time: 65.6558 seconds
Checksum: 085dd47af97911d30c9fa6d03a9ce1e0
50m no HT
Code:
Benchmark Successful. The digits appear to be OK.
Program Version: 0.4.1 Build 7412 (fix 1) (x64 SSE3)
Processor(s): Intel(R) Pentium(R) 4 CPU 3.60GHz
CPU Frequency: 4,717,561,175 Hz (frequency may be inaccurate)
Thread(s): 1
Digits: 50,000,000
Total Time: 168.288 seconds
Checksum: cd31559715e07e3619136cc16cd2cf2b
50m+HT
Code:
Benchmark Successful. The digits appear to be OK.
Program Version: 0.4.1 Build 7412 (fix 1) (x64 SSE3)
Processor(s): Intel(R) Pentium(R) 4 CPU 3.60GHz
CPU Frequency: 4,717,547,001 Hz (frequency may be inaccurate)
Thread(s): 2
Digits: 50,000,000
Total Time: 153.715 seconds
Checksum: 7b0843fca01a0b691b42d7d33fe11537
100m
Code:
Benchmark Successful. The digits appear to be OK.
Program Version: 0.4.1 Build 7412 (fix 1) (x64 SSE3)
Processor(s): Intel(R) Pentium(R) 4 CPU 3.60GHz
CPU Frequency: 4,717,545,210 Hz (frequency may be inaccurate)
Thread(s): 1
Digits: 100,000,000
Total Time: 377.512 seconds
Checksum: ae6656d43ce984db08c62a84afc1ec02
100m+HT
Code:
Benchmark Successful. The digits appear to be OK.
Program Version: 0.4.1 Build 7412 (fix 1) (x64 SSE3)
Processor(s): Intel(R) Pentium(R) 4 CPU 3.60GHz
CPU Frequency: 4,717,527,780 Hz (frequency may be inaccurate)
Thread(s): 2
Digits: 100,000,000
Total Time: 343.165 seconds
Checksum: 1f7061c8731d2da31a774b0d16de8145
Yep, even at 4.7GHz a P4 is still garbage. :yepp:
Ooh...
64-bit P4... :clap:
Wouldn't shoving a C2Q into an old 775 require more than a BIOS update?
I thought about putting a Q6600 into my Pentium D machine... but then I checked that the chipset doesn't support C2Q even with a bios update. :(
Here's a screenie of v0.4.2:
It should be out in a couple days. It isn't quite ready for release right now - there's (yet another) bug in the VS compiler that I'm wrestling with... :down:
http://www.numberworld.org/y-crunche...0.4.2_peak.jpg
The text file can be copy and pasted into excel. :)
I've found 3 bugs in the VS compiler in the last 8 months... :yepp: 2 of them related to high memory usage in x64... I think it's safe to say that not too many people use VS for high-memory programming.
I was going to build her a quad setup anyway, so I got the DFI DK P35-T2RS which supports the duo/quad chips. I saw a cheapo P4 661 so I figured I'd get it and test how far it'll go and how it compares with my old Opteron @ 3GHz. So far, even at 4.7GHz and with some low timings on some Crucial BallistiX Tracer memory, it isn't looking good vs the Opty. :D
Good news about the release of v0.4.2, I'm glad to see the batch mode added. :) Have you tried testing the Intel compiler? I had some decent luck with it even on AMD machines. The profile-guided optimization can really give some programs a boost. I have a feeling most of your cpu time is spent in hand-optimized assembly so PGO likely wouldn't be of any benefit, but I figured I'd mention it just in case you haven't messed with it yet.
Originally Posted by poke349
Actually... There is no assembly whatsoever in the entire program. :p:
As an undergraduate, my programming knowledge is too narrow to write good assembly. So that right there is a major potential for improvement.
Although I haven't run a profiler on it yet, my guess is that it spends most of its time pipeline-stalling and waiting for memory (cache misses, or simply insufficient memory bandwidth...)
Specifically, I have a feeling that unprefetched cache misses are significant, but I haven't played with my memory timings nor have I run a profiler to determine their impact.
Version v0.4.2 is out!
Here's a screenie of that stress tester...
Temperature-wise, it's on par with Prime95.
But I don't know if it can detect stability issues as well as Prime95 - let alone Linpack.
http://www.numberworld.org/y-crunche...test_small.jpg
hey poke, there is a 16-core quad-socket system in the WCG section by jcool and I think it would pwn on this benchmark. It's only got 4 gigs right now but it supports a junkload of memory. I wonder how much RAM this would use idling in Windows?
I think he already knows.
But the fact that there's so little RAM is gonna be severely limiting... With THAT many cores, you're gonna need a massive computation size to keep them busy. 4 GB won't be enough to keep those cores fed... But... it might be enough to break some of the current speed records anyway. :D
25,000,000 digits:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Intel(R) Xeon(R) CPU W3520 @ 2.67GHz
CPU Frequency: 2660015200
Thread(s): 2^3
Digits: 25000000
Total Time: 12.924
Checksum: c90910e6b1387d740fe4352132ee4855
2,500,000,000 digits:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Intel(R) Xeon(R) CPU W3520 @ 2.67GHz
CPU Frequency: 2660034352
Thread(s): 2^3
Digits: 2500000000
Total Time: 2404.996
Checksum: c2abaca2a09340b91922847dd0ffc278
Here is my Dual Shanghai System.
Code:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 2900166909
Thread(s): 2^3
Digits: 25000000
Total Time: 11.406
Checksum: e195c35a015ca7c4079e7e2c4c727037
Code:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 2900164958
Thread(s): 2^3
Digits: 50000000
Total Time: 23.549
Checksum: 8d2ea0a8ccbda4511edad184dd0c405f
Code:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 2900166047
Thread(s): 2^3
Digits: 100000000
Total Time: 49.112
Checksum: 77fec20038094164021df20c23143c6c
Code:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 2900172767
Thread(s): 2^3
Digits: 250000000
Total Time: 139.788
Checksum: e0befcb870941172171bfc300fa992bd
Edit: I will run more over the next couple of days. I have 8GB of RAM currently, and when I get to 16GB I will run the larger ones.
Code:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 2900162991
Thread(s): 2^3
Digits: 500000000
Total Time: 301.514
Checksum: cf5090c8ca14560219b83cc1de7f90e2
@ SamHughe
Nice 2.5b run. Haven't been seeing many of those. :yepp::D;)
Did you pretty much have to close everything to get it to fit into 12GB?
Or did you just let it page out?
Woah... the first AMD dualie. :cool:
Quick question: Is that running at 2.9 GHz as the program says or is it @ 3.3 GHz (in your siggy)?
Since it's the first time I've seen this program run on an AMD dualie, would you be able to do benchmarks for:
1,000,000
10,000,000
1,000,000,000
I'd love to add it to the comparison chart that's here: :D
http://www.numberworld.org/y-cruncher/#Benchmarks
The 1m and 10m aren't in the benchmark options, so you'll need to use either the batch-mode or the custom compute option.
As above, here is my dual Shanghai 2389 @ 3.33GHz.
Code:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 3336182273
Thread(s): 2^3
Digits: 25000000
Total Time: 9.720
Checksum: 409d8961151453331ebd8fd5b49ffab9
Code:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 3336181988
Thread(s): 2^3
Digits: 50000000
Total Time: 20.084
Checksum: f18969bf2caaadc01bd3a3a05bdca6c0
Code:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 3336180534
Thread(s): 2^3
Digits: 100000000
Total Time: 42.306
Checksum: 9af038030bc9e0370ee84f1c8ad8babf
Quote:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 3336181279
Thread(s): 2^3
Digits: 250000000
Total Time: 113.583
Checksum: c7a217871dbd8aac1fd1a0a86c0c6b23
Quote:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 3336180470
Thread(s): 2^3
Digits: 500000000
Total Time: 249.244
Checksum: 330ec95f5a05c4971e373ab94be5664d
Quote:
Program Version: 0.4.2 Build 7438 (x64 SSE3)
Processor(s): Quad-Core AMD Opteron(tm) Processor 2389
CPU Frequency: 3336183313
Thread(s): 2^3
Digits: 1000000000
Total Time: 552.905
Checksum: 52c37453ac4eb2ba0abec8f7913ea03b
Code:
1,000,000 Digits
Writing Decimal Digits: 1,000,001 digits written
Total Computation Time: 0.617 seconds ( 0.000 hours )
Total Time (including writing digits): 0.709 seconds ( 0.000 hours )
Code:
10,000,000 Digits
Writing Decimal Digits: 10,000,001 digits written
Total Computation Time: 4.288 seconds ( 0.001 hours )
Total Time (including writing digits): 4.659 seconds ( 0.001 hours )
Here's mine, can do more later. This is 32m.
4.51ghz, 3.87ghz Uncore. 32m 11.297s
http://i25.tinypic.com/2a5dt2p.png
Woah, nice. New single socket record! :D
Was turbo-throttling in effect during the run? Your temps are hitting 88C, but 75-80C is usually the point where turbo throttling starts.
Are you on air?
As for all the suggestions about the Intel Compiler.
I've tried it, and the results are interesting.
Initially, after messing with the compiler options, it couldn't do any better than 5% slower than the current build (with the Visual Studio Compiler).
After tweaking the source code a bit, I've gotten it to about 1% faster than Visual Studio. :D:D:D And I don't think I'm done yet. :yepp:
I'll need to finish my tweaks (and hopefully do better than 1%) and do a ton of other tests on my workstation to make sure it's consistently faster.
On the other hand... The binary (when compiled with ICC) is HUGE... :shocked:
I think I ran Prime earlier, that's why it said it was so hot. But my chip is a high-leakage chip, so it runs hot. My chip starts throttling at 100C. Oh, and this is on water too. I'll try 4.6GHz later, but my brand new DFI X58 UT just died, went pop!
You killed your DFI? :( Sorry to hear that... RMA?
The "main" throttling occurs at 100C, but Turbo mode gets disabled when you hit 80C or so. So the multiplier drops from 21 to 20.
It will actually bounce back and forth between 20 and 21 many times per second. Once you get high enough (85+ C), then it pretty much stays at 20.
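If it helps to visualize what that bouncing does to the average clock, here's a tiny sketch. The 133.33 MHz base clock is the stock i7 BCLK; the duty-cycle fraction is purely a made-up input for illustration:

```python
# Time-weighted average clock when the CPU bounces between the 21x
# turbo multiplier and the 20x non-turbo multiplier. BCLK is the stock
# i7 value; the fraction of time spent at 21x is hypothetical.
BCLK_MHZ = 133.33

def effective_mhz(frac_at_21x):
    """Average frequency for a given turbo duty cycle (0.0 to 1.0)."""
    return BCLK_MHZ * (21 * frac_at_21x + 20 * (1 - frac_at_21x))
```

So a chip that spends half its time knocked down to 20x only loses about 67 MHz on average, which is why turbo-bouncing barely shows up in benchmark times.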
Core i7 920 @ 3.927GHz
SuperPi 32M - 9m31.187s
y-cruncher:
250M - 127.051s
500M - 280.013s
1 Billion - 616.635s
2.5 Billion - 4129.34s
http://i431.photobucket.com/albums/q...rsuperPi-1.png
It was thrashing like crazy for 2.5b. I wasn't running anything else in the background other than my normal apps. I expected 2.5b to be hard on my system though.
Interesting. At least on my machine, even when I have more than 500MB of stuff open, Windows will page it out enough to let the program have the full 11.5 GB for the entire run.
Win7 change in memory manager?
Anyone want to try and confirm this?
thx :)
Did you verify the 350m? Only the benchmarks and two of the batch-mode options verify that the digits are correct. The "Custom Compute" option does NOT, since there are no checksums for them.
Just because it finishes without warnings doesn't mean it finished without any errors.
The stress test, on the other hand, DOES verify that everything is correct.
The way it works is that it cycles through all the major constants that the program can compute. For each constant, it runs two computations using different algorithms. Then it matches them to see if they are correct.
It uses two threads that run completely independently of each other. Because of "dips" in the CPU usage for each computation, one isn't quite enough to keep everything busy. But two does pretty well - at least up through 8 cores. ;)
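The compute-twice-and-compare idea is easy to sketch. This is not y-cruncher's actual code (it uses far faster algorithms and its own arithmetic); it's just a toy Python illustration that computes pi with Machin's formula and with the Gauss-Legendre (AGM) iteration, then checks that the truncated digit strings agree:

```python
from decimal import Decimal, getcontext

def arctan_inv(x, digits):
    """arctan(1/x) by its Taylor series, good to ~digits decimal places."""
    eps = Decimal(10) ** -(digits + 5)
    x2 = Decimal(x) * x
    term = Decimal(1) / x
    total = term
    n, sign = 3, -1
    while abs(term) > eps:
        term /= x2
        total += sign * term / n
        n += 2
        sign = -sign
    return total

def pi_machin(digits):
    """Machin's formula: pi = 16*arctan(1/5) - 4*arctan(1/239)."""
    getcontext().prec = digits + 10          # guard digits
    pi = 16 * arctan_inv(5, digits) - 4 * arctan_inv(239, digits)
    return str(pi)[:digits]

def pi_agm(digits):
    """Gauss-Legendre (AGM) iteration; digit count doubles per step."""
    getcontext().prec = digits + 10
    a, b = Decimal(1), 1 / Decimal(2).sqrt()
    t, p = Decimal(1) / 4, 1
    for _ in range(digits.bit_length() + 2):
        a, b, t, p = (a + b) / 2, (a * b).sqrt(), t - p * ((a - b) / 2) ** 2, 2 * p
    return str((a + b) ** 2 / (4 * t))[:digits]
```

If a hardware error flips a bit during either computation, the two digit strings will almost certainly disagree, which is exactly the property the stress test exploits.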
http://img297.imageshack.us/img297/9...uncher500m.jpg
lame result but still...
nah... That looks about right for your processor/frequency. :cool:
Something interesting here:
http://www.itocp.com/bbs/thread-35879-1-1.html
The guy did a string table hack on the binaries. It'll print correctly only if you have your regional settings set to read ascii in some specially encoded way that works for Chinese characters.
And it isn't complete. Some of the other "secondary features" aren't translated at all.
Which gives me an idea. For the next release (v0.4.3), how about I link to an external .ini settings file that has the full string table in Unicode?
That way it can be easily translated to any language by simply editing the .ini string table. :yepp::yepp::yepp:
It's kinda drastic. But, I plan on implementing a very aggressive anti-tamper protection into v0.4.3 that will pretty much block any changes to the binary.
Which would make translation (via string-table hack) impossible unless the protection itself is broken. So the only way to allow any sort of language support is to move the strings out of the binary. :D
Any other ideas?
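For what it's worth, the external string-table idea is simple to prototype. A hedged sketch in Python (the section name and keys are made up for illustration; they are not y-cruncher's real string IDs):

```python
import configparser

# Built-in English strings used as the fallback. Keys are invented
# for illustration only.
DEFAULT_STRINGS = {
    "menu_benchmark": "Benchmark Pi",
    "menu_stress": "Stress Test",
}

def load_strings(path):
    """Load a translated [strings] table from a UTF-8 .ini file.

    Any key missing from the file falls back to the built-in English
    text, so a partial translation still produces a usable UI.
    """
    strings = dict(DEFAULT_STRINGS)
    parser = configparser.ConfigParser()
    if parser.read(path, encoding="utf-8") and parser.has_section("strings"):
        for key in DEFAULT_STRINGS:
            if parser.has_option("strings", key):
                strings[key] = parser.get("strings", key)
    return strings
```

The nice property is that the binary itself never changes: translators only edit the .ini file, which sidesteps the anti-tamper protection entirely.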
Here's my score with Q8400 @ 3.6GHz, 485FSB. RAM @ 1165MHz
http://img.photobucket.com/albums/v1...pboard02-7.jpg
Little bit faster 4.6ghz, 32m
http://i30.tinypic.com/o7iv55.png
and 25k
http://i32.tinypic.com/4sm9gm.png
and 50k
http://i28.tinypic.com/sp7uw4.png
dude your records are on wikipedia.:clap:
Mine are? Link me :D I know I can go faster
You probably set a single-socket speed record, but poke set the record for most digits calculated for several constants. I couldn't find some of the other ones, but they need to be updated.
http://en.wikipedia.org/wiki/Euler-Mascheroni_constant
http://en.wikipedia.org/wiki/Catalan's_constant
looks like kondo beat you on e!
http://en.wikipedia.org/wiki/E_(mathematical_constant)
Where do you see the times at? BTW cool links, wish I was smart enough to understand math like that!
If anyone is still wondering what all that ram was for... ;)
Here's the one you missed:
http://en.wikipedia.org/wiki/Ap%C3%A9ry%27s_constant
There aren't any pages for the two Natural Logs.
As for e... well... :(
Here's my excuse: :rofl::rofl::rofl:
Using the Basic Swap Mode, all computations of N digits require a little more than 2N memory.
So with 64GB of ram, the largest computation I can do is 31 billion digits.
So until I write an Advanced Swap Mode (the thing that PiFast and QuickPi have), I'm stuck at 31 billion digits with my "puny" 64 GB of ram. :ROTF::(:rofl:
32GB would've been enough to break the records. But at the time we built the rig (September 2008), the memory requirement was expected to be 4N (I hadn't implemented it yet).
So 64GB was needed to break 10 billion. (which was what the records were at the time.)
Then in December, I realized a 2N algorithm - which effectively allowed us to hit 31 billion.
Any higher, and multiplication will no longer fit in ram. So it will require a completely different scheme for operating on HDs.
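The arithmetic behind that ceiling is straightforward. A rough sketch (the 3% overhead factor here is a guess; the real constant depends on the program's internals, which is why the actual limit lands near 31 billion rather than the naive 33):

```python
def max_digits(ram_bytes, bytes_per_digit, overhead=1.03):
    """Rough ceiling on computation size: N digits need a little more
    than bytes_per_digit * N bytes of memory. The overhead is a guess."""
    return int(ram_bytes / (bytes_per_digit * overhead))

ram = 64 * 1024**3            # 64 GB
big = max_digits(ram, 2)      # the ~2N requirement: roughly 33 billion here
old = max_digits(ram, 4)      # the original 4N plan would have halved that
```

The same arithmetic explains the September 2008 sizing decision: under the 4N assumption, 32GB would have capped out well short of the 10-billion-digit records of the time, so 64GB was the safe choice.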
I've updated the "Fastest Times" section on my site with your new numbers... :up::up::up: Insane 4.6 GHz...
Woah :cool:
New single-socket records again!!! :D
EDIT: Is it safe to assume that's on water?
Single-socket is closing in on Dave's Gainestowns for 25m...:rolleyes:
Just curious, how stable is that OC? I noticed that turbo is disabled.
And if anyone has screenies of a failed benchmark. I'm curious to see them. :p:
Whether it fails because the digits don't match at the end, or if it catches and corrects an error....
Be kind of interesting to see what the most common non-crash/BSOD failure is. :):D;)
Yep water, pic is in sig. Turbo on GB board is just 24 multi all the time, 23 multi OC's better, so turbo is off.
4.6ghz is stable enough to run prime blend 20 mins without crashing, some with bigger cojones primed 4.6ghz for 8hrs posted in the i7 database intel thread.
4.65 stable to run chess computer stress, posted there.
4.69, ran 25 and 50 several times, never crashed. 100 I tried to run once and it crashed (BSOD) about 3/4 of the way through. Did not try upping vcore more, probably won't until winter when I can set the rad in cooler temps. It is between 4.6 and 4.7 where stability markedly changes.
hmmm looks like I'll have to try 4.75ghz now hehe. Kinda board limited though but haven't really pushed too hard
did batch run on rig 1 and rig 2 on winxpx64
hope this is valid enough
rig2
http://img10.imageshack.us/img10/1484/79440791.jpg
rig1
http://img186.imageshack.us/img186/1314/x41.jpg
Downstairs with ~18-19C ambients I got 100 to run, and a little higher: 4.76GHz, 1.55 vcore.
http://img44.imageshack.us/img44/3193/ycrunch476125.jpg
http://img154.imageshack.us/img154/6...unch476150.jpg
http://img156.imageshack.us/img156/3...nch4761100.jpg
Run with my Core i7 920 @ 4210MHz
http://img30.imageshack.us/img30/2724/96969833.th.jpg
:up:
Woah... :eek: Is that voltage for real? Is that one of those 3845/3849 batches?
With 1.15v I can only get up to 3.9 GHz... and only marginally stable.
I need 1.225 to bench @ 4.2, 1.25 to "barely" pass 2G Pi, and 1.275 to get prime/linpack stable.
And just wondering, did you play with memory timings? I'm curious to see how sensitive the program is to timings.
Ah... Something like different timings for 25m and something larger like 250m. And maybe with and without HT. But I shouldn't be asking for anything.
Threading overhead and randomness makes the small benchmarks somewhat unreliable as an indicator of program efficiency.
HT has an effect of "hiding" latencies because when one thread stalls for memory access, the other thread gets the whole core to itself...
I haven't gotten around to trying it myself because my rig is usually busy with stuff that I can't pause.
Obviously this program is very memory intensive. And I know for sure that Core 2 bandwidth is bottlenecking. But I've never been able to gauge the effect of latencies.
Timing sensitivity, I would guess, is a good indicator of whether the program will scale well on EX servers.
you could buy vtune for like 1000 dollars.:D
http://software.intel.com/en-us/intel-vtune/
Was about to do 4.7ghz+ but then my board died again! damn u dfi
http://i599.photobucket.com/albums/t...1Sep062144.jpg
thought i'd join in..