
Thread: Rewriting SuperPi

  1. #26
    Xtreme Enthusiast
    Join Date
    May 2007
    Posts
    831
    Quote Originally Posted by FUGGER View Post
    Our Super Pi will stay our version, muffinflavored pi or whatever you want to call it to avoid any confusion.

    But go nuts on making your version, I'm not one to hold anyone back from succeeding, just don't do it thinking you are replacing Super Pi.
    That is not my goal.

    Everyone might be seeing this as more than it really is.
    I am just trying to calculate x digits of pi, and see how long it takes.
    I do not know if the memory usage will be amazing, or anything.
    Hopefully in the end, it will be that way.

    If anyone knows anything about the Gauss-Legendre algorithm for calculating pi, let me know.
    Last edited by MuffinFlavored; 04-03-2008 at 04:22 PM.
    Gigabyte P35-DQ6 | Intel Core 2 Quad Q6700 | 2x1GB Crucial Ballistix DDR2-1066 5-5-5-15 | MSI nVIDIA GeForce 7300LE

  2. #27
    Xtreme Guru
    Join Date
    Jan 2005
    Location
    waukegan
    Posts
    3,607
    I hope it will be even faster, cuz RAM is faster than any HDD.
    mobo: strix b350f
    gpu: rx580 1366/2000
    cpu: ryzen 1700 @ 3.8ghz
    ram: 32 gb gskill 2400 @ 3000
    psu: corsair 1kw
    hdd's: samsung 500gb ssd 1tb & 3tb hdd

  3. #28
    Xtreme Enthusiast
    Join Date
    May 2007
    Posts
    831
    So far, I can calculate 16 digits of pi correctly.

    Code:
        #include <stdio.h>
        #include <math.h>

        int main(void) {
            /* Gauss-Legendre (AGM) iteration; the number of correct digits roughly doubles each pass */
            long double a = 1.0L;
            long double b = 1.0L / sqrtl(2.0L);
            long double t = 0.25L;
            long double p = 1.0L;
            long double temp, pi;

            int iterations = 4;
            int n = 1;

            while (n <= iterations) {
                temp = a;
                a = (a + b) / 2.0L;
                b = sqrtl(temp * b);
                t = t - p * (temp - a) * (temp - a);
                p = 2.0L * p;
                n++;
            }

            pi = (a + b) * (a + b) / (4.0L * t);
            printf("%.18Lf\n", pi);   /* %Lf, not %lf, is the conversion for long double */
            return 0;
        }
    I am trying to avoid the use of an external library to support large numbers.
    "long double" is at most 16 bytes (and on most x86 compilers it is really just 64-bit or 80-bit precision), so it only gives about 16-19 significant digits.
    Gigabyte P35-DQ6 | Intel Core 2 Quad Q6700 | 2x1GB Crucial Ballistix DDR2-1066 5-5-5-15 | MSI nVIDIA GeForce 7300LE

  4. #29
    Banned
    Join Date
    Feb 2006
    Location
    Hhw
    Posts
    4,036
    Quote Originally Posted by MuffinFlavored View Post
    So far, I can calculate 16 digits of pi correctly.

    Code:
        #include <stdio.h>
        #include <math.h>

        int main(void) {
            /* Gauss-Legendre (AGM) iteration; the number of correct digits roughly doubles each pass */
            long double a = 1.0L;
            long double b = 1.0L / sqrtl(2.0L);
            long double t = 0.25L;
            long double p = 1.0L;
            long double temp, pi;

            int iterations = 4;
            int n = 1;

            while (n <= iterations) {
                temp = a;
                a = (a + b) / 2.0L;
                b = sqrtl(temp * b);
                t = t - p * (temp - a) * (temp - a);
                p = 2.0L * p;
                n++;
            }

            pi = (a + b) * (a + b) / (4.0L * t);
            printf("%.18Lf\n", pi);   /* %Lf, not %lf, is the conversion for long double */
            return 0;
        }
    I am trying to avoid the use of an external library to support large numbers.
    "long double" is at most 16 bytes (and on most x86 compilers it is really just 64-bit or 80-bit precision), so it only gives about 16-19 significant digits.
    What SPi does is save temporary results to HDD/RAM, which is why it's so tweakable. You won't get far if you only use built-in data types; you have to store intermediate results somewhere.

    Edit:

    Or, just combine data types, e.g. two 'long double's where the first one holds the leading zeros. Actually it's the same as I said above: you're doubling the variables which are used to hold the data, so it will double the RAM usage, and at some point the OS might start using the swap file for some reason.
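
    (Purely illustrative, and not how SPi actually does it: one minimal way to "store intermediate results somewhere" is to keep a big number as an array of base-10^9 limbs. The names, the limb count and the addition-only interface below are all made up for this sketch.)
    Code:
        /* Big number stored as LIMBS unsigned ints in base 10^9, least significant limb first. */
        #include <stdio.h>

        #define LIMBS 8                /* 8 limbs * 9 digits = 72 decimal digits */
        #define BASE  1000000000u      /* each limb holds 9 decimal digits */

        typedef struct { unsigned int limb[LIMBS]; } bignum;

        /* r = a + b, carrying between limbs */
        void big_add(bignum *r, const bignum *a, const bignum *b) {
            unsigned long long carry = 0;
            int i;
            for (i = 0; i < LIMBS; i++) {
                unsigned long long s = (unsigned long long)a->limb[i] + b->limb[i] + carry;
                r->limb[i] = (unsigned int)(s % BASE);
                carry = s / BASE;
            }
        }

        void big_print(const bignum *x) {
            int i;
            printf("%u", x->limb[LIMBS - 1]);      /* most significant limb first */
            for (i = LIMBS - 2; i >= 0; i--)
                printf("%09u", x->limb[i]);        /* inner limbs padded to 9 digits */
            printf("\n");
        }

        int main(void) {
            bignum a = {{999999999u, 1u}};         /* a = 1999999999 */
            bignum b = {{2u}};                     /* b = 2 */
            bignum r;
            big_add(&r, &a, &b);
            big_print(&r);                         /* prints ...0002000000001 */
            return 0;
        }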
    Last edited by Marvin_The_Martian; 04-03-2008 at 11:37 PM.

  5. #30
    Xtreme Enthusiast
    Join Date
    May 2007
    Posts
    831
    Quote Originally Posted by Marvin_The_Martian View Post
    What SPi does is save temporary results to HDD/RAM, which is why it's so tweakable. You won't get far if you only use built-in data types; you have to store intermediate results somewhere.

    Edit:

    Or, just combine data types, e.g. two 'long double's where the first one holds the leading zeros. Actually it's the same as I said above: you're doubling the variables which are used to hold the data, so it will double the RAM usage, and at some point the OS might start using the swap file for some reason.
    The way the algorithm I was using worked, it returned twice the amount of digits calculated each time. That is not good.

    Trying a digit-extraction method, I can get much better results.
    But the method I have now takes a very long time to calculate 16384 digits of pi (16 KB).
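
    (Side note, purely as a sketch of what a digit-extraction method can look like: the BBP formula extracts hexadecimal, not decimal, digits of pi at an arbitrary position. The tail length and output width below are arbitrary choices, and this is not necessarily the exact method used above.)
    Code:
        #include <stdio.h>
        #include <math.h>

        /* 16^n mod m, by floating-point exponentiation by squaring */
        double pow16_mod(int n, double m) {
            double result = 1.0, base = fmod(16.0, m);
            while (n > 0) {
                if (n & 1) result = fmod(result * base, m);
                base = fmod(base * base, m);
                n >>= 1;
            }
            return result;
        }

        /* fractional part of the sum over k of 16^(d-k) / (8k + j) */
        double series(int j, int d) {
            double s = 0.0;
            int k;
            for (k = 0; k <= d; k++) {
                double den = 8.0 * k + j;
                s += pow16_mod(d - k, den) / den;
                s -= floor(s);
            }
            for (k = d + 1; k <= d + 16; k++)          /* small tail where 16^(d-k) < 1 */
                s += pow(16.0, (double)(d - k)) / (8.0 * k + j);
            return s - floor(s);
        }

        int main(void) {
            int d = 0, i;                              /* hex position after the point */
            double x = 4.0 * series(1, d) - 2.0 * series(4, d)
                     - series(5, d) - series(6, d);
            x -= floor(x);

            printf("hex digits of pi from position %d: ", d);
            for (i = 0; i < 8; i++) {                  /* expect 243F6A88 for d = 0 */
                x *= 16.0;
                printf("%X", (int)x);
                x -= (int)x;
            }
            printf("\n");
            return 0;
        }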

    But, I have this question.

    What does everyone prefer:
    a pi-calculating benchmark
    or a prime-calculating benchmark
    Gigabyte P35-DQ6 | Intel Core 2 Quad Q6700 | 2x1GB Crucial Ballistix DDR2-1066 5-5-5-15 | MSI nVIDIA GeForce 7300LE

  6. #31
    Banned
    Join Date
    Feb 2006
    Location
    Hhw
    Posts
    4,036
    I think the algorithm doesn't matter to most; for calculating prime numbers you can use P95, and pi has already been calculated to idk how many digits.

    People will care about how tweakable the benchmark is, and about how consistent the scaling will be (e.g. predictability, so you can weed out any scores which are simply bugged or cheated). Best would be a cheat-safe bench of course.

    My .02c.

  7. #32
    Xtreme Enthusiast
    Join Date
    May 2007
    Posts
    831
    Quote Originally Posted by Marvin_The_Martian View Post
    I think the algorithm doesn't matter to most; for calculating prime numbers you can use P95, and pi has already been calculated to idk how many digits.

    People will care about how tweakable the benchmark is, and about how consistent the scaling will be (e.g. predictability, so you can weed out any scores which are simply bugged or cheated). Best would be a cheat-safe bench of course.

    My .02c.
    I would very much like to make something new.
    If anyone knows anything that would stress a CPU (I am now even trying to write some DirectX 9 applications), please let me know.
    I will attempt to implement it.
    Gigabyte P35-DQ6 | Intel Core 2 Quad Q6700 | 2x1GB Crucial Ballistix DDR2-1066 5-5-5-15 | MSI nVIDIA GeForce 7300LE

  8. #33
    Xtreme Addict
    Join Date
    Apr 2007
    Posts
    2,128
    So, you know HOW to do it, but you don't know WHAT to do?

    What about doing something GRAPHICAL instead? Software rendering.

    What about doing the "plasma effect" from oldie intros/demos? Don't use LUTs, calculate everything over and over again for every pixel, with cosines and sines.

    Basic idea goes like this:
    go through every pixel on the screen (with for(){}'s).
    calculate the pixel color, e.g. c = 128 + cos((x*y)/120)*64 + sin(x/50 + y/30)*48 + cos((y/x)*60)/16
    set the pixel at the appropriate point on the screen (direct framebuffer access would be way faster though).
    loop for 2 minutes.
    calculate the amount of rendered frames.
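
    (A minimal, CPU-only sketch of that per-pixel loop - the "frame" is just a buffer, there is no real display, and the resolution, the colour formula and the 10-second run are made-up examples.)
    Code:
        #include <stdio.h>
        #include <math.h>
        #include <time.h>

        #define W 1600
        #define H 1200

        static unsigned char frame[W * H];

        void render_frame(double t) {
            int x, y;
            for (y = 0; y < H; y++) {
                for (x = 0; x < W; x++) {
                    /* no lookup tables: fresh sin/cos for every pixel */
                    double c = 128.0
                             + 64.0 * cos((double)(x * y) / 120.0 + t)
                             + 48.0 * sin(x / 50.0 + y / 30.0 + t);
                    frame[y * W + x] = (unsigned char)((int)c & 0xFF);
                }
            }
        }

        int main(void) {
            clock_t start = clock();
            int frames = 0;
            double secs;
            while (clock() - start < 10 * CLOCKS_PER_SEC) {   /* roughly 10 s of CPU time */
                render_frame(frames * 0.05);
                frames++;
            }
            secs = (double)(clock() - start) / CLOCKS_PER_SEC;
            /* print something read from the buffer so the work can't be optimized away */
            printf("%d frames in %.2f s (%.2f fps), sample pixel %u\n",
                   frames, secs, frames / secs, (unsigned)frame[0]);
            return 0;
        }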

    You can put a huge amount of load on a CPU if you don't use LUTs and use big resolutions. Think about 1600*1200*10 trigonometric functions in one second. Too fast? Add a more complex formula.

    It doesn't really matter what it renders, as long as it renders with the CPU. I'd be interested in watching some kind of an effect rather than "Calculating... Please wait..." text over and over again. Oh, and whatever you do, please do it with cross-platform in mind.

    Or what about image rotating? HUGE maze generation? 3D rendering? Fractal rendering? There is SO much you can do, just be creative.
    Last edited by Calmatory; 04-06-2008 at 11:04 PM.

  9. #34
    Xtreme Member
    Join Date
    Jun 2007
    Location
    Finland
    Posts
    236
    Have you, muffinflavored, seen this thread?

    Maybe you would be better off coding the stuff calmatory said.

    I'm not against you rewriting SuperPi, but it seems to me that there are already many versions of the same thing.
    ASUS P5Q DELUXE
    Intel Q9450
    Mushkin 4GB
    Asus GTX260

  10. #35
    Xtreme Enthusiast
    Join Date
    May 2007
    Posts
    831
    You are right.
    I need to create a DirectX 9 benchmark that heavily uses the CPU.

    2 in 1 bonus.

    I have already started.
    Gigabyte P35-DQ6 | Intel Core 2 Quad Q6700 | 2x1GB Crucial Ballistix DDR2-1066 5-5-5-15 | MSI nVIDIA GeForce 7300LE

  11. #36
    Xtreme Addict
    Join Date
    Apr 2007
    Posts
    2,128
    Why include DX?

    DX with a few cubes, textures, simple shaders and some primitive lighting techniques is about the same for the GPU as 1+1 is for the CPU - a VERY poor idea. Either you need to stress, stress a lot, and even more, or not stress at all.

    Besides, unless you really are going to STRESS the GPU, it will be CPU-bound, and then the DirectX implementation is just a waste of time. You can get graphics with the CPU as well. Every 3DMark, including 3DMark06, is CPU-bound; are you, alone, aiming higher than Futuremark went with a full team of professionals?

    So, make one good, instead of two poor benches, I am fairly certain that 1+1 = 2 isn't good in this case.

    And no, I am not trying to put you down or anything, just saying what I think.

    (Well, tbh, I believe SuperPI is #1 because it has HWBot etc. support. It's the de facto standard. If you ask me, the NucRus MultiCore benchmark or wPrime would be better benches than SuperPI, but SPI is a legend, and nothing is going to take that away, no matter how good it is/what it stresses. Sad but true. I'd personally be more interested in wPrime/NucRus etc. Or a 3D software renderer.)

  12. #37
    Xtreme Enthusiast
    Join Date
    May 2007
    Posts
    831
    Quote Originally Posted by Calmatory View Post
    Why include DX?

    DX with a few cubes, textures, simple shaders and some primitive lighting techniques is about the same for the GPU as 1+1 is for the CPU - a VERY poor idea. Either you need to stress, stress a lot, and even more, or not stress at all.

    Besides, unless you really are going to STRESS the GPU, it will be CPU-bound, and then the DirectX implementation is just a waste of time. You can get graphics with the CPU as well. Every 3DMark, including 3DMark06, is CPU-bound; are you, alone, aiming higher than Futuremark went with a full team of professionals?

    So, make one good, instead of two poor benches, I am fairly certain that 1+1 = 2 isn't good in this case.

    And no, I am not trying to put you down or anything, just saying what I think.

    (Well, tbh, I believe SuperPI is #1 because it has HWBot etc. support. It's the de facto standard. If you ask me, the NucRus MultiCore benchmark or wPrime would be better benches than SuperPI, but SPI is a legend, and nothing is going to take that away, no matter how good it is/what it stresses. Sad but true. I'd personally be more interested in wPrime/NucRus etc. Or a 3D software renderer.)
    I guess I must have said something wrong in my first post.

    I am not trying to replace 3DMark.
    I am not trying to replace SuperPi.
    I am not trying to replace wPrime.
    I am not trying to "replace" existing benchmarks.

    I was just going to see if I could "program" a very stressing CPU benchmark, and then try to multi-thread it.

    I know 1+1 is not stressing.
    I know drawing individual pixels is not stressing.

    I will attempt to make something difficult, stressing, and worthwhile.

    All I can do is try.
    Gigabyte P35-DQ6 | Intel Core 2 Quad Q6700 | 2x1GB Crucial Ballistix DDR2-1066 5-5-5-15 | MSI nVIDIA GeForce 7300LE

  13. #38
    Xtreme Enthusiast
    Join Date
    Oct 2005
    Location
    Melbourne, Australia
    Posts
    529
    Quote Originally Posted by MuffinFlavored View Post
    I just need a way to benchmark the CPUs.
    After that, I can make it less memory dependent, more CPU dependent, anything.
    That will just make it more biased to Core 2 chips with their huge cache, but no onboard memory controller.

  14. #39
    Banned
    Join Date
    May 2005
    Location
    Belgium, Dendermonde
    Posts
    1,292
    try something like FFT or DFT? very intensive
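
    (For illustration, a naive O(n^2) DFT makes a decent pure floating-point load; the transform size and the test signal below are arbitrary, and a real benchmark would more likely use an optimized FFT.)
    Code:
        #include <stdio.h>
        #include <math.h>

        #ifndef M_PI
        #define M_PI 3.14159265358979323846
        #endif

        #define N 4096

        static double re_in[N], im_in[N], re_out[N], im_out[N];

        /* naive O(N^2) discrete Fourier transform - pure floating-point work, no FFT tricks */
        void dft(void) {
            int k, n;
            for (k = 0; k < N; k++) {
                double sum_re = 0.0, sum_im = 0.0;
                for (n = 0; n < N; n++) {
                    double angle = -2.0 * M_PI * (double)k * (double)n / N;
                    sum_re += re_in[n] * cos(angle) - im_in[n] * sin(angle);
                    sum_im += re_in[n] * sin(angle) + im_in[n] * cos(angle);
                }
                re_out[k] = sum_re;
                im_out[k] = sum_im;
            }
        }

        int main(void) {
            int n;
            for (n = 0; n < N; n++) {
                re_in[n] = sin(2.0 * M_PI * 5.0 * n / N);   /* a simple test tone in bin 5 */
                im_in[n] = 0.0;
            }
            dft();
            printf("|X[5]| = %f (expect about N/2 = %d)\n",
                   sqrt(re_out[5] * re_out[5] + im_out[5] * im_out[5]), N / 2);
            return 0;
        }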

  15. #40
    Xtreme Addict
    Join Date
    Apr 2007
    Posts
    2,128
    Basically people "care" about how their CPU brand is doing in benches. If AMD beats Intel in wPrime and NucRus (don't heat up please, just an example!) then Intelists shout for "real world benches" or say that Intel > AMD since Intel is faster in SuperPI. Same goes if Intel > AMD. It just depends who you ask, but generally people care about "real world" (read: Crysis, 3DMark CPU benches, SuperPI & PiFast and maybe CINEBENCH).

    So what you might wanna do is stress the CPU in multiple ways: integer division and multiplication, floating-point division/multiplication, bit shifting, string manipulation, base-10 to base-16 conversions and vice versa, trigonometric functions...
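
    (A rough sketch of what such a mixed loop could look like - the iteration counts, constants and checksum below are arbitrary; it's only meant to show the idea of mixing integer, floating-point, shift and base-conversion work.)
    Code:
        #include <stdio.h>
        #include <math.h>

        int main(void) {
            unsigned int acc = 12345u;
            double facc = 1.0001;
            unsigned long checksum = 0;
            char hex[16];
            unsigned int i;

            for (i = 1; i <= 20000000u; i++) {
                acc = (acc * 2654435761u) / (i | 1);    /* integer multiply + divide */
                acc ^= (acc << 7) | (acc >> 25);        /* bit shifts (a rotate) */
                facc = facc * 1.0000001 / 0.9999999;    /* floating-point multiply + divide */
                if ((i & 0xFFFFFu) == 0) {
                    sprintf(hex, "%X", acc);            /* value to base-16 string */
                    checksum += (unsigned long)hex[0];
                }
                checksum += acc & 0xFu;
            }
            /* print a checksum so the compiler can't throw the work away */
            printf("checksum %lu, facc %f, sin(facc) %f\n", checksum, facc, sin(facc));
            return 0;
        }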

    If there is such a benchmark which really tests all of that (let's call it X), people will STILL want to see "real world" benches, despite benchmark X testing an even WIDER variety of features than the average "real world" bench.

    Pretty much screwed, aye?

    Well, not really, unless one aims to make a benchmark which people care about and regard as the best meter of CPU speed (which it could in reality be, thanks to the wide variety of features tested). But after all, the main idea and motivator is the will to learn and experiment, right?

  16. #41
    Xtreme Enthusiast
    Join Date
    May 2007
    Posts
    831
    Quote Originally Posted by Calmatory View Post
    Basically people "care" about how their CPU brand is doing in benches. If AMD beats Intel in wPrime and NucRus (don't heat up please, just an example!) then Intelists shout for "real world benches" or say that Intel > AMD since Intel is faster in SuperPI. Same goes if Intel > AMD. It just depends who you ask, but generally people care about "real world" (read: Crysis, 3DMark CPU benches, SuperPI & PiFast and maybe CINEBENCH).

    So what you might wanna do is stress the CPU in multiple ways: integer division and multiplication, floating-point division/multiplication, bit shifting, string manipulation, base-10 to base-16 conversions and vice versa, trigonometric functions...

    If there is such a benchmark which really tests all of that (let's call it X), people will STILL want to see "real world" benches, despite benchmark X testing an even WIDER variety of features than the average "real world" bench.

    Pretty much screwed, aye?

    Well, not really, unless one aims to make a benchmark which people care about and regard as the best meter of CPU speed (which it could in reality be, thanks to the wide variety of features tested). But after all, the main idea and motivator is the will to learn and experiment, right?
    Very true. From what you are saying, it seems the end result, in comparison to other people's end results, almost matters more than the benchmark itself.

    Yes, I am in it for the learning and experimentation, and the fellow members of XS could be, sort of "beta testers".
    Gigabyte P35-DQ6 | Intel Core 2 Quad Q6700 | 2x1GB Crucial Ballistix DDR2-1066 5-5-5-15 | MSI nVIDIA GeForce 7300LE

  17. #42
    Xtreme Addict
    Join Date
    Apr 2007
    Posts
    2,128
    If you ever need a tester, feel free to count me in!

    May I ask what kind of C++ experience you have? i.e. how much? When did you start, what have you done so far? Other languages? Just curious.

  18. #43
    Xtreme Enthusiast
    Join Date
    May 2007
    Posts
    831
    Quote Originally Posted by Calmatory View Post
    If you ever need a tester, feel free to count me in!

    May I ask what kind of C++ experience you have? i.e. how much? When did you start, what have you done so far? Other languages? Just curious.
    A tester? Didn't you send me the code of the ASM benchmarks? You need to be writing with me!

    To be honest, I have what I guess people would call "minimal" C/C++ experience. I kind of combined the two when I was learning, but now I am trying to stick with C.

    I am 14 years old, I started by just reading a couple of tutorials, and I went from there.

    I know a lot of PHP, Python, and C/C++ (I think of them as the close relatives )

    And ever since you sent me that CPU Bench 1.02 source code, I have been reading about ASM opcodes.

    Do you want to do some sort of combined project?
    Gigabyte P35-DQ6 | Intel Core 2 Quad Q6700 | 2x1GB Crucial Ballistix DDR2-1066 5-5-5-15 | MSI nVIDIA GeForce 7300LE

  19. #44
    Xtreme Addict
    Join Date
    Apr 2007
    Posts
    2,128
    Hmm, you must be mixing me up with someone else, since as far as I know, I haven't sent anyone anything related to ASM benchmarks, as I don't even know ASM well enough.

    Why don't you want to use external libraries? The clock() in time.h is AWFULLY inaccurate under heavy CPU load (the running time for my plasmas was over 14 seconds, whereas it should have stopped right after 10 seconds had passed), so there is no accurate way of measuring time with it.

  20. #45
    Xtreme Enthusiast
    Join Date
    May 2007
    Posts
    831
    Quote Originally Posted by Calmatory View Post
    Hmm, you must be mixing me up with someone else, since as far as I know, I haven't sent anyone anything related to ASM benchmarks, as I don't even know ASM well enough.

    Why don't you want to use external libraries? The clock() in time.h is AWFULLY inaccurate under heavy CPU load (the running time for my plasmas was over 14 seconds, whereas it should have stopped right after 10 seconds had passed), so there is no accurate way of measuring time with it.
    Then who sent me those benchmarks?

    I should find a better way to measure time, but by external libraries, I meant a bignum library, because I felt it would not provide the best performance.
    Gigabyte P35-DQ6 | Intel Core 2 Quad Q6700 | 2x1GB Crucial Ballistix DDR2-1066 5-5-5-15 | MSI nVIDIA GeForce 7300LE

  21. #46
    Xtreme Mentor
    Join Date
    May 2007
    Posts
    2,792
    Quote Originally Posted by MuffinFlavored View Post
    Yeah, wPrime is an awesome benchmark.
    But with the recent fluctuation between versions, I don't know.
    The author commented quite clearly that official stable is still 1.55 and other newer builds having problems are not yet stable and unofficial for the exact reason. They will be improved and quickened up quite a bit before official release build according to him

    As for something which creates heat and computes Pi, check SuperPrime, developed by a fella on here. The best CPU/RAM test I've played with since Linpack; it beats Linpack 32b for stress and heat and shows up errors very quickly. Cheat-preventative, and it picks up system details through CPUZ too [which was a problem for a long time], better than wPrime does. One major downside to it is, it only runs on Intel platforms.

    I have no idea why Charles hasn't updated and made Super Pi more efficient, I know many fellas have complained and advised for it many times over the years. One good reason is, you get to compare from the beginning to the end across all platforms - single threaded, single channel performance. Why Charles doesn't release the source code is one major reason you should be careful with it too -> to limit the cheating.

    No calculation I've come across yet that isn't a specific memory test is RAM-intensive or very sensitive to memory until you get into the high-memory-footprint periods of the code. Something like SysTool Pi is more RAM-intensive than most apps around. PhotoWorxx in EVEREST and WinRAR are about the best I've seen that are quite RAM/FSB-intensive, much more than others anyway. A very good memory subsystem benchmark is STREAM; all professionals incl. hardware firms use it, and yeah, they tend to compete in it. Maybe you can take pointers from that, since IIRC it is coded in C as well as Fortran, the source code is available, and STREAM2 is being developed currently.
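
    (Not STREAM itself, just a minimal triad-style kernel of the kind STREAM times - a[i] = b[i] + q*c[i] over arrays far bigger than cache - to show why such a loop measures memory bandwidth. The array size and repeat count are arbitrary.)
    Code:
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define N    10000000           /* 3 arrays x 80 MB, far bigger than any cache */
        #define REPS 10

        int main(void) {
            double *a = malloc(N * sizeof(double));
            double *b = malloc(N * sizeof(double));
            double *c = malloc(N * sizeof(double));
            long i;
            int rep;
            clock_t t0;
            double secs;

            if (!a || !b || !c) { printf("out of memory\n"); return 1; }
            for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

            t0 = clock();
            for (rep = 0; rep < REPS; rep++)
                for (i = 0; i < N; i++)
                    a[i] = b[i] + 3.0 * c[i];   /* triad: two loads and one store per element */
            secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

            /* roughly 3 doubles = 24 bytes move per element per repetition */
            printf("a[0] = %f, ~%.1f MB/s\n", a[0], (double)REPS * N * 24.0 / secs / 1e6);
            free(a); free(b); free(c);
            return 0;
        }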

    As for Intel/AMD, you should ignore zealots and biased individuals, or those with little understanding of uarch functioning, and neutralize your benchmark to be platform-independent while still remaining consistent across processor generations and core improvements - i.e. it can't be showing a Pentium 4 faster per MHz than Penryn in integer calcs, for instance. Check out Intel's documentation on ILP, TLP and PLP at their website; their software developer community covers quite a lot of information and hints. Check out the Linpack source code for good pointers too, I know this is how many other devs start off building good benchmarks.
    That's what SPEC did when developing the CPU 2006 benchmark, and hence why all professional firms compete and rely on it, since they govern the benchmarking affair officially. Both CPU MFG firms have different strengths, weaknesses and dependencies, as all data analyzers and coders will know. If you use code optimized with instructions which favor AMD, it will win, and if you use the opposite, Intel will win. You can see the instruction and bench types which favor Intel and AMD in this 2.83G Harpertown and 2.3G Barcelona comparison. If you took Gamess, for instance, the AMD CPU would outclass the Intel one. But if you took PovRay, for example, the Intel CPU would outclass the AMD one. Different code, different strengths, no one brush fits all.

    There really is a strong need to get some multi-threaded version that is RAM-intensive out. All CPU MFGs are now moving to decrease data bottlenecks in their core uarchs and buses, and to massively increase cache, inter-core, intra-core and RAM bandwidth, to enable fast computations with improved energy efficiency. The problem is, if the applications and benchmarks around are not coded to take advantage of them - their new instruction sets, their multiple cores, their memory, cache and bus performance - the end result will be useless for showing system performance, and the application might very well run slower per clock on newer uarchs than on older single-threaded ones with little memory bandwidth. Everything is being improved for multithreaded parallelism, so coding needs to take into account compilers/languages, their optimizations, operational semantics, functional languages, extensions, higher-order functions, polymorphism, non-determinism and so on that exploit parallelism best. Maybe have a look into lambda calculus too.

    As for coding, my coding is v.weak now so I'm reserved in what I say, I have no interest in it nor the time. I quit in late 2005 and have not touched it since then apart from Firefox/Thunderbird related coding and whatever comes with its debugging.

    EDIT: this might help you. QPi uses the most efficient algorithm I've tried, although there may well be better implementations of it out there: http://www.geocities.com/tsrmath/pi/piprogs.html#QPI
    Last edited by KTE; 04-10-2008 at 01:52 AM.

  22. #47
    Xtreme Enthusiast
    Join Date
    May 2007
    Posts
    831
    Quote Originally Posted by KTE View Post
    The author commented quite clearly that official stable is still 1.55 and other newer builds having problems are not yet stable and unofficial for the exact reason. They will be improved and quickened up quite a bit before official release build according to him

    As for something which creates heat and computes Pi, check SuperPrime, developed by a fella on here. The best CPU/RAM test I've played with since Linpack; it beats Linpack 32b for stress and heat and shows up errors very quickly. Cheat-preventative, and it picks up system details through CPUZ too [which was a problem for a long time], better than wPrime does. One major downside to it is, it only runs on Intel platforms.

    I have no idea why Charles hasn't updated and made Super Pi more efficient, I know many fellas have complained and advised for it many times over the years. One good reason is, you get to compare from the beginning to the end across all platforms - single threaded, single channel performance. Why Charles doesn't release the source code is one major reason you should be careful with it too -> to limit the cheating.

    No calculation I've come across yet that isn't a specific memory test is RAM-intensive or very sensitive to memory until you get into the high-memory-footprint periods of the code. Something like SysTool Pi is more RAM-intensive than most apps around. PhotoWorxx in EVEREST and WinRAR are about the best I've seen that are quite RAM/FSB-intensive, much more than others anyway. A very good memory subsystem benchmark is STREAM; all professionals incl. hardware firms use it, and yeah, they tend to compete in it. Maybe you can take pointers from that, since IIRC it is coded in C as well as Fortran, the source code is available, and STREAM2 is being developed currently.

    As for Intel/AMD, you should ignore zealots and biased individuals, or those with little understanding of uarch functioning, and neutralize your benchmark to be platform-independent while still remaining consistent across processor generations and core improvements - i.e. it can't be showing a Pentium 4 faster per MHz than Penryn in integer calcs, for instance. Check out Intel's documentation on ILP, TLP and PLP at their website; their software developer community covers quite a lot of information and hints. Check out the Linpack source code for good pointers too, I know this is how many other devs start off building good benchmarks.
    That's what SPEC did when developing the CPU 2006 benchmark, and hence why all professional firms compete and rely on it, since they govern the benchmarking affair officially. Both CPU MFG firms have different strengths, weaknesses and dependencies, as all data analyzers and coders will know. If you use code optimized with instructions which favor AMD, it will win, and if you use the opposite, Intel will win. You can see the instruction and bench types which favor Intel and AMD in this 2.83G Harpertown and 2.3G Barcelona comparison. If you took Gamess, for instance, the AMD CPU would outclass the Intel one. But if you took PovRay, for example, the Intel CPU would outclass the AMD one. Different code, different strengths, no one brush fits all.

    There really is a strong need to get some multi-threaded version that is RAM-intensive out. All CPU MFGs are now moving to decrease data bottlenecks in their core uarchs and buses, and to massively increase cache, inter-core, intra-core and RAM bandwidth, to enable fast computations with improved energy efficiency. The problem is, if the applications and benchmarks around are not coded to take advantage of them - their new instruction sets, their multiple cores, their memory, cache and bus performance - the end result will be useless for showing system performance, and the application might very well run slower per clock on newer uarchs than on older single-threaded ones with little memory bandwidth. Everything is being improved for multithreaded parallelism, so coding needs to take into account compilers/languages, their optimizations, operational semantics, functional languages, extensions, higher-order functions, polymorphism, non-determinism and so on that exploit parallelism best. Maybe have a look into lambda calculus too.

    As for coding, my coding is v.weak now so I'm reserved in what I say, I have no interest in it nor the time. I quit in late 2005 and have not touched it since then apart from Firefox/Thunderbird related coding and whatever comes with its debugging.

    EDIT: this might help you. QPi uses the most efficient algorithm I've tried, although there may well be better implementations of it out there: http://www.geocities.com/tsrmath/pi/piprogs.html#QPI
    Thank you very much for all of this information.

    Hmm, you must be mixing me up with someone else, since as far as I know, I haven't sent anyone anything related to ASM benchmarks, as I don't even know ASM well enough.

    Why don't you want to use external libraries? The clock() in time.h is AWFULLY inaccurate under heavy CPU load (the running time for my plasmas was over 14 seconds, whereas it should have stopped right after 10 seconds had passed), so there is no accurate way of measuring time with it.
    I have found a way to get the value returned by the high-precision timer of the processor.

    Code:
        #include <stdio.h>
        #include <windows.h>

        void benchmark() {
            const unsigned int calculations = 4294967295u; // UINT_MAX: number of loop iterations
            unsigned int a, b, c, d;
            LARGE_INTEGER start, end, ticks;

            a = b = c = d = 0;

            QueryPerformanceFrequency(&ticks);  // counter ticks per second
            QueryPerformanceCounter(&start);

            __asm {                             // MSVC 32-bit inline assembly
                mov edi, calculations

                mov eax, a
                mov ebx, b
                mov ecx, c
                mov edx, d

                $loop:
                    add eax, 1              // four independent 32-bit adds per pass
                    add ebx, 1
                    add ecx, 1
                    add edx, 1

                    dec edi
                    jnz $loop
            }

            QueryPerformanceCounter(&end);

            printf("%0.9f seconds\n", (double)(end.QuadPart - start.QuadPart) / (double)ticks.QuadPart);
        }
    I also apologize; it was spycpu who had sent me all of the information.
    He provided me with a benchmark that does SSE, SSE2, and 3DNow! operations, and returns results in MIPS (Millions of instructions per second)
    Gigabyte P35-DQ6 | Intel Core 2 Quad Q6700 | 2x1GB Crucial Ballistix DDR2-1066 5-5-5-15 | MSI nVIDIA GeForce 7300LE

  23. #48
    Xtreme Mentor
    Join Date
    May 2007
    Posts
    2,792
    Np
    Have a look at this, coding properly for multi-core, multi-processor systems is not easy: http://books.google.co.uk/books?id=q...l=en#PPA198,M1
    Maybe a bit simpler on loop transformations: http://en.wikipedia.org/wiki/Loop_optimization
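
    (Loosely related to the multi-core links above, and not taken from any of them: a bare-bones sketch of splitting one loop across two Windows threads. The workload, thread count and names are arbitrary.)
    Code:
        #include <stdio.h>
        #include <windows.h>

        typedef struct { unsigned int start, end; double partial; } chunk;

        DWORD WINAPI worker(LPVOID arg) {
            chunk *c = (chunk *)arg;
            double s = 0.0;
            unsigned int i;
            for (i = c->start; i < c->end; i++)
                s += 1.0 / (double)(i + 1);        /* arbitrary floating-point work */
            c->partial = s;
            return 0;
        }

        int main(void) {
            const unsigned int n = 100000000u;
            chunk c[2];
            HANDLE h[2];
            int i;

            c[0].start = 0;      c[0].end = n / 2;  c[0].partial = 0.0;
            c[1].start = n / 2;  c[1].end = n;      c[1].partial = 0.0;

            for (i = 0; i < 2; i++)
                h[i] = CreateThread(NULL, 0, worker, &c[i], 0, NULL);
            WaitForMultipleObjects(2, h, TRUE, INFINITE);     /* join both halves */

            printf("sum = %f\n", c[0].partial + c[1].partial);
            for (i = 0; i < 2; i++) CloseHandle(h[i]);
            return 0;
        }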

    This is basics, I expect you've already looked at it but if you haven't it may help: http://softwarecommunity.intel.com/a...s/eng/2589.htm

    And this is what MC/MP specialists say: http://www.electronicsweekly.com/Art...ng-experts.htm
    Last week, Chris Rowen, CEO of multi-processor specialist Tensilica, said: "The challenge of writing software for programming general purpose computing applications is generally recognised in the scientific computing community as the biggest single unsolved, and perhaps unsolvable, computing problem."

    Will Intel and AMD ever get there? "They'll suddenly realise where they've been going wrong and take the right approach and then they'll say they invented it", replied Robertson, "I suspect Intel are doing it already, they're just too embarrassed to admit it."

  24. #49
    Xtreme Mentor
    Join Date
    Mar 2006
    Posts
    2,978
    http://www.benchmarkhq.ru/english.html?/be_cpu.html

    MaxPI is a multithreaded PI calculator, using a different algorithm altogether. It is very interesting to run this on a Phenom (9600 BE, B2) ... core 2 is always slower (by almost 10%)


    Jack
    Attached thumbnail: AMD-790FX-9600BE-2.3G-200x11.5-800-5-MaxPI256K.JPG (241 views, 161.5 KB)
    One hundred years from now It won't matter
    What kind of car I drove What kind of house I lived in
    How much money I had in the bank Nor what my clothes looked like.... But The world may be a little better Because, I was important In the life of a child.
    -- from "Within My Power" by Forest Witcraft

  25. #50
    Xtreme Enthusiast
    Join Date
    Jun 2007
    Location
    Victoria, Australia
    Posts
    948
    Been following, thought I would subscribe.

    Good luck MuffinFlavored, hope it goes well... No requests here, don't know much about it, would like a nice GUI though :P

