
View Full Version : Optimizing code: Brute force prime number generator



Calmatory
08-08-2010, 09:26 AM
I wrote a small benchmark a few years ago which used the brute force method for finding prime numbers. Back then I knew very little about algorithm and code optimization, so now, with loads more experience under my belt, I decided to give the algorithm another go and see what I've learnt during the past 30 months or so. The unoptimized code resembles my early implementation quite well.



A glance at prime number benchmarks. All runs were written in C and compiled with gcc "4.5.0 20100610 (prerelease) (GCC)" on an Intel Celeron M 420 x86 processor running at 1.6 GHz with 2 GB of RAM (533 MHz DDR2), using 32-bit Arch Linux with Linux kernel 2.6.34-ARCH.

All the runs are compiled with gcc using -Wall and the particular -O<n> flag.
The run times are gathered using the time tool found in practically all Unix-like operating systems (time ./a.out).
Some source code can be found at the bottom.

Using the basic brute force method of finding primes under 131072 (perhaps better known as 128K, 2^17). There are 12251 such primes.

So let's see... The code change for each optimization is shown [like this].

No optimizations:
-O0 92.317 s
-O1 93.001 s
-O2 93.894 s
-O3 94.237 s
-Os 91.467 s
Best: 91.467 s
Worst: 94.237 s
Difference between best and worst: 2.93 %

Only check odd candidate numbers:
[for(i=3;i<num;i+=2)]
-O0 46.160 s
-O1 46.724 s
-O2 46.404 s
-O3 46.990 s
-Os 45.760 s
Best: 45.760 s
Worst: 46.990 s
Difference between best and worst: 2.61 %

Only check numbers against odd divisors:
[for(j=3;j<i;j+=2)]
-O0 23.085 s
-O1 23.348 s
-O2 23.495 s
-O3 23.695 s
-Os 23.078 s
Best: 23.078 s
Worst: 23.695 s
Difference between best and worst: 2.60 %

Only check divisors up to the square root of the number:
[for(j=3;j*j<=i;j++)]
-O0 0.270 s
-O1 0.263 s
-O2 0.260 s
-O3 0.260 s
-Os 0.253 s
Best: 0.253 s
Worst: 0.270 s
Difference between best and worst: 6.29 %


At this point I made a huge mistake, and the rest of the measurements got messed up: I did not apply the "only check numbers against odd divisors" optimization to the code (I had to do some debugging, so I reverted it, and it was left out by accident).

There is still much more to optimize:
- Apply the "only check numbers against odd divisors" optimization
- Break out of the inner loop as soon as the number is proven to be a non-prime
- Move the tmp=0 assignment into the else-clause of the if(tmp==0)
- Store the i and j variables in CPU registers for faster access (register keyword)

With all of these applied I was able to squeeze the running time down to 0.030 s. Quite a leap from the initial ~1.5 minutes. ;)

Here is some of the source code:

Basic brute force method with no optimizations:


for(i = 2; i < max; i++) {
    for(j = 2; j < i; j++) {
        if((i % j) == 0) { tmp = 1; }
    }
    if(tmp == 0) { primes++; }
    tmp = 0;
}


Whole brute force function with all possible optimizations I can think of applied:



unsigned int brute_primes(unsigned int max) {
    register unsigned int i = 0;
    register unsigned int j = 0;
    unsigned int tmp = 0;
    unsigned int primes = 1; /* 2 is a special case, so we start from 3 */
    for(i = 3; i <= max; i += 2) {
        for(j = 3; j * j <= i; j += 2) {
            if((i % j) == 0) { tmp = 1; break; }
        }
        if(tmp == 0) { primes++; } else { tmp = 0; }
    }
    return primes;
}


Well, I admit that this post is a complete mess. My intention was to gather all the possible optimizations and apply them one by one to see the benefits. After I realized half of the measurements were wrong, I decided to post the results as they were and exclude the bad measurements.

Maybe I'll make another post about another prime number algorithm: the sieve of Eratosthenes. The basic implementation can be made quite a bit faster too, mainly by applying the same optimizations, since they apply to that algorithm as well.

Coldon
08-08-2010, 09:52 AM
Be very careful with the break statement; sometimes the cost of stalling the branch predictor is more than letting the loop finish.
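If profiling ever shows the early exit hurting, one way to compare is a branch-free inner loop that accumulates the result instead of breaking out. A sketch (the function name is illustrative; it assumes an odd i >= 3, as in the optimized loop above):

```c
/* Branch-free variant of the inner trial-division loop: instead of
 * breaking on the first divisor found, OR each test result into a
 * flag and always run the loop to completion. This does strictly
 * more work per number; the trade-off is a perfectly predictable
 * loop. Whether it beats the break depends on the CPU and input. */
static int is_composite_nobreak(unsigned int i) {
    unsigned int j, composite = 0;
    for (j = 3; j * j <= i; j += 2)
        composite |= ((i % j) == 0);   /* no conditional branch in the body */
    return (int)composite;
}
```

Only a measurement on the target machine can say which version wins.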

Calmatory
08-08-2010, 10:02 AM
Hmm, never thought of that. Thanks for the info.

I'll try to write some test cases to produce that happening. :)

Coldon
08-08-2010, 12:49 PM
In general, you should trust your compiler; it will generate more optimal code than you would be able to. You must just watch out, since in some cases two completely different pieces of code will generate the same assembly due to optimizations. One area where a programmer can get true performance gains is memory access and cache miss reduction (again related to memory allocation and access).

Also, it is often very difficult to check code optimality by timing alone; try running your code through a profiler. I develop on Windows, so I'm not too familiar with Linux options for profiling, but IIRC valgrind should be perfect.

Calmatory
08-08-2010, 04:04 PM
Valgrind indeed is quite nice. Too bad it can't detect my processor's L2 size, so I'm left without the info in that regard. I'll take a look at a few papers I found through Google regarding optimizing for memory alignment and access times.

W1zzard
08-08-2010, 11:15 PM
Don't optimize like that .. optimize by picking a smarter algorithm .. there are plenty to choose from for prime numbers. This is where the gains are.

As mentioned before in the thread, the compiler is usually smarter than you, so don't waste your time with assembler or similar stuff until you have exhausted the search for smarter algorithms.

Calmatory
08-09-2010, 06:40 AM
True. A great example would be the sieve of Eratosthenes: with it, my machine can sieve primes under ~1 billion in about 55 seconds. With this brute force algorithm I lost patience (after waiting for 2 hours) trying to get the primes under ~1 billion.

Coldon
08-09-2010, 06:50 AM
Something a little off topic: take complexity measures with a pinch of salt. Just because an algorithm runs in linear time, O(n), on paper doesn't mean it will necessarily run like that in code. I've had a situation where an O(n) algorithm ended up looking almost like O(n^2) due to the particular language and implementation.

There is a Donald Knuth quote that is relevant: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
Also my all time favorite quote: "In Theory, theory and practice are the same. In Practice, they never are!"

W1zzard
08-09-2010, 01:21 PM
I've had a situation where a O(n) algorithm ended up almost looking like O(n^2) due to the particular language and implementation.

You are saying that for an O(n) problem, because of language and implementation, running a 10x bigger dataset made the program 100x slower? What did you implement, and in what?

Coldon
08-09-2010, 11:01 PM
It was a pathfinding simulation. I did the prototype using an STL vector for data storage and forgot to change it to an STL list after testing, so whenever the vector got full and needed to grow, the time taken to do so ballooned. When I completed my testing on several parameter values, I noticed what looked like exponential growth in time as they changed. When I realized that, like an idiot, I had forgotten to change the container, I corrected it and reran the test, and now had near constant time across the various input parameters.

Now, going from O(n) to O(n^2) was a bit extreme; I just used it as an example (a grossly exaggerated one). My point was that sometimes a specific approach might not be better even if it looks like it on paper, since on paper data storage and memory traversal don't cost anything. If, for example, you iterate through a massive array randomly compared to in order, the randomly accessed array will obviously be slower due to the cache misses occurring on the CPU, even though both algorithms have the same complexity on paper. So always think about the implementation cost in addition to the listed algorithm complexity.

Also I mentioned language choice since for some managed languages, you have little control over how it deals with memory and garbage collection, or even when garbage collection runs.

poke349
08-10-2010, 10:18 PM
First of all: use the Sieve of Eratosthenes as you mentioned:
http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes

This has one major advantage: it gets rid of the need for divisions.
To find all the prime numbers below N, you create a bitstream of N bits and run the algorithm up to sqrt(N).
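A minimal bit-per-number version of the sieve described above might look like this (the function name and layout are illustrative, and the bound is assumed small enough that i*i fits in an unsigned int):

```c
#include <stdlib.h>

/* Count primes below n with the Sieve of Eratosthenes: no divisions,
 * one bit per candidate. A set bit means "known composite". */
unsigned int sieve_count(unsigned int n) {
    unsigned char *bits = calloc(n / 8 + 1, 1);
    unsigned int i, j, count = 0;
    if (bits == NULL || n < 3) { free(bits); return 0; }
    for (i = 2; i * i < n; i++)
        if (!(bits[i >> 3] & (1u << (i & 7))))
            for (j = i * i; j < n; j += i)      /* cross out multiples of i */
                bits[j >> 3] |= 1u << (j & 7);
    for (i = 2; i < n; i++)
        count += !(bits[i >> 3] & (1u << (i & 7)));
    free(bits);
    return count;
}
```

For the bound used in the opening post, sieve_count(131072) should agree with the brute force count of 12251.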

Here's where the not-so-obvious improvements come in handy.
Due to the strided nature of the memory access, memory bandwidth and cache misses become the bottleneck once your bitstream is larger than your cache.

Now here's a very powerful optimization that you can do (but it isn't easy): iterate multiple primes in the same pass.
So instead of doing a pass with p = 3 and then another pass with p = 5, you do a single pass that crosses out multiples of both 3 and 5.

Take it to an extreme and you'll probably find that you need to iterate 10+ primes per pass to get rid of that memory bottleneck.
After that you'll find that SSE combined with careful loop-unrolling will add additional speed.
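The multiple-primes-per-pass idea can be sketched for just p = 3 and p = 5 on a plain byte array. This is a simplification of what's described above; real implementations batch many primes and work block by block:

```c
/* One forward sweep that crosses out multiples of both 3 and 5,
 * starting at p*p as usual. The two indices are advanced in
 * increasing order, so the array is streamed through memory once
 * instead of once per prime. */
void mark_3_and_5(unsigned char *composite, unsigned int n) {
    unsigned int j3 = 9, j5 = 25;   /* next multiple of each prime */
    while (j3 < n || j5 < n) {
        if (j5 >= n || (j3 < n && j3 <= j5)) { composite[j3] = 1; j3 += 3; }
        else                                 { composite[j5] = 1; j5 += 5; }
    }
}
```

Generalizing this to a whole batch of primes per pass is where the real engineering effort goes.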


Messy, yes. But it's fast if done right. ;) I use the same tactics in y-cruncher. :rolleyes:

Calmatory
08-11-2010, 05:21 AM
True. I'll get through reading K&R and then try to get something out of the sieve of Eratosthenes. Though first I need to make it online (segmented) and then make it threaded. Too bad I only have a single-core processor to work with. :p

Coldon
08-11-2010, 05:34 AM
I really don't like the K&R book, especially since they introduced that horrific indenting scheme. Any programmer using that scheme should be castrated with a rusty spoon. Also, if you want some good optimization books, take a look at Effective C++, More Effective C++ and Modern C++ Design. They are all about C++, but a lot of the concepts carry over.

You should also look at C++ rather than C as a programming language, as it is more robust than C and can offer pretty much the exact same performance. Plus, most modern high-speed math libraries are in C++.

Calmatory
08-11-2010, 06:20 AM
Well, actually I'm a C++ guy, but I never really learnt the fundamentals of C, which got me into an argument over IRC, and I was convinced that first understanding ANSI C and then moving to C++ is the right approach.

K&R isn't nearly as interesting a read as some more recent books I've read (mainly some small books on Ruby, plus Beginning C++ Through Game Programming and C++ Primer Plus 5th ed.). The idea of introducing tons of new stuff in code and then only briefly explaining what it does surely scares many people off, despite being very informative. Oh, and the exercises in K&R seem to be quite valuable too. :)

Coldon
08-11-2010, 06:44 AM
It's a misconception that you need to know C to understand C++; the way things are approached in each language is quite different, and coding techniques differ greatly between the two. Certain ways of doing things in C aren't the correct C++ method, even though they work. A lot of programmers still code in a hybrid; I myself sometimes use the C stdio library in C++ projects instead of the iostream equivalents.

The best introductory book on C++ is Learning C++ in 21 Days by Jesse Liberty; it is an exceptional book and covers the language in detail. A more advanced book that is still great to read is C++ for Game Programmers by Noel Llopis, which covers a lot of advanced topics from an optimization standpoint.

A lot of the K&R techniques aren't really necessary anymore, since compilers have come a long way since then. And trust me on this: the compiler is smarter than you, me and most programmers (no matter how big their e-penis is). A lot of guys have a chip on their shoulder and will always try to prove that they know more and that you are a noob, instead of admitting that they might not know, or taking what you say into consideration.

I fell into the optimization trap myself and tried to get everything to run as fast as possible, and you know what? You never get anything done that way. Rather, get the whole thing working first, then identify the worst performance issues, optimize them, and move downwards from there.

poke349
08-11-2010, 07:47 AM
Yes, don't over-optimize until you know what you're doing.

I actually look at the assembly that the compiler generates, to see what it's doing. (Yes, compilers are smart, but they do stupid things as well...)

cdolphin
08-18-2010, 03:41 PM
I've had a situation where a O(n) algorithm ended up almost looking like O(n^2) due to the particular language and implementation.


I am not sure what you mean by this.
The description "Big O", or the complexity of a function, describes the GROWTH of the function's running time with respect to some data size N; it does not describe the absolute amount of time.

To me, the situation you just described would seem to imply that you were unknowingly invoking a method that is O(n^2).

Coldon
08-18-2010, 11:45 PM
In computer science, Big O is often used to describe the time complexity of an algorithm (its worst-case running time), so by saying that something is O(n) I mean that the time taken grows linearly with N. If I say something is O(n^2), then the time taken grows quadratically with increasing N.

I explained the example in an earlier post: I forgot to change over from a vector to a list. Adding a value to an STL vector is an amortized O(1) operation, but one that can occasionally cost O(n). I know I said O(n^2), but that's called hyperbole; I was simply trying to make the point that time complexity on paper and in reality depends on the implementation, and cannot be assumed.
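The growth-policy point can be made concrete in C (the thread's language). This is an illustrative sketch, not how any STL vector is actually implemented; the names are made up here:

```c
#include <stdlib.h>
#include <string.h>

/* Append n items to a growable array and return the total number of
 * elements copied during reallocation. Doubling the capacity gives
 * fewer than 2n copies in total (amortized O(1) per append), while
 * growing by one element gives about n^2/2 copies (O(n) per append). */
typedef struct { int *data; size_t size, cap; } vec;

size_t append_n(vec *v, size_t n, int doubling) {
    size_t k, copies = 0;
    for (k = 0; k < n; k++) {
        if (v->size == v->cap) {
            size_t newcap = doubling ? (v->cap ? v->cap * 2 : 1) : v->cap + 1;
            int *p = malloc(newcap * sizeof *p);
            if (p == NULL) return copies;
            if (v->data != NULL) {
                memcpy(p, v->data, v->size * sizeof *p);
                copies += v->size;              /* cost of moving the old data */
                free(v->data);
            }
            v->data = p;
            v->cap = newcap;
        }
        v->data[v->size++] = (int)k;
    }
    return copies;
}
```

With n = 1024, doubling copies 1023 elements in total while grow-by-one copies 523776: the same O(n) sequence of appends on paper, wildly different cost in practice.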