The author commented quite clearly that the official stable is still 1.55, and that the newer builds having problems are unofficial for exactly that reason - they aren't stable yet. According to him they will be improved and sped up quite a bit before the official release build ;)
As for something which generates heat and computes Pi, check out SuperPrime, developed by a fella on here. It's the best CPU/RAM test I've played with since Linpack: it beats Linpack 32b for stress and heat and shows up errors very quickly. It's cheat-resistant and picks up system details through CPU-Z [which was a problem for a long time], better than wPrime does. One major downside is that it only runs on Intel platforms. :(
I have no idea why Charles hasn't updated Super Pi and made it more efficient; I know many fellas have complained and asked for it many times over the years. One good reason to leave it alone: you get to compare from the beginning to the end, across all platforms - single-threaded, single-channel performance. The reason Charles doesn't release the source code -> to limit the cheating -> is also one major reason you should be careful with it.
No calculation I've come across that isn't a dedicated memory test is particularly RAM-intensive or RAM-sensitive until you get into the high-memory-footprint stretches of code. Something like SysTool Pi is more RAM-intensive than most apps around. PhotoWorxx in EVEREST and WinRAR are about the best I've seen for RAM/FSB intensity, much more than others anyway. A very good memory subsystem benchmark is STREAM; all professionals incl. hardware firms use it and yeah, they tend to compete in it :p: Maybe you can take pointers from that since, IIRC, it is coded in C as well as Fortran, the source code is available, and STREAM2 is currently in development.
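To give a flavor of what STREAM actually times, here's a toy sketch of its "triad" kernel (`a[i] = b[i] + scalar*c[i]`) in Python. This is just the idea, not the real benchmark - STREAM itself is C/Fortran and runs far closer to hardware limits than interpreted code ever will:

```python
import array
import time

def stream_triad(n=2_000_000, scalar=3.0):
    """Toy STREAM-style triad: a[i] = b[i] + scalar * c[i].

    Returns the result array and a rough MB/s figure. Real STREAM
    repeats the kernel many times and reports the best run; this
    single pass is only to illustrate what gets measured.
    """
    b = array.array('d', [1.0] * n)
    c = array.array('d', [2.0] * n)
    t0 = time.perf_counter()
    a = array.array('d', (b[i] + scalar * c[i] for i in range(n)))
    elapsed = time.perf_counter() - t0
    # three arrays of 8-byte doubles, each touched once
    mbytes_moved = 3 * 8 * n / 1e6
    return a, mbytes_moved / elapsed
```

The point of the triad kernel is that it does almost no arithmetic per byte moved, so the score is dominated by the memory subsystem rather than the ALUs.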
As for Intel/AMD, you should ignore zealots, biased individuals and those with little understanding of how the uarchs work, and neutralize your benchmark so it is platform independent while remaining consistent across processor generations and core improvements - i.e. it can't show a Pentium 4 faster per MHz than a Penryn in integer calcs, for instance. Check out Intel's documentation on ILP, TLP and PLP at their website; their software developer community covers quite a lot of information and hints. Check out the Linpack source code for good pointers too - I know this is how many other devs start off building good benchmarks. ;)
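On the per-MHz point, a simple sanity check when comparing generations is to normalize each result by clock speed before comparing. A minimal sketch (the example timings in the comment are invented purely for illustration):

```python
def perf_per_ghz(runtime_s, clock_ghz):
    """Throughput per GHz of clock for one fixed workload.

    Higher is better. Only meaningful when every CPU runs the
    exact same workload; it factors frequency out so you can
    compare per-clock efficiency across generations.
    """
    return 1.0 / (runtime_s * clock_ghz)

# e.g. a newer core finishing the fixed workload in 20 s at 3.0 GHz
# scores twice the per-clock figure of an older core needing 40 s
# at the same 3.0 GHz - which is the sanity check: a sane benchmark
# should never rank the older core higher per MHz.
```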
That's what SPEC did when developing the CPU2006 benchmark, and hence why all professional firms compete in and rely on it, since SPEC governs the benchmarking affair officially. Both CPU manufacturers have different strengths, weaknesses and dependencies, as all data analysts and coders will know. If you use code optimized with instructions that favor AMD, it will win; use the opposite and Intel will win. You can see the instruction and benchmark types which favor Intel and AMD in this 2.83G Harpertown vs 2.3G Barcelona comparison. Take GAMESS, for instance: the AMD CPU outclasses the Intel one. But take POV-Ray: the Intel CPU outclasses the AMD one. Different code, different strengths, no one brush fits all :yepp:
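One thing worth copying from SPEC is how it aggregates those mixed results: each sub-benchmark's time is turned into a ratio against a reference machine, and the ratios are combined with a geometric mean so no single favorable test can dominate the score. A sketch of that idea (the numbers in the test are made up):

```python
import math

def spec_style_score(times, ref_times):
    """SPEC-style aggregate score.

    Each sub-benchmark contributes ref_time / measured_time (so
    faster-than-reference gives a ratio above 1), and the ratios
    are combined with a geometric mean. With a geometric mean,
    doubling performance on any one test scales the overall score
    by the same factor - an arithmetic mean would instead let one
    outlier benchmark swamp the rest.
    """
    ratios = [ref / t for ref, t in zip(ref_times, times)]
    return math.prod(ratios) ** (1.0 / len(ratios))
```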
There really is a strong need to get a multi-threaded, RAM-intensive benchmark out. All CPU manufacturers are now working to reduce data bottlenecks in their core uarchs and buses, and to massively increase cache, inter-core, intra-core and RAM bandwidth, to enable fast computation with improved energy efficiency. The problem is, if the applications and benchmarks around aren't coded to take advantage of all that - the new instruction sets, the multiple cores, the memory, cache and bus performance - the end result will be useless for showing system performance, and an application may very well run slower per clock on newer uarchs than on older single-threaded ones with little memory bandwidth. Everything is being improved for multi-threaded parallelism, so coding needs to take into account the compilers/languages, their optimizations, operational semantics, functional languages, extensions, higher-order functions, polymorphism, non-determinism and so on that exploit parallelism best. Maybe have a look into lambda calculus too.
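As a structural sketch of what such a multi-threaded memory benchmark looks like - each thread sweeping its own slice of one large buffer - here's a toy version. Big caveat: CPython's GIL serializes pure-Python threads, so this shows the shape of the thing, not real scaling; an actual benchmark would be C with OpenMP/pthreads, as the parallel STREAM runs are:

```python
import threading
import time

def _sweep_chunk(buf, start, stop, out, idx):
    # Walk one slice of the shared buffer; almost no compute,
    # so the cost is dominated by memory traversal.
    out[idx] = sum(buf[start:stop])

def threaded_sweep(mbytes=64, nthreads=4):
    """Toy multi-threaded memory sweep over one shared buffer.

    Returns (checksum, rough MB/s). Under CPython the GIL prevents
    the threads from actually running in parallel, so treat this
    purely as an illustration of the work-partitioning structure.
    """
    buf = bytearray(mbytes * 1024 * 1024)
    chunk = len(buf) // nthreads
    out = [0] * nthreads
    threads = [
        threading.Thread(target=_sweep_chunk,
                         args=(buf, i * chunk, (i + 1) * chunk, out, i))
        for i in range(nthreads)
    ]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - t0
    return sum(out), len(buf) / elapsed / 1e6
```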
As for coding, my own coding is very weak now, so I'm reserved in what I say; I have neither the interest nor the time. I quit in late 2005 and haven't touched it since, apart from Firefox/Thunderbird-related coding and whatever debugging comes with that.
EDIT: this might help you. QPi uses the most efficient algorithm I've tried, although there may well be better implementations of it out there, close to the actual:
http://www.geocities.com/tsrmath/pi/piprogs.html#QPI
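I can't say which algorithm QPi actually implements, but for a flavor of the arctan-series approach this family of Pi programs grew out of, here's a fixed-point sketch of Machin's formula, pi = 16*arctan(1/5) - 4*arctan(1/239), using Python's big integers:

```python
def arctan_inv(x, digits):
    """arctan(1/x) in fixed point with `digits` decimal places,
    plus 10 guard digits to absorb truncation error."""
    one = 10 ** (digits + 10)
    total = term = one // x          # first term: 1/x
    x2, n, sign = x * x, 3, -1
    while term:
        term //= x2                  # next power of 1/x^2
        total += sign * term // n    # +/- term/(2k+1)
        n += 2
        sign = -sign
    return total

def machin_pi(digits):
    """First `digits` decimal places of Pi via Machin's formula,
    returned as an integer (i.e. Pi * 10**digits, truncated)."""
    pi = 4 * (4 * arctan_inv(5, digits) - arctan_inv(239, digits))
    return pi // 10 ** 10            # drop the guard digits

# machin_pi(10) gives 31415926535, i.e. 3.1415926535...
```

Serious Pi programs use much faster machinery (FFT-based multiplication with AGM or Chudnovsky-type series), but the fixed-point structure above is the basic shape the classic arctan implementations share.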