This is all rubbish, I tested it myself. The speed difference between Conroe and Opteron is about the same in 64-bit code as it is in 32-bit code. There is no weakness in EM64T in Conroe.
I had the info here and why not give the full cut'n'paste just in case:
http://www.xtremesystems.org/forums/...d.php?t=104646
[begin copy'n'paste]
I ran some 64 bit benchmarks in addition to my 32bit benchmarks.
In my 32 bit FreeBSD tests for C/C++ compilation a 2.40 GHz Conroe with DDR533 is 5-8% faster than a 2.60 GHz Opteron with DDR432 (*), or about 14-17% faster per MHz.
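For anyone who wants to check the per-MHz figure: a minimal sketch of the conversion, assuming "faster per MHz" simply means the raw speedup scaled by the clock ratio. The function name and parameters are made up for illustration, they are not from the original benchmarks.

```python
def per_clock_speedup(raw_speedup, clock_ghz, other_clock_ghz):
    """Scale a raw speedup by the clock ratio to get a per-clock speedup.

    raw_speedup: 1.05 means "5% faster" at the clocks actually used.
    clock_ghz: clock of the CPU the speedup is quoted for.
    other_clock_ghz: clock of the CPU it is compared against.
    """
    return raw_speedup * other_clock_ghz / clock_ghz


# 2.40 GHz Conroe vs 2.60 GHz Opteron, 5-8% faster in raw terms.
low = per_clock_speedup(1.05, 2.40, 2.60)
high = per_clock_speedup(1.08, 2.40, 2.60)
print(f"{(low - 1) * 100:.0f}% to {(high - 1) * 100:.0f}% faster per clock")
```

This treats performance as linear in clock speed, which is only a rough approximation once memory speed differs between the two boxes.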
Running some similar C/C++ compilations on 64 bit Linux (all software 64 bit), the 2.40 GHz Conroe is 18-25% faster than a 2450 MHz Opteron.
I'd say the lack of 64 bit performance on Conroe is a myth.
This is preliminary. I have problems controlling the Conroe's clocks from Linux, I am not 100% sure I am actually at 2.4 GHz here. Damn BadAxe...
(*) Before anybody cries foul: the 5-8% of the 2.4 Conroe versus 2.6 Opteron are only for C/C++ compilation, Conroe goes to 20% advantage for scripting languages and Lisp and to 25-30% for low quality video. The numbers here are purely for C/C++ compilation. A full 64 bit version of my benchmarks is in the works.
%%
I ran some other not to be named stuff and I see more Conroe kicking.
I have two cases of Conroe being 100% faster per clockspeed (3.66 GHz DC Conroe three times faster than 2.4 GHz dual single-core Opteron - which has slow RAM).
One case is 32 bit code and one is 64 bit code, so that is really a non-issue.
The performance characteristic of the 32 bit case is very weird. If you are familiar with Common Lisp programs, you can run Common Lisp programs in "fast mode" and "safe mode". Safe mode adds all kinds of checks: integer overflow, extra type checking, argument checking etc. Without the extra checks the Conroe is 50% faster per clockspeed, with the extra checks it goes towards 100%. It is almost like you get the checks for free.
The 64 bit case of 100% faster is just another random C/C++ compilation of a huge tree, and I currently don't have an explanation why the difference is bigger than in my own benchmarks. The tree in question is certainly template-heavy, maybe it is the bigger cache.
%%
I am running out of time to analyse some of the more interesting/puzzling results and won't come to conclusions before the 4th of July weekend.
I'm throwing some raw data at you, numbers are CPU time used (this is single-thread):
Code:
Lisp build:
                 fast  safe   (fast/safe = target policy)
2.4  939-DC-Opt   505   826   (memory 218 MHz)
2.6  939-DC-Opt   469   731   (memory 216 MHz)
2.4  940-SC-Opt   537   846
2.4  Conroe       328   439
3.66 Conroe       222   296
This is compiling/building a Common Lisp system. (BTW, Netburst completely sucks here, I'm glad that episode is over.)
You can build the program two ways: safe-mode and fast-mode. The safe-mode build will insert a lot of extra checks like type checks, integer overflow checking, function parameter counting etc.
As you can see, the Conroe completely spanks the AMD64s in the safe-mode build, more so than in the fast-mode build. That is puzzling, I have never in my 15 years of Lisp programming seen anything like this. Suddenly you get the insertion of the extra checks almost for free, whereas so far you had to weigh getting the checks against twiddling your thumbs for a while during the build.
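To put a number on "almost for free", here is a rough sketch that takes the raw CPU times from the table above and computes how much extra time the safe-mode build costs on each machine. The dictionary layout and the name `overhead` are just for illustration. By this measure the Opterons pay somewhere around 56-64% extra for the checks, the Conroe only about a third.

```python
# Raw CPU times (fast-mode, safe-mode) copied from the Lisp build table.
cpu_time = {
    "2.4 939-DC-Opt": (505, 826),
    "2.6 939-DC-Opt": (469, 731),
    "2.4 940-SC-Opt": (537, 846),
    "2.4 Conroe": (328, 439),
    "3.66 Conroe": (222, 296),
}

# Safe-mode overhead: extra CPU time relative to the fast-mode build.
overhead = {name: safe / fast - 1 for name, (fast, safe) in cpu_time.items()}

for name, extra in overhead.items():
    print(f"{name:16s} safe-mode costs {extra:.0%} more CPU time")
```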
Here is a relative index (CPU time multiplied by the clock in GHz, so lower is better), the first four columns giving you:
- performance/MHz fast-mode
- (same, relative to 2.4 Opteron)
- performance/MHz safe-mode
- (same, relative to 2.4 Opteron)
Code:
   fast  (rel)      safe  (rel)                    fast  safe
------------------------------------------------------------
 1212.0  (1.0)    1982.4  (1.0)   2.4  939-DC-Opt   505   826
 1219.4  (1.0)    1900.6  (1.0)   2.6  939-DC-Opt   469   731
 1288.8  (0.9)    2030.4  (1.0)   2.4  940-SC-Opt   537   846
  787.2  (1.5)    1053.6  (1.9)   2.4  Conroe       328   439
  812.5  (1.5)    1083.4  (1.8)   3.66 Conroe       222   296
That means Conroe reaches a 50% advantage over AMD64 per clockspeed for the fast-mode build but goes up to 90% advantage for the safe-mode build.
Possible explanations include:
- It's the larger cache. Since this compilation is single-thread the AMD64 has 1 MB, the Conroe has between 2 and 4 MB depending on how it is associated with the one core running the compiler
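The index table can be reproduced from the raw CPU times. The sketch below assumes the index is CPU time multiplied by the clock in GHz (so lower is better) and that the "rel" column is the 2.4 GHz 939 Opteron's index divided by each row's index. That is inferred from the numbers, and the function name is made up.

```python
# (label, clock in GHz, fast-mode CPU time, safe-mode CPU time) from the raw data.
rows = [
    ("2.4 939-DC-Opt", 2.4, 505, 826),
    ("2.6 939-DC-Opt", 2.6, 469, 731),
    ("2.4 940-SC-Opt", 2.4, 537, 846),
    ("2.4 Conroe", 2.4, 328, 439),
    ("3.66 Conroe", 3.66, 222, 296),
]


def relative_index(rows, baseline=0):
    """Return (label, fast_idx, fast_rel, safe_idx, safe_rel) for each row.

    The index is clock * CPU time, so a lower index means more work per clock.
    The rel value is baseline index / row index, so above 1.0 beats the baseline.
    """
    _, base_ghz, base_fast, base_safe = rows[baseline]
    base_fast_idx = base_ghz * base_fast
    base_safe_idx = base_ghz * base_safe
    out = []
    for label, ghz, fast, safe in rows:
        fast_idx = ghz * fast
        safe_idx = ghz * safe
        out.append((label, fast_idx, base_fast_idx / fast_idx,
                    safe_idx, base_safe_idx / safe_idx))
    return out


table = relative_index(rows)
for label, fast_idx, fast_rel, safe_idx, safe_rel in table:
    print(f"{fast_idx:7.1f} ({fast_rel:.1f}) {safe_idx:7.1f} ({safe_rel:.1f})  {label}")
```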
- Some parallel execution unit in the Conroe picks up the extra instructions to insert the safety checks
- We get the extra instructions cheaper than on AMD64 due to some clever rescheduling in Core2
- The extra instructions to insert the safety check in the compiler normally blow memory prefetch in AMD64 and Netburst but Core2's prefetch is clever enough to find the stuff in advance
- I am doing something wrong here (not likely, I re-ran it a few times)
- All of the above
I need to run a few programs that report cache hits and the like but I
don't think any of the performance counter programs that exist are
running on Conroe yet.