Intel compiles Linpack with every optimization it has and supplies it as a binary for single processors. I doubt any of us could optimize it that well even if we had the source. The 32-bit version isn't heavily optimized, since it is a generic build for all x86 chips. The 64-bit binary is highly optimized for EM64T processors with SSE4.1 and above. That is the main difference between the two. 45nm chips will work a little harder than their 65nm counterparts in this test because they have SSE4.1.

Even if you set the large-address-aware (LAA) flag in the binary with a hex editor, the binary was never compiled for more than 2GB of addressing. This may or may not be a problem, but is risking buffer overflows for the sake of another 1GB of memory a wise thing to do in such a precision application? Without knowing whether Intel's source holds its addresses and offsets in signed or unsigned long long ints, you can't tell whether you will introduce problems. If everything that holds an address is unsigned, you should be fine. But I wouldn't use unsigned long long ints myself if I only needed values that a signed long long can contain, because sometimes the flexibility of negative values is necessary.
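
To show the kind of thing that could go wrong, here is a rough sketch in C. It is not Intel's code, and I have no idea how their source actually stores addresses; the function names are made up for illustration. It just compares two 32-bit addresses once as signed values and once as unsigned values, which is the difference that starts to matter as soon as memory above the 2GB line is handed out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helpers, not taken from Linpack.
 *
 * below_signed() mimics code that keeps 32-bit addresses in a signed type.
 * Converting a value above 0x7FFFFFFF to int32_t is implementation-defined
 * in C; on mainstream two's-complement compilers it becomes negative.
 */
bool below_signed(uint32_t addr, uint32_t end)
{
    return (int32_t)addr < (int32_t)end;
}

/* below_unsigned() keeps the addresses unsigned, so ordering is preserved. */
bool below_unsigned(uint32_t addr, uint32_t end)
{
    return addr < end;
}

/* Small self-check that can be called from main(). */
void demo_2gb_boundary(void)
{
    /* Both addresses under 2GB: signed and unsigned agree. */
    assert(below_signed(0x1000u, 0x2000u) == below_unsigned(0x1000u, 0x2000u));

    /* A buffer that starts just under 2GB and ends just over it. */
    assert(below_unsigned(0x7FFFF000u, 0x80001000u));
    assert(!below_signed(0x7FFFF000u, 0x80001000u)); /* end looks negative */
}
```

With both addresses under 2GB the two versions give the same answer, which is why a binary built for 2GB never trips over this. Once a buffer crosses 0x80000000, the signed version decides the start is no longer below the end, and any bounds check or loop built on that comparison misbehaves.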