If you went to the trouble of optimizing the performance of this math-intensive app, it seems counterproductive to ignore the huge benefits from SSE/SSE2 that the compiler can give you almost for free.

Presumably only a few inner loops really needed turboing, and runtime selection between code paths wouldn't make much of a dent in the 650KB. It's not as if it would require significant QA either.
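For what it's worth, here is a rough sketch of what I mean by runtime selection, in C. Everything in it is my own illustration, not the app's code: the `scale` loop, the function names, and the choice of GCC/Clang's `__builtin_cpu_supports` for detection are all assumptions. It checks the CPU once, then routes the hot loop to either an SSE version or a plain scalar fallback.

```c
#include <stddef.h>

/* Assumption: GCC or Clang on x86; elsewhere only the scalar path is built. */
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <emmintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

/* Hypothetical hot loop: multiply every element by k. Scalar fallback. */
static void scale_scalar(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

#ifdef HAVE_X86_DISPATCH
/* Same loop, four floats at a time; only this function is built with SSE2. */
__attribute__((target("sse2")))
static void scale_sse(float *dst, const float *src, float k, size_t n)
{
    const __m128 vk = _mm_set1_ps(k);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(dst + i, _mm_mul_ps(_mm_loadu_ps(src + i), vk));
    for (; i < n; i++)          /* leftover elements */
        dst[i] = src[i] * k;
}
#endif

typedef void (*scale_fn)(float *, const float *, float, size_t);

/* Pick an implementation based on what the CPU reports at run time. */
static scale_fn pick_scale(void)
{
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return scale_sse;
#endif
    return scale_scalar;
}

/* Public entry point: resolves the code path on first call, then reuses it. */
void scale(float *dst, const float *src, float k, size_t n)
{
    static scale_fn impl = NULL;
    if (!impl)
        impl = pick_scale();
    impl(dst, src, k, n);
}
```

The extra cost is one function pointer and a second copy of each hot loop, which is why I doubt it would show up in the binary size. The lazy initialization here is not thread-safe as written; a real app would probably resolve the pointer once at startup.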

Such is the PHB.