The CPUID instruction in modern PC microprocessors works in an awkward way. Originally, the CPUID gave 4 bits for family number and 4 bits for model number. This means that the maximum numbers are family 15 and model 15. When these numbers were exhausted, they added 4 more bits for the model number. The new 4 bits are concatenated with the old 4 bits to make an 8-bit number, so the maximum value for model number is 255, or FF hexadecimal.
It would be logical to do the same with the family number, but instead they have added 8 more bits called "extended family". The new 8 bits are not concatenated with the old 4 bits to make a 12-bit number. Instead they have specified that we must calculate the sum of the old 4-bit "family" number and the new 8-bit "extended family" number. This means that the same family number can be specified in more than one way - and I think I know why!
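To make the two rules concrete, here is a minimal sketch in C that decodes the EAX value returned by CPUID leaf 1. It follows the description above literally: the model bits are concatenated and the family numbers are added. The vendor manuals attach conditions to when the extended fields are consulted, and this sketch does not model those. The names `cpuid_signature`, `split_signature`, `effective_model` and `effective_family` are made up for illustration.

```c
#include <stdint.h>

/* Fields of EAX as returned by CPUID leaf 1. */
typedef struct {
    unsigned base_family;   /* bits 11:8,  the old 4-bit family */
    unsigned ext_family;    /* bits 27:20, the new 8-bit field  */
    unsigned base_model;    /* bits 7:4,   the old 4-bit model  */
    unsigned ext_model;     /* bits 19:16, the new 4-bit field  */
} cpuid_signature;

/* Split the raw EAX value into its four fields. */
cpuid_signature split_signature(uint32_t eax) {
    cpuid_signature s;
    s.base_model  = (eax >> 4)  & 0xFu;
    s.base_family = (eax >> 8)  & 0xFu;
    s.ext_model   = (eax >> 16) & 0xFu;
    s.ext_family  = (eax >> 20) & 0xFFu;
    return s;
}

/* Model: the new 4 bits are concatenated above the old 4 bits. */
unsigned effective_model(uint32_t eax) {
    cpuid_signature s = split_signature(eax);
    return (s.ext_model << 4) | s.base_model;
}

/* Family: the two fields are added, not concatenated.
   (Unconditional sum, as described in the text; an assumption
   of this sketch rather than the vendors' exact rule.) */
unsigned effective_family(uint32_t eax) {
    cpuid_signature s = split_signature(eax);
    return s.base_family + s.ext_family;
}
```

With this sketch, the EAX values 0x00100F42 (old family 15, extended family 1) and 0x00A00600 (old family 6, extended family 10) both decode to family 16, which is the ambiguity discussed here.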
Here's my theory:
Intel have made a compiler to support the ever-growing extensions to the instruction set. The Intel compiler puts a CPU-dispatcher into your code to check whether the CPU supports the SSE2, SSE3, SSE4 or whatever instruction set. The compiled program can contain more than one version of critical parts of the code, and the CPU-dispatcher automatically chooses the version that fits the available instruction set. So far so good. The problem is that the CPU-dispatcher makes its choice based on family numbers and not only on the feature bits that tell whether SSE2 etc. is supported. And it will not recognize unknown family numbers. The consequence is that any future Intel CPU with a family number different from 6 or 15 will not be recognized and will run with all SSE instruction sets disabled or will not run at all. There is a lot of software on the market that is compiled with the Intel compiler. All this software would fail to run efficiently on a new Intel CPU with a family number different from 6 unless it is recompiled. The CPU-dispatcher checks only the old 4-bit family number. So Intel can set the old family number to 6 in order to fool the CPU-dispatcher and then set the extended family number to e.g. 10 to make the sum 16, or whatever number the marketing department dictates.
So the awkward implementation of the CPUID instruction is to make up for a serious blunder made by the people who designed the Intel compiler.
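The difference between the two dispatching strategies can be sketched in C as follows. This is a hypothetical illustration of the behaviour I describe, not the actual code of the Intel dispatcher. The function names and the `code_path` enum are invented; the feature bit positions are the ones CPUID leaf 1 reports for SSE2, SSE3 and SSE4.1.

```c
#include <stdint.h>

typedef enum { PATH_GENERIC, PATH_SSE2, PATH_SSE3, PATH_SSE41 } code_path;

/* Feature bits reported by CPUID leaf 1. */
#define EDX_SSE2   (1u << 26)
#define ECX_SSE3   (1u << 0)
#define ECX_SSE41  (1u << 19)

/* Dispatch on feature bits only: pick the highest level
   for which this and every lower level is reported. */
code_path choose_by_features(uint32_t ecx, uint32_t edx) {
    if (!(edx & EDX_SSE2))  return PATH_GENERIC;
    if (!(ecx & ECX_SSE3))  return PATH_SSE2;
    if (!(ecx & ECX_SSE41)) return PATH_SSE3;
    return PATH_SSE41;
}

/* Hypothetical family-gated dispatcher: an unknown value in the
   old 4-bit family field falls back to the generic path, no matter
   what the feature bits say. The list of known families (6 and 15)
   is taken from the text above. */
code_path choose_by_family(unsigned base_family, uint32_t ecx, uint32_t edx) {
    if (base_family != 6 && base_family != 15)
        return PATH_GENERIC;
    return choose_by_features(ecx, edx);
}
```

A CPU that reports old family 7 with all SSE feature bits set would get the generic path from `choose_by_family`, while `choose_by_features` would give it the SSE4.1 path. Keeping the old family field at 6 avoids the fallback.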
Funny that AMD have accepted this scheme, but they probably had no choice. BTW, the CPU-dispatcher in the Intel compiler also checks the brand name in the CPU and disables all SSE extensions if the brand is anything but Intel. See http://www.agner.org/optimize/optimizing_cpp.pdf for how to circumvent this and make code compiled with the Intel compiler work on AMD processors.
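For reference, a brand check of this kind can be sketched in C as below. CPUID leaf 0 returns the 12-character vendor string in the registers EBX, EDX and ECX, in that order. The function names are hypothetical and the gate is only my illustration of the kind of test described, not the dispatcher's real code.

```c
#include <stdint.h>
#include <string.h>

/* Unpack one 32-bit register into four characters, low byte first. */
static void unpack4(uint32_t reg, char *dst) {
    for (int i = 0; i < 4; i++)
        dst[i] = (char)((reg >> (8 * i)) & 0xFFu);
}

/* Assemble the vendor string from the CPUID leaf 0 registers.
   Note the register order: EBX, then EDX, then ECX.
   out must have room for 12 characters plus the terminator. */
void vendor_string(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13]) {
    unpack4(ebx, out);
    unpack4(edx, out + 4);
    unpack4(ecx, out + 8);
    out[12] = '\0';
}

/* Hypothetical vendor gate: returns 1 only for "GenuineIntel". */
int is_genuine_intel(uint32_t ebx, uint32_t edx, uint32_t ecx) {
    char v[13];
    vendor_string(ebx, edx, ecx, v);
    return strcmp(v, "GenuineIntel") == 0;
}
```

A dispatcher that calls something like `is_genuine_intel` before looking at the feature bits will reject an AMD processor ("AuthenticAMD") even when it reports SSE2 or SSE3 support.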
I just tried Intel C++ compiler version 10.1 with option /QxO as you suggested. It generates the following versions of code for common mathematical functions: SSE2, SSE3, SSE4.1 and non-Intel SSE2. It doesn't work on any CPU prior to SSE2.
This is the only compiler option that makes it run reasonably on an AMD, but why are there two different SSE2 versions, one for Intel and one for AMD? When I hack the CPU-dispatcher and make it believe that it is running on an Intel, it runs 50-100% faster. This means that the Intel-SSE2 version is faster than the AMD-SSE2 version when running on an AMD processor!
There are also options that work on any processor, for example /QaxB. This option runs non-vectorized SSE2 code on Intel processors and old 8087 code on AMD processors. I measured this to be 5-10 times slower than the /QxO option on an AMD Opteron.