Quote Originally Posted by Apokalipse
[snip]
You literally have no idea what you are talking about. I was going to bow out of this pointless discussion, but I don't want your misinformation poisoning good minds like poor Manicdan's. That is how the game of internet telephone starts. I haven't programmed in assembly in a REALLY long time, but I figure anything is better than letting such obvious misinformation spread.

Firstly, there are no instructions that are 256 bits long. x86 instructions can be up to 15 bytes in length, and the size varies with the prefixes, registers, operands, and so on, so most are smaller, sometimes significantly so.

If you don't believe me feel free to reference the AMD64 Architecture Programmer's Manual, Volume 3

On page 1:
An instruction can be between one and 15 bytes in length. Figure 1-1 shows the byte order of the
instruction format.
The AMD64 Architecture Programmer's Manual, Volume 4: 128-Bit and 256-Bit Media Instructions includes the same diagram and references the previous volumes.
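
Just to give a feel for the spread, here are a few encodings written out by hand in C. These byte patterns are from memory, so treat them as my own reconstruction and check them against the manual rather than taking them as gospel; the only point is that lengths range from 1 byte on up.

Code:
/* Illustration of variable x86 instruction lengths. Byte patterns written
   from memory; verify against the manual before relying on them. */
#include <stdio.h>

int main(void)
{
    const unsigned char nop[]       = { 0x90 };                          /* nop        (1 byte)   */
    const unsigned char ret[]       = { 0xC3 };                          /* ret        (1 byte)   */
    const unsigned char mov_eax_1[] = { 0xB8, 0x01, 0x00, 0x00, 0x00 };  /* mov eax, 1 (5 bytes)  */
    const unsigned char mov_rax_1[] = { 0x48, 0xB8, 0x01, 0x00, 0x00,    /* mov rax, 1 (10 bytes, */
                                        0x00, 0x00, 0x00, 0x00, 0x00 };  /* 64-bit immediate)     */

    printf("nop        : %zu byte(s)\n", sizeof nop);
    printf("ret        : %zu byte(s)\n", sizeof ret);
    printf("mov eax, 1 : %zu byte(s)\n", sizeof mov_eax_1);
    printf("mov rax, 1 : %zu byte(s)\n", sizeof mov_rax_1);
    return 0;
}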

So if instructions aren't defined to be 32, 64, 128, or 256 bits in length, then what does it mean to have a 256-bit instruction?

They both say the following in the definitions section:
256-bit media instructions
Instructions that use the 256-bit YMM registers.
Essentially, 64-bit/128-bit/256-bit instructions are instructions that add support for larger registers, new memory addressing modes, new capabilities, and so on. When the processor breaks one of them down into one or two macro-ops, it isn't because the instruction is actually too big for a single unit to handle. It does this because macro-ops have a simpler, fixed-length encoding that the processor can operate on more efficiently, among other things. A single x86 instruction might become two macro-ops because it is actually telling the processor to do several steps that can be handled by different execution units.
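
To make that concrete, here is a quick C sketch I threw together (mine, not anything out of the manuals) using the AVX intrinsics from immintrin.h. The _mm256_add_ps call compiles to a single vaddps ymm, ymm, ymm whose encoding is only a few bytes, yet it is a "256-bit media instruction" in the manual's sense because it operates on the 256-bit YMM registers:

Code:
/* Build with something like: gcc -O2 -mavx avx_add.c */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float in[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    float out[8];

    __m256 a = _mm256_loadu_ps(in);     /* load 8 floats into a YMM register    */
    __m256 b = _mm256_set1_ps(10.0f);   /* broadcast 10.0 across a YMM register */
    __m256 c = _mm256_add_ps(a, b);     /* one vaddps: "256-bit" describes the
                                           registers it touches, not its length */
    _mm256_storeu_ps(out, c);

    for (int i = 0; i < 8; i++)
        printf("%.1f ", out[i]);
    printf("\n");
    return 0;
}

And on a Family 15h part, that single vaddps is exactly the kind of instruction the decoder turns into two macro-ops, which is the FastPath Double case quoted from the optimization guide further down.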

Speaking of being handled by different execution units, some instructions can be handled by any one of several units while others have to be handled by a specific unit. This is just as true for Bulldozer as it is for past processors.

Please reference the Software Optimization Guide for AMD Family 15h [i.e., Bulldozer] Processors.

In the integer core there are 4 pipelines: 2 INT and 2 AGU. But the units are not identical; some instructions have to be processed by a particular unit. Examine the diagram on page 36 to see why. Refer to table 10 and you will find lots of instructions that can be done by either INT pipe, such as ADD or PUSH, while other instructions are tied to a specific pipe, such as DIV on pipe 0 and MUL on pipe 1.
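
Here is a toy C sketch of why that matters (my example and my assumptions about what the compiler emits, not anything out of the guide): two independent chains of adds can spread across both INT pipes, while divides all have to queue up behind the one pipe that has the divider.

Code:
/* Toy sketch only; how it schedules depends on the compiler and the chip. */
#include <stddef.h>
#include <stdint.h>

/* Two independent ADD chains: either INT pipe can take either chain,
   so the two adds can go down both pipes in the same cycle. */
uint64_t add_two_chains(const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t s0 = 0, s1 = 0;
    for (size_t i = 0; i < n; i++) {
        s0 += a[i];
        s1 += b[i];
    }
    return s0 + s1;
}

/* Two independent DIV chains: hardware divide lives on one specific pipe,
   so these cannot spread across both INT pipes the way the adds can.
   (Divisors are forced odd only to avoid dividing by zero.) */
uint64_t div_two_chains(const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t s0 = 0, s1 = 0;
    for (size_t i = 0; i < n; i++) {
        s0 += a[i] / (b[i] | 1);
        s1 += b[i] / (a[i] | 1);
    }
    return s0 + s1;
}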

This brings us to your claim that either half of the FPU can execute its own AVX instruction simultaneously. This is wrong. From page 38 of the optimization guide:

Only 1 256-bit operation can issue per cycle, however an extra cycle can be incurred as in the case
of a FastPath Double if both micro ops cannot issue together.
In other words, the entire FPU can issue only one 256-bit AVX operation in any given cycle, and that operation can be delayed further if it decodes into two macro-ops and one of them needs a pipe currently in use by another instruction. Examine figure 3 on the same page and table 8 on page 232 and you will see why this is the case.
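
To show what that constraint actually applies to, here is the same 8-float add written both ways (again my own sketch): the first function is one 256-bit vaddps, which per the quote above issues as a FastPath Double and of which only one can issue per cycle; the second is two independent 128-bit adds, each a single macro-op the FPU scheduler can place on its own pipe.

Code:
/* Sketch only; build with gcc -O2 -mavx. */
#include <immintrin.h>

/* One 256-bit vaddps for all 8 floats. */
void add8_one_256bit_op(float *dst, const float *x, const float *y)
{
    _mm256_storeu_ps(dst, _mm256_add_ps(_mm256_loadu_ps(x),
                                        _mm256_loadu_ps(y)));
}

/* The same work as two independent 128-bit adds. */
void add8_two_128bit_ops(float *dst, const float *x, const float *y)
{
    _mm_storeu_ps(dst,     _mm_add_ps(_mm_loadu_ps(x),     _mm_loadu_ps(y)));
    _mm_storeu_ps(dst + 4, _mm_add_ps(_mm_loadu_ps(x + 4), _mm_loadu_ps(y + 4)));
}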

Just as with the integer units, the FPU pipes have different capabilities. Some instructions can be done on one of several available pipes, while others need a specific pipe because it is the only one with that capability: shuffles go to pipe 1, AVX FPMAL to pipe 2 or 3, AVX FPFMA to pipe 0 only, and so on. Refer to table 12 for further examples.
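
For a flavor of what lands where, here is a small mix (my sketch; the pipe assignments are the ones claimed above from table 12): a shuffle and an FP multiply on independent data can end up on different FPU pipes, so both can be in flight at once.

Code:
/* Sketch only; build with gcc -O2 -mavx (or plain -O2 for the SSE forms). */
#include <immintrin.h>

void reverse_and_scale(float *dst, const float *src, float k)
{
    __m128 v   = _mm_loadu_ps(src);
    __m128 rev = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));      /* shuffle pipe */
    __m128 scl = _mm_mul_ps(_mm_loadu_ps(src + 4), _mm_set1_ps(k));  /* FMAC pipe    */
    _mm_storeu_ps(dst, rev);
    _mm_storeu_ps(dst + 4, scl);
}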

The FP unit needs all of these execution units to execute the whole instruction set. It is an entire unit with its own scheduler, retire logic, and so on. Taking out "half" of it would require a redesign of the unit. So no, the FPU isn't two independent units connected by "reverse hyperthreading". It is a full floating-point unit in its own right that can do work for either of the two threads, one from each of the module's integer cores.