Quote:
A packed VLIW bundle passed to a discrete block for execution is made up of between 1 and 5 64-bit scalar ALU operations and at most 2 64-bit literal constants, for a maximum length of 7 64-bit words. Control flow instructions are passed as separate 64-bit words assigned to the branch hardware. The 4 identical ALUs can handle 1 FP MAD/ADD/MUL/dot product per clock, or 1 INT ADD/AND/CMP/LSH*_INT (but not MUL!) per clock, to list just a few instructions.
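The bundle-size arithmetic above can be sketched in a few lines. This is only an illustration of the counting rule stated in the quote (1 to 5 ALU operations plus at most 2 literals, each one 64-bit word); the function and constant names are made up for the example and do not correspond to any real ISA tooling.

```python
# Hypothetical helper: counts the 64-bit words in one packed VLIW bundle,
# using only the limits given in the quoted text.
MAX_ALU_OPS = 5    # between 1 and 5 scalar ALU operations per bundle
MAX_LITERALS = 2   # at most 2 literal constants per bundle
WORD_BYTES = 8     # every slot is one 64-bit word


def bundle_words(alu_ops: int, literals: int = 0) -> int:
    """Return the bundle length in 64-bit words, rejecting invalid mixes."""
    if not 1 <= alu_ops <= MAX_ALU_OPS:
        raise ValueError("a bundle holds between 1 and 5 ALU operations")
    if not 0 <= literals <= MAX_LITERALS:
        raise ValueError("a bundle holds at most 2 literal constants")
    return alu_ops + literals


if __name__ == "__main__":
    largest = bundle_words(MAX_ALU_OPS, MAX_LITERALS)
    print(largest)               # largest bundle, in 64-bit words
    print(largest * WORD_BYTES)  # the same bundle, in bytes
```

Control flow words are deliberately left out of the count, since the quote says they are passed separately to the branch hardware.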
For single precision floats, MUL and ADD are 0.5 ULP IEEE using round-to-nearest-even rounding, and MAD is 1 ULP with the same rounding mode. For double precision, as already mentioned, these ALUs are fused in pairs to compute 1 DP MAD per cycle across all 4 of them. Only a limited set of instructions is supported in DP (no transcendentals being the obvious omission), and compliance with IEEE754 is relative: denorms are flushed to 0, only round-to-nearest rounding is supported, and a MAD produces different results from MUL+ADD due to rounding. Integer and float instructions can't be processed in parallel.
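To see how a MAD can differ from MUL+ADD purely because of rounding, here is a small CPU-side sketch. It assumes, for illustration only, that the combined operation rounds once at the end while the separate path rounds after the multiply and again after the add; the quote does not say exactly how the hardware rounds its DP MAD, so treat this as an analogy. The single-rounding result is emulated with exact rational arithmetic rather than any hardware instruction.

```python
from fractions import Fraction


def mul_then_add(a: float, b: float, c: float) -> float:
    """Two roundings: the product is rounded, then the sum is rounded."""
    return a * b + c


def single_rounding_mad(a: float, b: float, c: float) -> float:
    """One rounding: compute a*b+c exactly, then round once to a double.

    Fraction arithmetic is exact, and converting the final Fraction to
    float rounds to nearest, so this emulates a single-rounding MAD.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


if __name__ == "__main__":
    a = b = 1.0 + 2.0 ** -30   # exactly representable in a double
    c = -(1.0 + 2.0 ** -29)    # cancels the rounded product exactly

    # The exact product is 1 + 2**-29 + 2**-60; the 2**-60 term is lost
    # when the product is rounded to a double on its own.
    print(mul_then_add(a, b, c))         # low-order term already rounded away
    print(single_rounding_mad(a, b, c))  # low-order term survives
```

The inputs are chosen so that the only surviving information is the bit that the intermediate rounding throws away, which is the worst case for this kind of discrepancy.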
The transcendental unit (the Rys unit!) is different from its more silhouette-conscious brethren: it's (surprisingly!!!) capable of handling transcendentals (cos, sin, log, exp, rcp et al.) at a rate of 1/cycle, as well as format conversions and INT MUL, the latter thanks to a slightly higher internal precision (40-bit versus 32-bit, allowing expression of a 32-bit int in the FP exponent) than the other ALUs. It cannot process dot products or double precision work, so it sits idle while double precision processing is happening.
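One way to read the precision remark is that a standard 32-bit float path cannot hold every 32-bit integer exactly, which would make it a poor home for integer multiplies. The sketch below only demonstrates that general property of IEEE 754 single precision on a CPU; it makes no claim about how the 40-bit internal format is actually laid out, and the helper name is made up for the example.

```python
import struct


def to_float32(value: int) -> float:
    """Round an integer to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]


if __name__ == "__main__":
    exact_limit = 2 ** 24  # single precision carries a 24-bit significand

    # Integers up to 2**24 survive the round trip unchanged...
    print(to_float32(exact_limit) == exact_limit)

    # ...but the very next integer does not fit and is rounded away.
    print(to_float32(exact_limit + 1) == exact_limit + 1)
    print(int(to_float32(exact_limit + 1)))
```

Any integer wider than 24 significant bits may be altered in the same way, so a 32-bit integer result needs more than 32-bit float precision to pass through a floating-point datapath intact.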