Quote: Originally posted by Hans de Vries
It's no coincidence that the architects behind the very long SIMD words
(256 bit, 512 bit and longer) are Doug Carmean and Eric Sprangle who joined
Intel from Ross Technology.

These are exactly the Hyperpipelining specialists at Intel:

(1) They co-authored the original hyperpipelining paper:
Increasing Processor Performance by Implementing Deeper Pipelines

(2) They led the original ~60 stage hyperpipelined Nehalem project.
http://www.theinquirer.net/inquirer/...em-slated-2005

(3) They initiated the Larrabee project. One of the main ideas behind
Larrabee is to achieve a theoretical maximum number of FLOPs on a
certain die with a limited number of transistors. A fourfold hyperpipelined
128 bit unit running at 4.8 GHz can produce 512 bit results at 1.2 GHz
using only 25%(+a bit) of the transistors of a non hyperpipelined unit.
ftp://download.intel.com/technology/...abee_paper.pdf
http://www.drdobbs.com/high-performa...ting/216402188
Your post is pure speculation. You really don't know who actually worked on the architecture of SB. And even if we assume they were the same people who worked on Netburst, you still have no information at all about how AVX was really implemented. Let me remind you that "real" 128 bit SSE was first implemented by the Haifa team, which is currently working on SB.
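
For what it's worth, the raw throughput arithmetic in the quoted point (3) is at least self-consistent. Here is a back-of-envelope sketch, assuming one result per clock per unit. The function name is made up for illustration, and none of this says anything about how the hardware is actually built or how many transistors it needs:

```python
# Back-of-envelope check of the quoted numbers; not a model of any real hardware.
# Assumption: each unit delivers one result of its full width every clock.

def throughput_mbit_per_s(width_bits: int, clock_mhz: int) -> int:
    """Peak result bits per second, in Mbit/s, for one result per clock."""
    return width_bits * clock_mhz

narrow_fast = throughput_mbit_per_s(128, 4800)  # 128 bit unit at 4.8 GHz
wide_slow = throughput_mbit_per_s(512, 1200)    # 512 bit unit at 1.2 GHz
width_ratio = 128 / 512                         # datapath width ratio, the "25%" in the quote

print(narrow_fast, wide_slow, width_ratio)
```

Both come out to the same peak rate, so the claim only stands or falls on whether the narrow unit can really be clocked four times faster at that cost.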


The SIMD units are the easiest (of all units) to hyperpipeline. All instructions
which could cause problems for hyperpipelining have been systematically
left out of the AVX and LNI specifications. (for instance data shuffles
crossing 128 bit boundaries)
Your assumptions are not necessarily true. While a general multiplication algorithm looks relatively easy to serialize, that is not necessarily the case in real life, since multiple heuristics may be added to a hardware algorithm to make it use less power or area, run faster, etc. I have heard that implementing the fast radix-16 divider in Penryn was a big challenge for Intel. Also, while I don't know how much area an FP multiplier takes, I would guess that hyperpipelining may take more area (for example, to store intermediate results in a multiplication loop) than implementing an additional multiplier. Anyway, even in Netburst Intel implemented "double pumping" only for some integer ops and decided not to implement a double-pumped ALU for complex integer operations such as divide/multiply.