Example: an FP multiplier has a latency of 1000 ps, i.e. 1 ns. So if I use it in one pipeline stage (for simplicity we leave out operand catching etc.), I could clock it at 1 GHz, feed it once per ns and get results at the same rate. Latency is just one cycle.
With 2 pipeline stages, plus some additional latch overhead and some inefficiency from cutting the multiplier into two roughly equally fast pieces, the overall latency could become 1100 ps. But I could clock it at about 1.8 GHz with two stages of 550 ps each. I could feed the multiplier at that rate and get results at that rate. Latency would be 2 cycles. Another slight disadvantage would be that there could be up to 2 multiplications in flight at any time vs. one in the 1-cycle version. Two muls in flight mean more power consumption. OTOH I don't increase energy per instruction.
That's the principle.
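To make the arithmetic concrete, here is a rough Python sketch of the two cases above. The function name pipeline_stats and its parameters are made up for illustration, and it assumes the total latency (including latch overhead) splits evenly across the stages, which a real design will only approximate.

```python
def pipeline_stats(total_latency_ps, stages):
    """Idealized pipeline numbers, assuming all stages are equally long.

    total_latency_ps: end-to-end latency in picoseconds, latch overhead included
    stages: number of pipeline stages
    """
    stage_ps = total_latency_ps / stages   # time per stage = clock period
    clock_ghz = 1000.0 / stage_ps          # a 1000 ps period corresponds to 1 GHz
    return {
        "stage_ps": stage_ps,
        "clock_ghz": clock_ghz,
        "latency_cycles": stages,          # one cycle per stage
        "max_in_flight": stages,           # at most one operation per stage
        "results_per_ns": clock_ghz,       # one result per cycle when kept fed
    }


if __name__ == "__main__":
    one_stage = pipeline_stats(1000, 1)    # unpipelined multiplier from the example
    two_stage = pipeline_stats(1100, 2)    # 2 stages with 100 ps total overhead
    for name, stats in (("1 stage", one_stage), ("2 stages", two_stage)):
        print(name, stats)
    print("throughput gain:", two_stage["clock_ghz"] / one_stage["clock_ghz"])
```

So under these assumptions the 2-stage version gets a bit over 1.8x the throughput, paid for with 10% more latency in absolute time and one more operation in flight.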