Quote:
Pat specifically mentioned some performance numbers for memory disambiguation. According to him, moving loads ahead of stores can improve performance by up to 20-25% on some workloads. Since he is not (by his own admission) an architect, these numbers should be taken with a grain of salt. That said, Pat is known for an impressive memory; he cited the precise transistor counts of two Intel CPUs off the top of his head. The 25% mark seems like a good upper bound, and the average is probably in the 15-20% range. He also gave some rough figures for how well macro-op fusion (i.e. combining x86 instructions) and micro-op fusion (combining uops) work. Apparently, the two techniques combined make the CPU effectively 40% wider in some situations, which is a very impressive gain in IPC, even if the average is lower.
Quote:
If you've seen the above diagram before, you'll notice that there's a new number in there. As far as I know, the Tokyo presentation is the first time Intel has disclosed this much information about Core 2's fetch and predecode phases:
18-deep instruction queue
6 instructions can be written per cycle (by PreDecode)
5 instructions can be read per cycle
Implements a single "Macro-fusion" per cycle
Delivers complete instructions to the Decode stage