There are always going to be significant IPC differences between different types of workloads. Some algorithms will naturally make better use of certain architectures than others.

The way I read "constant IPC" was as maintaining a consistently high IPC within any given workload. As opposed to, say, blazing through the arithmetic only to stall hard on a branch mispredict. I think the new branch prediction, prefetching, beefy decode, cache hierarchy, etc. back that up: it's all there to keep the reduced number of ALUs consistently fed. If they were going for maximum IPC they could have added extra execution units as well. That would benefit certain types of code with high ILP, but the extra units would be hard to keep fed on average, so IPC would fluctuate more (see the sketch below).
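To make that concrete, here's a rough sketch in C (my own illustration, not anything from the original discussion; the function names and array sizes are made up). The first loop has lots of independent work per iteration, so extra ALUs on a wide core can actually be put to use. The second is dominated by a data-dependent branch: if the values are random-ish, IPC hinges on the branch predictor rather than on how many execution units exist, and each mispredict stalls the pipeline.

```c
#define N 4096

/* High ILP: four independent accumulators per iteration.
 * Extra execution units help here because the adds don't
 * depend on each other and can issue in parallel. */
long sum_independent(const int *a) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < N; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}

/* Branch-bound: a data-dependent branch on unpredictable values.
 * IPC here depends on the branch predictor, not on ALU count;
 * a mispredict flushes in-flight work and IPC craters briefly. */
long sum_above_threshold(const int *a, int threshold) {
    long s = 0;
    for (int i = 0; i < N; i++) {
        if (a[i] > threshold)   /* hard to predict if data is random */
            s += a[i];
    }
    return s;
}
```

The point being: a design tuned for "constant IPC" spends its transistors making the second kind of loop behave more like the first, rather than adding width that only the first kind of loop could ever use.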