Please correct me if I'm wrong.
Amdahl's law states that if P is the proportion of a program that can be made parallel (i.e. benefits from parallelization), and (1 − P) is the proportion that cannot be parallelized (remains serial), then the maximum speedup that can be achieved by using N processors is
1 / ( (1 − P) + P/N )
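If anyone wants to play with the numbers, here's a quick Python sketch of that formula. The function name amdahl_speedup is just something made up for illustration, and it assumes P is between 0 and 1 and N is at least 1.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Upper bound on speedup for parallel fraction p on n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # Serial part (1 - p) is unchanged; parallel part p is split n ways.
    return 1.0 / ((1.0 - p) + p / n)


if __name__ == "__main__":
    # Example: a program that is half serial, half parallel.
    for n in (1, 2, 8, 64, 512):
        print(f"P=0.50, N={n:>3}: {amdahl_speedup(0.5, n):.2f}x")
```

With half the program serial, the printed speedups creep toward 2x and never pass it, however large N gets.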
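Therefore, it doesn't matter if you've got Fermi, G200, or G92: if "only" 90% of the code can be parallelized, the speedup can never exceed 10x, no matter how many SPs or CPUs you throw at it, because the serial 10% alone still takes 1/10 of the original run time.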
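Likewise, for the theoretical ceiling to even reach 512x with 512 "cores", about 99.8% of the code has to be parallel (1 − 1/512 ≈ 0.998), and even then 512 cores would only deliver about half of that ceiling.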
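Here's a rough Python sketch for sanity-checking these figures. The helper names (speedup, ceiling, fraction_for_ceiling) are made up for this post. The ceiling is just the limit of Amdahl's formula as N goes to infinity, i.e. 1 / (1 − P).

```python
def speedup(p: float, n: int) -> float:
    """Amdahl's law: bound on speedup for parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)


def ceiling(p: float) -> float:
    """Limit of the speedup as n grows without bound: 1 / (1 - p)."""
    if p >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - p)


def fraction_for_ceiling(s: float) -> float:
    """Parallel fraction needed for the ceiling to equal s (solve 1/(1-p) = s)."""
    return 1.0 - 1.0 / s


if __name__ == "__main__":
    print(f"ceiling at P=0.9:          {ceiling(0.9):.2f}x")
    print(f"P=0.9 on 10 processors:    {speedup(0.9, 10):.2f}x")
    print(f"P=0.9 on 512 processors:   {speedup(0.9, 512):.2f}x")
    p512 = fraction_for_ceiling(512)
    print(f"P needed for 512x ceiling: {p512:.4%}")
    print(f"that P on 512 processors:  {speedup(p512, 512):.2f}x")
```

By this formula, 90% parallel code gets roughly 5.3x on 10 processors and roughly 9.8x on 512, so extra processors still help a little but the curve flattens out under 10x.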
Seeing how the vast majority of programs, and even games, are barely optimized for 2 cores, you can imagine how much ingenuity this requires on the GPU side.