I read a jolly good article about Fermi (GF100/GT300) in a local Indian magazine. It focused more on supercomputer applications and cache design than on anything else.

They positioned AMD's Evergreen against Nvidia's Fermi. They started off by saying both support IEEE 754-2008 with FMA, but Fermi has ECC while Evergreen is limited to EDC ("a little bit on EDC vs ECC in a real environment, etc."). The interesting part was the cache subsystems.

Let's start off with L1. Fermi has 16 SMs, each with 64KB of memory attached. This 64KB can be configured as 16KB shared memory and 48KB L1 cache, or vice versa. Evergreen has 32KB shared and 8KB L1 per unit, and these cannot be interchanged. Now the L2: Evergreen has 4 x 128KB blocks, for a total of 512KB L2, while Fermi has 768KB L2. On top of this, Evergreen also has a 64KB global data share.

Extract:


Total shared cache in Fermi: 256KB or 768KB "16KB or 48KB each"
Total L1 cache in Fermi: 768KB or 256KB "48KB or 16KB each"
Total L2 cache in Fermi: 768KB "shared"

Total shared cache in Evergreen: 640KB "32KB each"
Total L1 cache in Evergreen: 160KB "8KB each"
Total L2 cache in Evergreen: 512KB "4 blocks of 128KB each"
Global data share: 64KB "connected to L1 and L2"
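
For anyone who wants to check the totals above, here is a quick Python sketch of the arithmetic. The per-unit sizes and Fermi's 16 SMs come straight from the figures quoted here; the count of 20 units for Evergreen is my assumption, inferred from 640KB / 32KB, and the function and variable names are made up for illustration.

```python
# Per-unit sizes in KB, taken from the figures quoted in the post.
FERMI_SMS = 16         # stated in the post
EVERGREEN_UNITS = 20   # assumption: inferred from 640 KB total / 32 KB each


def totals(units, shared_kb, l1_kb):
    """Return (total shared, total L1) in KB summed over all units."""
    return units * shared_kb, units * l1_kb


# Fermi's 64 KB per SM can be split either way.
fermi_configs = {
    "16 shared / 48 L1": totals(FERMI_SMS, 16, 48),
    "48 shared / 16 L1": totals(FERMI_SMS, 48, 16),
}

# Evergreen's split is fixed at 32 KB shared and 8 KB L1 per unit.
evergreen = totals(EVERGREEN_UNITS, 32, 8)

# Evergreen's L2 is 4 blocks of 128 KB.
evergreen_l2_kb = 4 * 128

for name, (shared, l1) in fermi_configs.items():
    print(f"Fermi ({name}): shared {shared} KB, L1 {l1} KB")
print(f"Evergreen: shared {evergreen[0]} KB, L1 {evergreen[1]} KB, "
      f"L2 {evergreen_l2_kb} KB")
```

Either way you split it, Fermi's shared plus L1 total comes to 1024KB, against 800KB for Evergreen's fixed split.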

This means Fermi has a big advantage in terms of L1 and L2 flexibility. Fermi's L2 is big enough to store overhead for retracing, etc. Evergreen is not a bad machine at all, but it seems to me that Fermi may perform better than Evergreen in GPU-CPU operations that need flexibility.