Quote Originally Posted by trinibwoy View Post
The funny thing is that a lot of HPC workloads are bandwidth limited and can't even make use of all the flops because they can't get data to the cores fast enough. That's why caching and the use of shared memory is so important. For example, a lot of compute workloads just don't play nice with HD4xxx cards because the LDS there doesn't really function like it should. So people resort to a lot of other trickery like using the texture units instead to pump data into the cores but that's obviously not a scalable approach. Things should be much better with HD5xxx but I haven't seen independent confirmation yet.
Yes, physics simulations will be a lot faster on Fermi than previously. I remember seeing a slide on the cache hierarchy showing roughly 3x the performance of GT200; it was a CFD workload, I think.
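trinibwoy's point about bandwidth-limited workloads can be sketched with a simple roofline estimate. The numbers below are illustrative, roughly C1060-class figures pulled from public spec sheets, and the kernel intensities are textbook values, not measurements:

```python
# Roofline model: attainable FLOP rate is capped by
# min(peak compute, arithmetic intensity * memory bandwidth).

def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Achievable rate for a kernel with the given arithmetic intensity."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

peak = 933.0   # GFLOPS, single precision (assumed, C1060-class)
bw = 102.0     # GB/s (assumed, C1060-class)

# SAXPY (y = a*x + y) does 2 flops per element while moving 12 bytes
# (read x, read y, write y; 4 bytes each) -> intensity ~0.17 flops/byte.
saxpy_ai = 2 / 12
print(attainable_gflops(peak, bw, saxpy_ai))  # ~17 GFLOPS, nowhere near peak

# Intensity needed before such a card becomes compute-bound:
print(peak / bw)  # ~9 flops per byte
```

Anything below that ~9 flops/byte threshold sits on the bandwidth wall no matter how many flops the card advertises, which is exactly why caching, shared memory, and (as a workaround) the texture units matter so much.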
Quote Originally Posted by Piotrsama View Post
No, like the jump from the 933 GFLOPS of the current C1060 to the 1040 GFLOPS of the C2050 that is coming in Q2 2010.

Which amounts to: 1.11x
Well, first off, you compared the old high-end Tesla to the new low-end Tesla. Secondly, theoretical FLOPS are not a measure of real-world performance (see Larrabee). The MUL unit in GT200 makes the card look faster on paper than it really is; it can be used, but not nearly as often as the peak number assumes. You might also want to take other factors into consideration: 6GB of RAM, FMA, the improved memory hierarchy, more bandwidth, etc.
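A back-of-envelope sketch of why the raw 1.11x ratio misleads. Core counts and clocks below are assumed from public spec sheets, so treat the figures as illustrative:

```python
# Theoretical peak = cores * clock * flops issued per clock per core.

def peak_gflops(cores, clock_ghz, flops_per_clock):
    return cores * clock_ghz * flops_per_clock

# Tesla C1060 (GT200): 240 SPs at ~1.296 GHz, counting the hard-to-use
# dual-issued MUL alongside the MAD -> 3 flops/clock/SP.
c1060 = peak_gflops(240, 1.296, 3)   # ~933 GFLOPS

# Tesla C2050 (Fermi): 448 cores at ~1.15 GHz, FMA -> 2 flops/clock.
c2050 = peak_gflops(448, 1.15, 2)    # ~1030 GFLOPS

# Drop the rarely-usable MUL from the GT200 number and the picture flips:
c1060_mad_only = peak_gflops(240, 1.296, 2)  # ~622 GFLOPS
print(c2050 / c1060_mad_only)  # ~1.66x
```

In other words, a third of the GT200 "peak" comes from an issue slot that most real kernels can't keep fed, while every flop in the Fermi figure comes from the FMA pipeline.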