Quote: Originally Posted by trinibwoy
The funny thing is that a lot of HPC workloads are bandwidth limited and can't even make use of all the flops because they can't get data to the cores fast enough. That's why caching and the use of shared memory is so important. For example, a lot of compute workloads just don't play nice with HD4xxx cards because the LDS there doesn't really function like it should. So people resort to a lot of other trickery like using the texture units instead to pump data into the cores but that's obviously not a scalable approach. Things should be much better with HD5xxx but I haven't seen independent confirmation yet.
Yes, they are 960 dwords/clock.

Of course, these are advertised figures. I would say the worst they can do is around 960/3 = 320 dwords/clock, which is still a lot more than RV770.
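
For anyone who wants to turn those per-clock figures into bytes per second, here is a rough sketch. The 960 and 960/3 dwords/clock values are the ones from this post; the 850 MHz clock is only an assumption for illustration (plug in whatever core clock your card actually runs at), and the function name is made up for this example.

```python
BYTES_PER_DWORD = 4  # a dword is 32 bits


def lds_bandwidth_gb_s(dwords_per_clock, clock_mhz):
    """Convert a dwords/clock figure to GB/s (1 GB = 1e9 bytes)."""
    bytes_per_second = dwords_per_clock * BYTES_PER_DWORD * clock_mhz * 1e6
    return bytes_per_second / 1e9


ADVERTISED_DWORDS_PER_CLOCK = 960                          # figure quoted in the post
WORST_CASE_DWORDS_PER_CLOCK = ADVERTISED_DWORDS_PER_CLOCK / 3  # the post's pessimistic guess
ASSUMED_CLOCK_MHZ = 850.0                                  # assumption, not from the post

peak = lds_bandwidth_gb_s(ADVERTISED_DWORDS_PER_CLOCK, ASSUMED_CLOCK_MHZ)
worst = lds_bandwidth_gb_s(WORST_CASE_DWORDS_PER_CLOCK, ASSUMED_CLOCK_MHZ)
print(f"peak: {peak:.0f} GB/s, worst case: {worst:.0f} GB/s")
```

Under that assumed clock, the worst-case guess is simply one third of the advertised number, so the ratio between the two stays the same whatever clock you plug in.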

Now, Fermi has 16 LDS, if I am not mistaken, and they work with 32 executions/clock. Given the huge bandwidth, I would say it is quite a bit more than what Evergreen can offer. Not so sure about dual Evergreen, though.