Well, good job pointing out some of the mechanics of how tessellation works, but you failed to mention the single reason why the Nvidia 4xx hardware will ALWAYS have a huge advantage over the ATI 5xxx series in tessellation. You did part of the work for me by mentioning that this is a 3-stage process, but you left out the most important detail. On ATI 5xxx hardware, this is a multipass operation that has to go out to board memory and back to the core at least once, and more times depending on how much tessellation is used, which slows the process down considerably. On Nvidia 4xx hardware it is also a multipass operation, but thanks to the unified cache architecture it doesn't need a trip out to board memory for each pass, only a trip to on-chip cache. That gives it very significant gains over the ATI 5xxx series in tessellation, and the same advantage carries over to any multipass shader, such as some of those used in Crysis.
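To put the argument above in rough numbers, here is a back-of-the-envelope sketch in Python. The function name `multipass_cost` and every figure in it are made up for illustration (unitless costs, not measurements of either card); the only thing it assumes is the point made above, that each hand-off between passes costs one round trip, either to board memory or to on-chip cache.

```python
def multipass_cost(passes, compute_per_pass, round_trip_cost):
    """Total cost of a multipass operation.

    Every pass pays its compute cost, and every hand-off between two
    consecutive passes pays one round trip (so passes - 1 trips in all).
    """
    if passes < 1:
        raise ValueError("need at least one pass")
    return passes * compute_per_pass + (passes - 1) * round_trip_cost


# Hypothetical, unitless costs chosen only to show the trend.
COMPUTE_PER_PASS = 10
BOARD_MEMORY_TRIP = 40   # assumed: hand-off goes out to board memory
ON_CHIP_CACHE_TRIP = 4   # assumed: hand-off stays in on-chip cache

for passes in (2, 3, 5):
    off_chip = multipass_cost(passes, COMPUTE_PER_PASS, BOARD_MEMORY_TRIP)
    on_chip = multipass_cost(passes, COMPUTE_PER_PASS, ON_CHIP_CACHE_TRIP)
    print(f"{passes} passes: board memory {off_chip}, on-chip cache {on_chip}")
```

With these invented numbers the gap widens as the pass count goes up, which is the whole argument: the more tessellation (or any other multipass work) you pile on, the more the per-pass trip cost dominates.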
EDIT: The multipass shader operation I was thinking of in Crysis is called "parallax occlusion mapping", in case anyone was wondering.






