That ~10-15% figure was my estimate as well. It depends, of course, on the actual code being run. It will be possible to construct cases where Cayman is slower than Cypress (unless Cayman has significantly more than 1920 SPs). In general, though, it will gain the most in situations where the VLIW5 architecture fared worst in comparison to Nvidia.
Actually, it is already known how the VLIW4 units will be organized. The code path for that architecture has been functional in the driver since Catalyst 10.4; I posted some details about that over at B3D 10 days ago.
Transcendental functions are computed by the x, y and z units working together (much as double precision is already handled today, except that a transcendental takes 3 slots), so 3 of the 4 slots of the VLIW unit are used to calculate one transcendental. The fourth slot (w) does not take part and remains free for another instruction in the same cycle. That means a good part of the t unit was split into three parts and distributed to the x, y and z units.
Another job of the t unit was format conversion and rounding. This functionality has been replicated in all subunits, so Cayman will fly on this kind of work.
24-bit integer arithmetic is now fully supported by Cayman and can be done in all 4 slots (Evergreen had only partial support, which was not really used).
A 32-bit integer multiplication will unfortunately block all 4 slots (on Evergreen it could be done by the t unit, leaving the xyzw slots free for other instructions), but this is probably the price to pay for the transistor savings from the change.
All other integer instructions can still be done in all 4 slots (as before).
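To make the slot accounting above a bit more concrete, here is a rough Python sketch. The op labels and the greedy packer are made up for illustration and are not real ISA mnemonics or the actual compiler's scheduler; it also assumes all ops are independent, which real shader code rarely is.

```python
# Slots occupied in one VLIW4 bundle, following the description above.
# The labels are hypothetical, chosen only for this example.
VLIW4_SLOTS = 4

SLOT_COST = {
    "alu": 1,             # ordinary SP float or integer op
    "int24": 1,           # 24-bit integer arithmetic, any slot
    "convert": 1,         # format conversion / rounding, any slot
    "transcendental": 3,  # x, y and z cooperate; w stays free
    "int32_mul": 4,       # blocks the whole bundle
}

def bundles_needed(ops):
    """Greedy first-fit packing of independent ops into 4-slot bundles.

    Only counts slots; it ignores dependencies, register port limits
    and which specific slot (x/y/z/w) an op lands in.
    """
    free_slots = []  # free slots left in each bundle opened so far
    for op in sorted(ops, key=lambda o: SLOT_COST[o], reverse=True):
        cost = SLOT_COST[op]
        for i, free in enumerate(free_slots):
            if free >= cost:
                free_slots[i] -= cost
                break
        else:
            # No existing bundle has room, so open a new one.
            free_slots.append(VLIW4_SLOTS - cost)
    return len(free_slots)
```

For example, `bundles_needed(["transcendental", "alu"])` fits into a single bundle because the w slot is still free, while `bundles_needed(["int32_mul", "alu"])` needs two.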
Double precision instructions behave the same way as on Cypress. Everything involving a multiplication (MUL, FMA) takes 4 slots, while the rest (like ADD and conversions) takes 2 slots. Since a DP MUL or FMA occupies the whole VLIW4 unit in a cycle where four SP instructions could otherwise issue, the DP:SP ratio is 1:4.
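The ratio follows directly from the slot counts. A minimal sketch of the arithmetic, assuming one SP instruction per slot per cycle (the dictionary keys are just labels for this example):

```python
from fractions import Fraction

VLIW4_SLOTS = 4

# Slots one double precision instruction occupies, per the description above.
DP_SLOTS = {"mul": 4, "fma": 4, "add": 2, "convert": 2}

def dp_to_sp_ratio(op):
    """DP issue rate relative to SP for one VLIW4 unit.

    SP: VLIW4_SLOTS instructions per cycle (one per slot, assumed).
    DP: VLIW4_SLOTS / DP_SLOTS[op] instructions per cycle.
    """
    dp_per_cycle = Fraction(VLIW4_SLOTS, DP_SLOTS[op])
    sp_per_cycle = Fraction(VLIW4_SLOTS, 1)
    return dp_per_cycle / sp_per_cycle
```

`dp_to_sp_ratio("fma")` gives 1/4, which is the usual headline number since peak FLOPS are counted with FMA, while `dp_to_sp_ratio("add")` gives 1/2.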



