Awesome job, Hans! Thanks for sharing!

Quote: Originally Posted by Hans de Vries
-It would be faster even if it's still using 128 bit hardware for the 256 bit
operations since typically many time slots are unused in FP units.
Does that mean an FPU boost even for x87 and SSE code? Sounds like it... any idea how much faster? 10%?

Quote: Originally Posted by Hans de Vries
The second level TLB units for the data cache have been doubled from
512 entries to 1024 entries.
Higher virtualization performance?
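For a rough sense of scale: TLB reach is just entries times page size. Here is a quick sketch, assuming 4 KiB pages (the page size is my assumption, it isn't stated in Hans's post, and `tlb_reach_bytes` is only an illustrative helper):

```python
def tlb_reach_bytes(entries: int, page_size_bytes: int = 4096) -> int:
    """Memory covered by a TLB: number of entries times the page size.

    The 4 KiB default page size is an assumption for illustration.
    """
    return entries * page_size_bytes


old_reach = tlb_reach_bytes(512)    # second-level data TLB before the change
new_reach = tlb_reach_bytes(1024)   # second-level data TLB after doubling

print(old_reach // 2**20, "MiB ->", new_reach // 2**20, "MiB")
```

So the doubling takes the covered memory from 2 MiB to 4 MiB under that assumption. Whether that actually shows up as a virtualization gain would depend on the workload.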

Quote: Originally Posted by Hans de Vries
There is extra integer logic. A good guess would be a faster version
of the Integer divider. One that can produce multiple result bits/cycle
like the ones in the Core2 and Nehalem architecture.
That would be nice! A preview of what's to come in Bulldozer?

Quote: Originally Posted by Hans de Vries
(Note that in this kind of cases there is no advantage from HT for Sandy
Bridge since a single thread already utilizes 100% of the resources)
Regards, Hans
Hmm, really? I didn't know that...
Hmm, do you remember when people started talking about reverse hyper-threading? Intel can split the FPU between threads, i.e. Hyper-Threading... AMD is going to use one FPU for two integer cores... This is what people could have interpreted or misunderstood as reverse hyper-threading, right?

Does anybody know how much work needs to be done to offload FPU code like AVX to the GPU cores? Any idea?

Quote: Originally Posted by terrace215
Hooray! I must give full credit though-- it was YOUR example, after all, I just had to compare the 256bit hardware implementation, and not let you get away with obscuring the difference through the initial latency
You come off as extremely rude... just an FYI...