Quote Originally Posted by madcho
I seriously hope that AMD will deliver the same number of BD modules as Intel delivers cores, but I don't think this is possible.

Why? Because BD cores are a lot bigger than the older cores, and even 32nm will not be enough for that. Still, a BD module is better than one Intel core with HyperThreading; that's easy to understand, without any doubt.
On K10.5, the 6MB L3 cache takes up almost the same die space as the four cores together with their dedicated 512kB L2 caches.

Bulldozer cores will be lighter (maybe a smaller L1 cache per "core", totalling the same 64+64kB inside the dual-core module as in the K7-K10.5 releases), and L2 will stay the same or go to 1024kB per module. So they could easily squeeze in up to 6 of those modules, double the L3 cache again to 12MB (originally they mentioned that a 16MB L3 is projected for the Bulldozer CPU, afair), and still stay inside the same dimensions as previous CPU generations, K10@65nm/K10.5@45nm, ~250mm2. On top of all that there is HKMG, which supposedly should bring a huge clock jump, and they even managed to squeeze 4x3.4GHz inside a 90W TDP on the current 45nm process.
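
To put rough numbers on that guess, here is a quick back-of-the-envelope sketch. Everything in it is an assumption taken from the speculation above (6 modules, 64+64kB L1 and 1024kB L2 per module, 12MB shared L3), not a confirmed spec, and the function name is made up for illustration.

```python
# Back-of-the-envelope cache tally for the speculated 6-module Bulldozer.
# All sizes are guesses from this post, not confirmed AMD specifications.

def total_cache_kb(modules, l1_per_module_kb, l2_per_module_kb, l3_shared_kb):
    """Return (l1, l2, l3, total) on-die cache sizes in kB."""
    l1 = modules * l1_per_module_kb   # L1 summed over all modules
    l2 = modules * l2_per_module_kb   # L2 summed over all modules
    total = l1 + l2 + l3_shared_kb    # L3 is shared, so it is counted once
    return l1, l2, l3_shared_kb, total


if __name__ == "__main__":
    l1, l2, l3, total = total_cache_kb(
        modules=6,                  # assumed module count
        l1_per_module_kb=64 + 64,   # assumed 64+64kB L1 per module
        l2_per_module_kb=1024,      # assumed 1024kB L2 per module
        l3_shared_kb=12 * 1024,     # assumed 12MB shared L3
    )
    print(f"L1 {l1}kB + L2 {l2}kB + L3 {l3}kB = {total}kB ({total / 1024:.2f}MB)")
```

With these guesses it comes to 768kB L1 + 6144kB L2 + 12288kB L3 = 19200kB, i.e. 18.75MB of cache on the die. It says nothing about die area by itself; it only shows how much SRAM the ~250mm2 budget would have to hold.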

Yep, two separate cores are always better than two threads inside one core when it comes to power/performance ratio, utilization, and ease of optimization: a simpler core is easier to optimize for than the proprietary HyperThreading, which evolved from HTT(1) in the P4-HT to HTT(2) in Nehalem, and probably to some variation of HTT(3) in Sandy Bridge. So previous optimizations usually don't carry over, and you need to recompile your work yet again and optimize for HTT on top of the native SSE/AVX code optimizations. But in the end SMT should serve Intel as much as CMT serves AMD; it's just that CMT has a brighter future in terms of power efficiency (according to AMD's bragging).

Quote Originally Posted by madcho
AMD's way is beautiful, real brute force in integer: the big weakness of K8 and, to my mind, the big problem of K10. And my best hope is that AMD will, with SSE5, enable using the FPUs of the GPU for CPU calculations. That's maybe why the FPU on BD is weaker than on K10 (weaker with the same number of threads), even if you consider that BD is a new core.
SSE5 is part of the CPU "module", and until the GPU part of the CPU gets inside the "module" it won't serve as a GPU optimization, and that will probably never happen. The integrated GPU (which is not part of Bulldozer, btw) will communicate over PCIe (I had hoped for an HT/HTX bus), so it will only serve for better integration and better HTPCs (low-end server designs?). The only performance boost that "SSE5" could give would be some kind of packing that shrinks the bandwidth needed for PCIe communication, but that would benefit any device connected to that PCIe (3.0?) bus (e.g. a discrete GPU card), and I don't think they even considered that kind of tweak when they designed SSE5.


Quote Originally Posted by Helmore
One limiting factor in core scaling though, is that AMD has apparently designed its cache structure in such a manner to only allow 4 modules to share their L3 cache. For AMD this means they will have 2 separate L3 cache pools when they put 8 modules on a single die and I don't think we will be seeing 8 BD modules single die CPUs for their first generation BD chips. I could be wrong though on this one, have to read up on it again.
And what about the 6-core revisions of K10(.5) CPUs? Can't every core use the on-die L3 cache, and isn't it still 48-bit wide as in the quad-core (Deneb, PII X4)?

I think L3 sharing is pretty easy to extend to more than 4 cores once the TLB works properly in the first place (the famous pre-B3 K10 revisions).


Quote Originally Posted by JF-AMD
HT originally came about in P4 because they had a very long pipeline and one cache miss had lots of penalty associated. But as they shortened the pipeline (i.e. Core2) they tossed out HT because they no longer needed that band-aid.
Excellent, two hits without a miss; I hope more of it will come. It's refreshing to see someone on the forums who knows the real story behind all the HT mix-ups.