I'm wondering what the difference is between an AMD decode unit and an Intel "simple decoder" unit. Judging from the RWT link in my previous post, the AMD decode unit is more complex than its Intel counterpart (it emits 1-2 uops instead of just 1). Also, AMD does have some code fusion, although I don't think it's as heavy as Intel's version.
As for the "serious improvement rebuild", I have it on good word that Bulldozer is a complete redesign which should "put AMD back into the lead". Until then, Shanghai and its derivatives are band-aids to stem the bleeding.
Sidenote: the necessity of uop fusion just proves how out-of-date x86 has become... yes I know that x86 is Intel's biggest asset and will never die out...
My personal theory is that they'll double the issue width to 6-way by issuing two 3-instruction packets in parallel (instead of the current single packet per cycle). Each packet carries a single thread ID for multithreading. I think this would put AMD in the lead while remaining a logical evolution of their back-end.
I keep thinking that with threading taking off in the software community, Hyperthreading is a must for everybody now, which is why I am convinced they will implement it too.
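To make the packet idea above concrete, here is a toy model of it in Python. It is only a sketch of the speculation in this thread, not a description of any real AMD design: the names `Packet` and `issue_cycle`, the per-thread queues, and the round-robin choice of which thread fills the next packet are all my own assumptions. The only things taken from the post are the packet size (3), the two packets per cycle, and the rule that a packet holds instructions from one thread only.

```python
from collections import deque
from dataclasses import dataclass

PACKET_SIZE = 3        # instructions per packet (from the post)
PACKETS_PER_CYCLE = 2  # two parallel packets, i.e. 6-way issue (from the post)


@dataclass
class Packet:
    thread_id: int      # every instruction in a packet shares this thread ID
    instructions: list


def issue_cycle(queues):
    """Form up to PACKETS_PER_CYCLE packets for one cycle.

    `queues` maps a thread ID to a deque of pending instructions.
    Threads are visited round-robin (a hypothetical policy), and a
    packet never mixes instructions from different threads.
    """
    packets = []
    while len(packets) < PACKETS_PER_CYCLE:
        progressed = False
        for tid, pending in queues.items():
            if pending and len(packets) < PACKETS_PER_CYCLE:
                count = min(PACKET_SIZE, len(pending))
                group = [pending.popleft() for _ in range(count)]
                packets.append(Packet(tid, group))
                progressed = True
        if not progressed:
            break  # no thread has work left this cycle
    return packets
```

With two threads that both have work, each thread gets one packet per cycle; with a single thread, both packets come from that thread, so it sees the full 6-wide issue. Dependencies, resource conflicts and scheduling are ignored entirely.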
Pardon me for saying so, but AMD's architecture has always been much more aggressive than Intel's, especially after Intel's P4 "mistake". This is because AMD needs to make up for their 20% clock speed deficiency due to manufacturing. IIRC AMD's K8 had a similar FO4 delay to Northwood (about 10-ish), despite its obvious lead in IPC. Currently Intel has the more evolved architecture, so to speak, but that's probably the fault of AMD's execution lately rather than their architects' design aggressiveness. I'm not trying to downplay the awesome work done by Ronak and the rest of the guys in ORCA, but as far as their general architecture is concerned, it's pretty conservative, especially when compared to academia or even the DEC Alphas from the 1990s: the same Tomasulo algorithms, not even a physical register file (although with a new matrix scheduler, very nice).
Doing it the way the Intel guys did it is very complex; it took many steppings and a lot of trial and error to get from the P4 to Nehalem. I think AMD will try a more brute-force approach and duplicate the decoders, for lack of time to design it properly. They should have started in the P4 time frame, when it showed some promising improvement for 5% of the transistors in the core.