Intel Core i5 Performance

**Shadowmage** · 12-15-2008, 09:21 AM

Originally Posted by Drwho?

Decoders are not highly parallel if you try to extract some code fusion early, as has been done since Conroe. Phenom I/II is limited by its 3 large decoders; Conroe/Penryn and Nehalem are up to 5 large ... with code fusion. That is a severe difference that they pay for.

Decoders are not so cold, if highly efficient. The problem is to feed your out-of-order buffers early enough to extract parallelism. At this, AMD is really late. They did catch up when they acquired the design of Athlon, but they now need to get into a serious improvement rebuild, and that is not easy; it takes years.

I'm wondering what the difference is between an AMD decode unit and an Intel "simple decoder" unit. It seems from the RWT link in my previous post that the AMD decoder unit is more complex than its Intel counterpart (1-2 uops instead of just 1). Also, AMD does have some code fusion, although I don't think it's as heavy as Intel's version.
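To put rough numbers on the decode-width argument, here is a toy Python sketch. It is not a model of any real decoder: the function name `decode_cycles`, the instruction mix, and the pairing rule (a `cmp` or `test` directly followed by a conditional jump shares one slot) are my own assumptions for illustration, and it ignores real restrictions such as complex-decoder rules or limits on fused pairs per cycle.

```python
def decode_cycles(instrs, width, fuse=False):
    """Count cycles needed to decode `instrs` with `width` slots per cycle.

    Toy assumption: with fuse=True, a "cmp" or "test" immediately followed
    by a "jcc" (conditional jump) occupies a single decode slot.
    """
    slots = 0
    i = 0
    while i < len(instrs):
        is_pair = (
            fuse
            and instrs[i] in ("cmp", "test")
            and i + 1 < len(instrs)
            and instrs[i + 1] == "jcc"
        )
        i += 2 if is_pair else 1  # a fused pair consumes two instructions
        slots += 1                # ...but only one decode slot
    return -(-slots // width)     # ceiling division


# Hypothetical loop body repeated three times (12 x86 instructions).
loop = ["mov", "add", "cmp", "jcc"] * 3
print(decode_cycles(loop, width=3))             # 3-wide, no fusion
print(decode_cycles(loop, width=4, fuse=True))  # 4-wide with fusion
```

In this made-up stream the 3-wide decoder needs 4 cycles and the 4-wide fusing decoder needs 3, so the fused case averages 4 x86 instructions per cycle here and would hit 5 on a cycle where one of its four slots holds a fused pair. The widths are illustrative, not measurements of either vendor's hardware.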

As for the "serious improvement rebuild", I have it on good word that Bulldozer is a complete redesign which should "put AMD back into the lead". Until then, Shanghai and its derivatives are band-aids to stem the bleeding until it arrives.

Sidenote: the necessity of uop fusion just proves how out-of-date x86 has become... yes I know that x86 is Intel's biggest asset and will never die out...

I keep thinking that with threading taking off in the software community, Hyperthreading is a must for everybody now; this is why I am convinced they will implement it too.

My personal theory is that they'll double the issue width to 6-way with parallel 3-instruction packets (instead of the current single-issue "packet"). Each packet has a single thread-ID for multithreading. I think that this will put AMD in the lead while keeping it a logical evolution of their back-end.
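For anyone who wants to picture that packet idea, here is a minimal Python sketch of how I imagine it. Everything in it is hypothetical: the function name `issue_schedule`, the round-robin thread pick, and the rule that a lone thread may fill both packets are my assumptions, not anything AMD has disclosed.

```python
from collections import deque


def issue_schedule(threads, packets_per_cycle=2, packet_size=3):
    """Build a per-cycle issue schedule from per-thread instruction lists.

    threads: dict mapping thread ID -> list of instructions.
    Each cycle issues up to `packets_per_cycle` packets; every packet
    holds up to `packet_size` instructions from exactly one thread.
    """
    queues = {tid: deque(instrs) for tid, instrs in threads.items()}
    schedule = []
    while any(queues.values()):
        ready = [tid for tid, q in queues.items() if q]
        cycle = []
        for slot in range(packets_per_cycle):
            tid = ready[slot % len(ready)]  # assumed round-robin pick
            q = queues[tid]
            if not q:
                continue  # this thread ran dry earlier in the cycle
            packet = [q.popleft() for _ in range(min(packet_size, len(q)))]
            cycle.append((tid, packet))  # packet tagged with its thread ID
        schedule.append(cycle)
    return schedule


# Two hypothetical threads: six instructions and three instructions.
sched = issue_schedule({0: ["a"] * 6, 1: ["b"] * 3})
for n, cycle in enumerate(sched):
    print(n, cycle)
```

With two threads ready, each cycle carries one packet per thread; with only one thread ready, both packets come from it, which is where the 6-wide single-thread case would come from under these assumptions.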

Doing it the way the Intel guys did it is very complex; it took many steppings and a lot of trial and error to figure out, from the P4 to Nehalem. I think AMD will try a more brute-force approach and duplicate the decoders, because of the lack of time to design it. They should have started in the P4 time frame, when it showed some promising improvement for 5% of the transistors in the core.

Pardon me for saying so, but AMD's architecture has always been much more aggressive than Intel's, especially after Intel's P4 "mistake". This is because AMD needs to make up for their 20% clock speed deficiency due to manufacturing. IIRC AMD's K8 had a similar FO4 delay to Northwood (about 10-ish), despite its obvious lead in IPC. Currently Intel has the more evolved architecture, so to speak, but that's probably the fault of AMD's execution lately rather than their architects' design aggressiveness. I'm not trying to downplay the awesome work done by Ronak and the rest of the guys in ORCA, but as far as their general architecture is concerned, it's pretty conservative, especially when compared to academia or even the DEC Alphas from the 1990s: same Tomasulo algorithms, not even a physical register file (although with a new matrix scheduler, very nice).
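As a back-of-the-envelope check on that clock deficit, here is a small Python sketch. It assumes the usual first-order model that performance scales as IPC times clock frequency, and it reads "20% deficiency" as AMD running at 0.8x the competing clock; both are my simplifications, and the function name `required_ipc_gain` is made up for this post.

```python
def required_ipc_gain(clock_deficit):
    """Fractional IPC advantage needed to break even on performance.

    Assumes performance = IPC * clock, and that the slower chip runs at
    (1 - clock_deficit) times the faster chip's frequency.
    """
    if not 0.0 <= clock_deficit < 1.0:
        raise ValueError("clock_deficit must be in [0, 1)")
    return 1.0 / (1.0 - clock_deficit) - 1.0


# A 20% clock deficit needs roughly a 25% IPC lead just to tie.
print(f"{required_ipc_gain(0.20):.0%}")
```

So under this model a 20% lower clock has to be paid back with about 25% more work per cycle, which is one way to read why the design has to be aggressive on IPC.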
