Great, thanks!
Regarding SMT and single-threaded improvements... SMT in itself indeed does nothing to improve single-threaded performance. You must look at the whole design process and the decisions made. In the Nehalem case, some strategy planning occurred first: if I recall correctly, the question was whether to build relatively narrow cores, which would be very suitable for server loads; to focus entirely on a broad, single-thread-oriented core; or to build a broad core that uses SMT to make its efficiency acceptable for servers (whether or not to use SMT was indeed a question; the chip designer admitted it's quite hard to pull off SMT, so it would require a lot of resources, both money and time). They chose the last option, so that their architecture would be suitable for both server and client. Servers have high margins - if it weren't for SMT, they couldn't have justified building a very broad core capable of handling single-threaded work so well.
SMT does zero for single-threaded workloads, so I don't know what you mean by this. In Bulldozer, the shared front end does all the work of instruction dispatch, and the integer schedulers do the actual work according to the dispatcher's orders. The FP unit in a Bulldozer module takes a full SMT approach, owing to the latency-tolerant nature of the instructions it deals with. The L2 is shared dynamically by the 2 running threads (so SMT in essence, as AMD describes it in one of its slides). Everything else in the module uses a vertical MT organization (switching back and forth between the threads).
So while the integer cores do have a common L1 instruction cache and front end, they are fully independent. They can do opportunistic prefetch into the L2, and the data can then be used by either of the cores. The FP unit is either dedicated or shared. Dedicated, it can assign a whole 256-bit FMA to one core; shared, it is SMT-organized and serves 2 cores as 2 x 128-bit FMA (even being able to do 2x ADD or 2x MUL, a feat not possible on any of today's x86 cores).
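To make the dedicated/shared FP behaviour described above a bit more concrete, here is a toy sketch in Python. It is only an illustration under my own assumptions, not AMD's actual scheduler: the names (`PIPES`, `pipes_needed`, `issue_cycle`) are hypothetical, and the simple "first request in the list wins" arbitration is something I made up for the example. The only thing taken from the description is that the unit has two 128-bit FMA pipes and a 256-bit op needs both.

```python
# Toy model of a shared FP unit with two 128-bit FMA pipes.
# All names and the arbitration policy are hypothetical, for illustration only.

PIPES = 2          # two 128-bit FMA pipes per module, as described above
PIPE_WIDTH = 128   # bits per pipe


def pipes_needed(op_width_bits):
    """Number of 128-bit pipes one FP op occupies (a 256-bit op takes both)."""
    return -(-op_width_bits // PIPE_WIDTH)  # ceiling division


def issue_cycle(requests):
    """Pick the FP ops that can issue in a single cycle.

    requests: list of (core_id, op_width_bits) in priority order
              (the priority order itself is an assumption of this sketch).
    Returns the requests issued this cycle; the others would have to wait.
    """
    free = PIPES
    issued = []
    for core_id, width in requests:
        need = pipes_needed(width)
        if need <= free:
            issued.append((core_id, width))
            free -= need
    return issued
```

In this sketch, a single 256-bit request from one core takes the whole unit for that cycle ("dedicated"), while two 128-bit requests, one from each core, issue side by side ("shared").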
In the case of BD, AMD could have invested the transistor budget of the second INT cluster in making the design broader and in implementing SMT to make it more efficient in multithreaded environments. You could end up with a core / module of the same size, but better for single-threaded workloads, though with the downsides you already mentioned.
I'm not sure whether Intel suffers from that. The integrated memory controller, the HyperTransport-like bus, even the three-level cache structure were all implemented by AMD first, and Intel adopted them later. One could say the Core 2 architecture was inspired by the thinking that led to Athlon too - a relatively short pipeline and a focus on IPC rather than frequency. In fact, all this led Tom's Hardware to title their Nehalem review "Architecture by AMD?" ( http://www.tomshardware.co.uk/Intel-...iew-31375.html ).
Intel is suffering from "not invented here" syndrome. They don't like using ideas developed by their competitors unless they really have to (think AMD64 ISA). They had some of the people behind the "CMT" approach working at Intel back in the day, e.g. Andy Glew, but they never really backed that idea. After Glew moved to AMD, he presented them with the same concept. Coincidentally, some AMD folk were looking in a similar direction, and they decided to pursue the CMT approach, roughly in 2005. 1 year before that Glew left AMD. Chuck Moore gave a presentation back in 2005 describing CMT as the best choice for next-gen multithreaded MPUs from a perf./watt/mm^2 perspective. Also, Fred Weber hinted back in 2005 at where AMD was heading with their next gen of multithreaded CPUs.
If there's a reason why Intel chose not to implement CMT, I'd guess it's because Intel never separated INT and FP the way AMD did and still does, so they can't do the FlexFP trick as easily. This would reduce the gains in area efficiency, I suppose.
I do dare to guess that we'll see some form of CMT going on with Haswell...



