Decoders are less important than you think
Quote:
Originally Posted by ethernal
Actually, it seems that the Core architecture would be a superior system to deploy HyperThreading on compared to the P4.
It seems to be a common misconception that the purpose of HyperThreading is to deal with pipeline stalls. This is not entirely accurate (though it helps, I guess). The real purpose of HyperThreading is to maximize the use of the processor's parallel instruction execution. After all, the P4 is a 3-issue core that can theoretically do 3 instructions in parallel in one cycle. However, it is very, very rare that a single thread is capable of using all 3 execution units at once.
HyperThreading allows multiple threads to mix together in a single clock cycle to try to maximize the use of the execution units in parallel. This is a rather poor explanation (and wrong on so many levels), but I think it makes the most sense. In a traditional CPU, let's say the processor manages to use out-of-order execution (OOE) to run two operations at once. Let's say it uses the integer execution unit and the FPU execution unit (a gross oversimplification, but go with it). However, let's say the processor has another integer execution unit. It has to go idle, because the processor couldn't find anything to fill it with.
With HyperThreading, however, it is possible for the CPU to take another thread and say, "Well, hrm, this has an integer operation I could sneak in... let's run this in parallel with the other thread to fill all three of my execution units! Sweet!" This actually increases the efficiency of the processor, because it's doing more work in a single cycle. This is how a processor can magically do a bit more work with HT enabled in many cases. In regular processing, CPUs are extremely wasteful. Even with the most advanced OOE algorithms, much of the time many of the CPU's execution units go unused.
The Core architecture would be able to take advantage of its 4-issue core (technically 5-issue if you include micro- and macro-op fusion) with HT much more so than the Pentium 4's 3-issue core. Once again, this is a gross oversimplification, but you get the general idea.
I would assume there are numerous reasons why Intel did not include HT in the new architecture. First and foremost, I think they figured that for everything besides servers, dual cores are enough to deal with all of the threads an average CPU would run. After all, how often do you max out both cores? The only things I can think of are rendering and encoding, which the average user simply doesn't do. Even in multi-threaded games, there is usually a heavy bias toward one core or the other, and there is still plenty of idle time on the extra core. In short, it was better to use the transistor space for other things to increase single-threaded performance. The second reason has already been mentioned: perhaps the advanced micro- and macro-op fusion wreaked havoc on trying to use HT for whatever reason.
I wouldn't really expect HT to show up on the XE CPUs. I would expect it to show up on Woodcrest CPUs, because that is where you are most likely to gain performance from HT: heavily threaded multi-user environments.
Then again, who knows what Intel is thinking. I'm sure there was a good reason not to include HT. After all, HT has been known to slow some things down. Maybe adding HT cluttered up the processor too much, adding a lot of overhead in different places and unnecessary complexity. Who knows.
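That port-filling idea from the quote is easier to see with a toy example. This is just a sketch, not how any real scheduler works; the unit mix (two integer units, one FP unit) and the per-cycle instruction mixes are invented purely for illustration:

```python
# Toy illustration of SMT "port filling": two threads sharing execution units.
# The unit mix and instruction mixes are made up for the example, not taken
# from any real core.

UNITS = ["int", "int", "fp"]          # a pretend 3-issue core: 2 integer units, 1 FP unit

def units_used(threads):
    """Greedily fill each unit with a ready micro-op from any thread this cycle."""
    pending = [list(ops) for ops in threads]   # copy so callers keep their lists
    busy = 0
    for unit in UNITS:
        for ops in pending:
            if unit in ops:
                ops.remove(unit)
                busy += 1
                break
    return busy

# One integer-heavy thread: the FP unit has nothing to do this cycle.
print(units_used([["int", "int", "int"]]))          # -> 2 of 3 units busy

# Add a second thread that has an FP op ready: all three units get filled.
print(units_used([["int", "int", "int"], ["fp"]]))  # -> 3 of 3 units busy
```

With only the integer-heavy thread the FP unit idles; with a second thread to draw from, all three units do work that cycle, which is the whole point of SMT.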
I was under the impression that in a processor like the P4 or Athlon, the decoders decode a maximum of 3 CISC (i.e. x86) instructions per clock. Internally, all modern x86 CPUs are RISC-like: each internal instruction contains only a single operation, while the CISC instructions that all the code is actually written in can contain 2-4 operations, like a load, an add, and a store, in a single instruction. Assuming 3 x86 instructions are decoded into 6-10 RISC micro-ops, you could take up all of the execution units on the Athlon, P4, or Pentium M, including the load & store units.
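To put rough numbers on that, here is an illustrative sketch of how a few x86-style instructions might break into micro-ops. It uses the usual load/op/store split, but the exact micro-op names and counts are made up for the example, not taken from any real decoder table:

```python
# Toy mapping of a few x86-style instructions to RISC-like micro-ops.
# The breakdown is illustrative only, not an actual decoder table.

MICRO_OPS = {
    "add reg, reg":   ["alu_add"],                   # register-only: 1 micro-op
    "add reg, [mem]": ["load", "alu_add"],           # load + add
    "add [mem], reg": ["load", "alu_add", "store"],  # load + add + store
}

window = ["add reg, reg", "add reg, [mem]", "add [mem], reg"]
decoded = [uop for insn in window for uop in MICRO_OPS[insn]]
print(len(window), "x86 instructions ->", len(decoded), "micro-ops:", decoded)
# 3 x86 instructions -> 6 micro-ops, enough to spread across integer,
# load, and store units in the same cycle.
```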
Read this by Johan De Gelas from Anandtech...
"A first for the x86 world, the Core architecture is equipped with four x86 decoders, 3 simple decoders and 1 complex decoder. The task of the decoders - for all current x86 CPUs - is not only to decipher the incoming instruction (opcode, addresses), but also to translate the 1 to 15 byte variable length x86 instructions into - easier to schedule and execute - fixed length RISC-like instructions (called micro-ops).
The most common x86 instructions are translated into a single micro-op by the 3 simple decoders. The complex decoder is responsible for the instructions that produce up to 4 micro-ops" (note: that's just one such decoder, whereas Athlons have 3 complex decoders) "The really long and complex x86 instructions are handled by a microcode sequencer. This way of handling the most complex, CISC-y instructions has been adopted by all modern x86 CPU designs, including the P6, Athlon (XP and 64), and Pentium 4."
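Here's a quick sketch of how that simple/complex decoder split limits decode throughput. It assumes instructions are handed to the decoders strictly in program order, with decoder 0 as the complex one; the micro-op counts are invented for the example:

```python
# Toy model of a Core-style decode group: decoder 0 handles instructions of up
# to 4 micro-ops, decoders 1-3 handle single-micro-op instructions only.
# Instruction micro-op counts are made up for the example.

def decode_one_cycle(uop_counts):
    """Return how many instructions from the front of the queue decode this cycle."""
    limits = [4, 1, 1, 1]            # micro-op capacity of each decoder slot, in order
    decoded = 0
    for uops, limit in zip(uop_counts, limits):
        if uops > limit:
            break                    # can't decode here; it and everything after wait a cycle
        decoded += 1
    return decoded

print(decode_one_cycle([3, 1, 1, 1]))   # 4: one complex instruction first, then three simple
print(decode_one_cycle([1, 3, 1, 1]))   # 1: the complex instruction has to wait for decoder 0
```

In this toy model a stream where only every fourth instruction is complex decodes 4 per clock, while back-to-back complex instructions fall to roughly 1 per clock, so the decode mix matters as much as the raw decoder count.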
You can get even more info on Wikipedia by doing RISC and CISC searches.
This would explain why Intel put two simple integer units running at twice the CPU clock on the P4, giving a maximum of four instructions per clock rather than three. It would have made sense to run them at the normal clock rate if three instructions were the maximum, since it would be so unlikely for every instruction to be a simple integer calculation, and it would probably have increased yields too. K8L is supposed to have twice as many FPU/SSE units, and they will be 128-bit, quadrupling how many instructions it can do: a maximum of twelve 64-bit instructions, and that's just floating point and SIMD. The integer units are another three, and I don't know how many the load & store units can do. Isn't that extremely inefficient if only three operations can be decoded at most? Why didn't they add any more decoders? Many critics have been skeptical that Conroe's extra decoder will increase performance, and I think this explains why.
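Here's the arithmetic behind those numbers, spelled out in a few lines of Python. The unit counts and widths are the speculative ones from the paragraph above (and the decode limit of 3 is the point of the question), not confirmed specs:

```python
# Back-of-the-envelope peak throughput, using the speculative numbers above.
decode_per_clock = 3                  # x86 instructions the decoders can supply per clock

k8_fp_units    = 3                    # current K8: 3 FPU/SSE units, 64-bit wide
k8l_fp_units   = 6                    # assumed K8L: twice as many units...
lanes_per_unit = 2                    # ...each 128-bit wide = two 64-bit operations
int_units      = 3                    # integer ALUs, unchanged in this estimate

k8_fp_peak  = k8_fp_units * 1                 # 3 64-bit FP ops per clock
k8l_fp_peak = k8l_fp_units * lanes_per_unit   # 12 64-bit FP ops per clock (4x)

print("K8 FP peak: ", k8_fp_peak, "ops/clock")
print("K8L FP peak:", k8l_fp_peak, "ops/clock, plus", int_units, "integer")
print("but the decoders only supply", decode_per_clock, "x86 instructions/clock")
```

That gap between what the decoders can supply and what the execution units could theoretically chew through is the mismatch the paragraph above is pointing at.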
Sorry this was a little off topic, but I just wanted to say that.
On topic, there have been hints from Anandtech and the INQ that AMD has something special in store to compete on the high end, most likely a very highly clocked FX, possibly on 65nm. AMD's roadmaps have always been terrible, so I wouldn't be surprised at all. Maybe this will spur Intel to release a faster CPU. The thing is, higher official clocks won't really change what the chip is capable of, so in practice it will just mean a decrease in price.