That's an interesting point about the decode bandwidth, especially since AMD increased the I-cache fetch bandwidth to 256 bits. Why doesn't Intel have a similar problem? You seem to be implying that AMD is bottlenecked by the front-end. That sounds like low-hanging fruit, though: increasing the number of decoders is conceptually simple. They don't need to double the decoder count; why not just add one more? Both AMD and Intel chips are heavily optimized, so I doubt the bottleneck is huge.
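Just to put rough numbers on why decode, rather than fetch, looks like the narrow point, here's a quick back-of-envelope sketch in Python. The average instruction length and the 3-wide decode are my assumptions (K10-style), not something from your post:

FETCH_BITS_PER_CYCLE = 256   # the 256-bit I-cache fetch mentioned above
AVG_X86_INSN_BYTES = 3.8     # assumed average x86 instruction length
DECODE_WIDTH = 3             # assumed decode width (K10-style)

fetch_bytes = FETCH_BITS_PER_CYCLE / 8              # 32 bytes/cycle
insns_fetched = fetch_bytes / AVG_X86_INSN_BYTES    # ~8.4 insns/cycle supplied
print(f"fetch supplies ~{insns_fetched:.1f} insns/cycle, "
      f"decode consumes at most {DECODE_WIDTH}")
# Under these assumptions fetch outruns decode by more than 2x, so the
# decoders, not the I-cache, would be the front-end's narrow point.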
Also, although I doubt they would need to double the number of decoders, let's assume for now that doing so is the best option for performance and area. Why would this be a "power catastrophe"? First, let me acknowledge that decoders burn a lot of power in an x86 CPU (~20% of core power, last time I checked). However, decoders are highly parallel, unlike the back-end of the CPU, and they can easily be clock-gated when not in use. Designers can also optimize them for low power by replacing dynamic logic with static logic and using high-Vt transistors, and they can preserve clock speed by adding another pipeline stage: since each macro-op is decoded independently of the others, the only cost of the longer pipeline is a slightly larger branch-misprediction penalty.
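To make the pipeline-stage point concrete, here's the rough arithmetic for what one extra decode stage costs. The branch statistics are illustrative assumptions, not measurements:

BRANCH_FRACTION = 0.20   # assumed: ~1 in 5 instructions is a branch
MISPREDICT_RATE = 0.05   # assumed: 5% of branches mispredicted
EXTRA_STAGES    = 1      # one added decode stage = +1 cycle per mispredict

extra_cpi = BRANCH_FRACTION * MISPREDICT_RATE * EXTRA_STAGES
print(f"extra CPI from the added stage: {extra_cpi:.3f}")     # 0.010
print(f"slowdown at a baseline CPI of 1.0: {extra_cpi:.1%}")  # ~1%
# About a 1% hit under these assumptions; the longer pipeline is cheap
# as long as the branch predictor is decent.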
In summary, two main points (rough numbers sketched below):
1. Decoders will be gated when not in use.
2. Decoders can be made power-efficient.
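To put rough numbers on both points (my assumptions, not measured data):

DECODE_POWER_SHARE = 0.20  # decoders ~20% of core power, as cited above
DECODE_WIDTH       = 3     # assumed current decode width
ADDED_DECODERS     = 1     # the "just add one more" case
DUTY_CYCLE         = 0.6   # assumed fraction of cycles the extra decoder
                           # is active; clock-gated the rest of the time

ungated = DECODE_POWER_SHARE * (ADDED_DECODERS / DECODE_WIDTH)
gated   = ungated * DUTY_CYCLE   # gating only saves dynamic power;
                                 # leakage is ignored in this sketch
print(f"added core power, no gating:   {ungated:.1%}")  # ~6.7%
print(f"added core power, with gating: {gated:.1%}")    # ~4.0%
# A few percent of core power is a real cost, but hardly a catastrophe.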
I agree that AMD needs to add SMT, though, or use some sort of clustering or shared-resource technique.
http://www.realworldtech.com/page.cf...1607033728&p=3