Hi cmaier,
as you seem to have a lot of insight into the development of Bulldozer, you might find the following quote interesting. It contains many bits of information and I received it from someone (not a native English speaker, it seems) whom I don’t want to disclose. Maybe you can confirm it. He also sent me a slide, which has to be kept secret, so I can’t post it here. I also heard of an unchanged instruction cache, 16k 4-way L1 data caches per core and up to 2M of L2 per module.
“So as you can see, there are 4 full integer pipelines per core, capable of executing up to 4 instructions per cycle or, at lower utilization, running both paths of a branch, which eliminates the branch misprediction penalty.
It can fetch from several threads (program pointers) in alternation, including possible branch targets. For that to work, the branch prediction unit (BPU) tries to identify branches and their targets and steers the instruction fetch unit (IFU). If the instruction queues of the units to be fed are already well filled, the IFU/BPU pair tries to prefetch code to avoid idle cycles. Even if only 50% of all such prefetches bring in the right code bytes, that is still better than having no code ready at all; in reality this hit rate is even better.
After a block of 32 code bytes is fetched and queued in an instruction fetch queue, the decode unit receives such a packet each cycle for decoding. To decode it quickly, there are four dedicated decode subunits, each of which can decode most x86 and SIMD instructions on its own at a rate of 1 per cycle per subunit. Rarely used or complex instructions are decoded using microcode storage (ROM and SRAM), which can happen in parallel with the decoding of the "simple" instructions. There are no inefficiencies like in K10. XOP and AVX instructions are decoded either one per subunit (if the operand width is <= 128 bit) or one per two subunits (256 bit, similar to the double decode of SSE and SSE2 instructions in K8). The results are "double mops" (pairs of micro-ops, similar to the former MacroOps). After decoding finishes, the double mops (which can have one unused slot) are sent to the dispatch unit, which prepares packets of up to four double mops (dispatch packets) and dispatches them to the cores or the FPU depending on their scheduler fill status. Already decoded mops are also written to the corresponding trace cache, to be reused if the code has to be executed again (e.g. in loops). Thanks to these caches, the actual decode units are freed up and can be used to decode code bytes further down the program path. If a needed dispatch packet is already in the cache, the dispatcher can send that packet to the core needing it and in parallel dispatch another packet (from the decoders or the other trace cache) to the second core. So there won't be any bottleneck here.
The schedulers in the cores and the FPU select the mops that are ready for execution by the four pairs of ALUs and AGUs per core, depending on available execution resources and operand dependencies. There is more flexibility here than in the K10, with its separate lanes and the inability of ops to switch lanes to find a free execution resource. To save power, the execution units are only activated when mops needing them become ready for execution; this is called wakeup.
The integer execution units - arithmetic logic units (ALUs) and address generation units (AGUs) - are organized in four pairs, one per instruction pipeline. They can execute x86 integer code and memory ops (also for FP/SIMD ops) and, which is the biggest change, can be combined to execute SSE or AVX integer code. This increases throughput significantly and takes some load off the FP units. The general purpose register file (GPRF) has been widened to 128 bit to allow for this feature. Registers are copied between the GPRF and the floating point register file (FPRF) if an architectural SIMD register (a register specified by the ISA) is used for integer first and floating point later, or vice versa. Since this doesn't happen often, it has practically no impact on performance. Thanks to the option to use the integer units for integer SIMD code (SSE, XOP and AVX), the overall throughput of SIMD code increases dramatically.
The FPU contains the already known two 128 bit wide FMAC units. These are able to execute either one of the new fused multiply-add (FMA) instructions or, alternatively, a floating point add and a mul operation (or other types of operations covered by the internal fpadd and fpmul units). This ability provides both lower energy consumption and higher throughput for the simpler operations. As AMD has already stated, the two 128 bit units are normally used in parallel by the two threads running on the integer cores, but in cycles where one core doesn't need the FPU, both can be used by a single thread, increasing its FP throughput. This happens on a per cycle basis and resembles some form of SMT. The FPU scheduler communicates with the cores, so that they can track the state of each instruction belonging to the threads running on them.
Both the integer and the floating point units need data to work with. This is provided by the two 16k L1 data caches. Each core has its own data cache and load store unit (LSU). The load store unit handles all memory requests (loads and stores) of the thread running on that core and of the shared FPU. It is able to serve two loads and one store per cycle, each of them up to 128 bit wide. This results in a load bandwidth of 32B/cycle and a store bandwidth of 16B/cycle - per core. A big change compared to the LSU of the K10 is the ability to do data and address speculation. So even without knowing the exact address of a memory operation (which isn't known before the mop has been executed in an AGU), the unit uses access patterns and other hints to speculate whether some data is the same as other data whose address is already known. And finally the LSU is also able to execute all memory operations out of order, not only loads. To make all this possible without too much effort, the engineers at AMD added the ability to create checkpoints at any point in time and, in case of a misspeculation, to go back to that point and replay the instruction stream.
To reduce the number of mispredicted branches and the latency of the resulting fetch operations, the branch predictors have been improved. They are able to predict multiple branches per cycle and can issue prefetches of code bytes that might be needed soon. Together with the trace caches, it is often the case that even after a branch misprediction (which is only known after executing the branch instruction), the correct dispatch packets are already in the trace cache and can be dispatched from there with low latency.
One big feature, which improves performance a lot, is the ability to clock units at different frequencies (provided by flexible and efficient clock generators), to power off any idle subunit, and to adapt the sizes of caches, TLBs and some buffers and queues according to the needs of the executed code. A power controller keeps track of the load and power consumption of each of the many subunits and adapts clocks and units as needed. Furthermore, it increases the throughput and power consumption of heavily loaded units as long as the processor doesn't exceed its power consumption and temperature limits. For example, if the queues and buffers of core 0 are filled and the FPU is idle, the power controller will switch off the FPU (until it is woken up to execute FP code) and increase the clock frequency of core 0. If core 0 doesn't have that many memory operations (less pressure on the cache), the cache might be downsized to 8kB 2-way by switching off 2 of its 4 ways. This way the power the processor is allowed to use is directed to where it is needed instead of driving idle units. This is called Application Power Management, as you might have heard in some rumors on the net.“
At least the details don’t sound like the architecture will be a miss. The guy also told me that first samples (I don’t know if they are already 32nm) run very well with really good performance and power characteristics, already outperforming their fastest desktop chips.
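
To get a better feeling for some of those claims, I put together a few small sketches of my own in plain C. None of this is from the slide, it is just what the described mechanisms would mean for ordinary code. First, the claim about running both paths of a branch at low utilization: the kind of code that would profit is a data-dependent branch that predictors get wrong about half the time, like this one:

/* A data-dependent branch that predictors typically get wrong around 50%
   of the time on random input. The claim is that, at low utilization, the
   four pipelines of a core could execute both sides speculatively and
   keep the correct result instead of paying a pipeline flush. */
int clamp_sum(const int *v, int n, int limit)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (v[i] > limit)      /* hard to predict for random data */
            sum += limit;
        else
            sum += v[i];
    }
    return sum;
}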
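
The part about combining the ALU pairs for SSE/AVX integer code and copying registers between GPRF and FPRF on a domain change would look like this from the software side (SSE2 intrinsics, compiles with gcc -msse2; which units actually execute what is of course my guess based on the quote):

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    /* 128-bit integer SIMD adds - per the quote these could run on the
       paired integer ALUs instead of occupying the shared FPU. */
    __m128i a   = _mm_set_epi32(4, 3, 2, 1);
    __m128i b   = _mm_set_epi32(40, 30, 20, 10);
    __m128i sum = _mm_add_epi32(a, b);               /* PADDD */

    /* Using the same architectural XMM register in FP code afterwards is
       the "integer first, floating point later" case that would trigger
       a GPRF -> FPRF copy - rare, so supposedly cheap overall. */
    __m128 f      = _mm_cvtepi32_ps(sum);            /* CVTDQ2PS */
    __m128 scaled = _mm_mul_ps(f, _mm_set1_ps(0.5f));

    float out[4];
    _mm_storeu_ps(out, scaled);
    printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
    return 0;
}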
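
The FMAC claim is easiest to see on a plain multiply-accumulate loop. If the numbers in the quote are right, the peak would be 2 units x 4 single-precision lanes x 2 flops = 16 SP flops per cycle for a module whose FPU is used by one thread in a given cycle, or 8 per core when both threads share it:

#include <stddef.h>

/* Each iteration is one fused multiply-add if the compiler emits FMA for
   the target. With two 128-bit FMAC units that would be up to
   2 x 4 SP lanes x 2 flops = 16 SP flops/cycle per module (taking the
   quoted description at face value). */
void mul_acc(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i] + c[i];
}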
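
And the data/address speculation in the LSU matters for code where a load may or may not alias an earlier store whose address isn't known yet, for example a store to a data-dependent index followed by an independent-looking load:

/* Without speculation, the load from src[i] would have to wait until the
   address of hist[idx[i]] is known. With the described address/data
   speculation, the LSU can guess "no alias", execute the load early, and
   replay from a checkpoint if the guess turns out to be wrong. */
void scatter_then_read(float *hist, const int *idx,
                       const float *src, float *dst, int n)
{
    for (int i = 0; i < n; i++) {
        hist[idx[i]] += 1.0f;    /* store to a data-dependent address */
        dst[i] = src[i] * 2.0f;  /* load that may or may not alias hist */
    }
}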