Thread: Ex AMD designer: Bulldozer to disappoint

  #11 · Xtreme Member · Join Date: Apr 2010 · Posts: 145
    Quote Originally Posted by Dresdenboy:
    Even the 180 nm single core Opteron dies (no, they weren't sold) were large.
    Are there any die sizes or die pictures of these?

    Another post from a source of unknown veracity on the same thread:

    Hi cmaier,

    as you seem to have a lot of insight into the development of Bulldozer, you might find the following quote, with its many bits of information, interesting. I received it from someone (who is not a native English speaker, it seems) whom I don't want to disclose. Maybe you can confirm it. He also sent me a slide, which has to be kept secret, so I can't post it here. I also heard of an unchanged instruction cache, 16 KB 4-way L1 data caches per core, and up to 2 MB of L2 per module.

    "So as you can see, there are 4 full integer pipelines per core, capable of executing up to 4 instructions per cycle or, at lower utilization, of running two branch paths, eliminating branch misprediction.

    It can fetch from several threads (program pointers) in alternation, including possible branch targets. For that to work, the branch prediction unit (BPU) tries to identify branches and their targets and steers the IFU. If the instruction queues of the units to be fed are already well filled, the IFU/BPU pair tries to prefetch code to avoid idle cycles. Having the right code bytes prefetched in 50% of all fetches is still better than having no code ready at all; in reality this number is even better.

    After a block of 32 code bytes is fetched and queued in an instruction fetch queue, the decode unit receives one such packet each cycle for decoding. To decode it quickly, it has four dedicated decode subunits, each of which can decode most x86 and SIMD instructions on its own and quickly (1 per cycle per subunit). Rarely used or complex instructions are decoded using microcode storage (ROM and SRAM). This can happen in parallel with the decoding of the "simple" instructions. There are no inefficiencies like in K10. XOP and AVX instructions are decoded either one per subunit (if the operand width is <= 128 bit) or one per two subunits (256 bit, similar to the double decode of SSE and SSE2 instructions in K8). The result is "double mops" (pairs of micro-ops, similar to the former MacroOps). After decoding finishes, the double mops (which can have one unused slot) are sent to the dispatch unit, which prepares packets of up to four double mops (dispatch packets) and dispatches them to the cores or the FPU depending on their scheduler fill status.

    Already-decoded mops are also written to the corresponding trace cache, to be reused if the code has to be executed again (e.g. in loops). Thanks to these caches, the actual decode units are freed up and can be used to decode code bytes further down the program path. If a needed dispatch packet is already in the cache, the dispatcher can dispatch that packet to the core needing it and, in parallel, dispatch another packet (from the decoders or the other trace cache) to the second core. So there won't be any bottleneck here.

    The schedulers in the cores and the FPU select the mops ready for execution by the four pairs of ALUs and AGUs per core, depending on available execution resources and operand dependencies. In doing so there is more flexibility than in the K10, with its separate lanes and the inability of ops to switch lanes to find a free execution resource. To save power, the execution units are only activated when mops needing them become ready for execution. This is called wakeup.

    The integer execution units - arithmetic logic units (ALUs) and address generation units (AGUs) - are organized in four pairs, one per instruction pipeline. They can execute x86 integer code and memory ops (also for FP/SIMD ops) and, which is the biggest change, can be combined to execute SSE or AVX integer code. This increases throughput significantly and offloads the FP units somewhat. The general purpose register file (GPRF) has been widened to 128 bit to allow for this feature. Registers are copied between the GPRF and the floating point register file (FPRF) if an architectural SIMD register (a register specified by the ISA) is used first for integer and later for floating point, or vice versa. Since this doesn't happen often, it has practically no impact on performance. With the option to use the integer units for integer SIMD code (SSE, XOP and AVX), the overall throughput of SIMD code increases dramatically.

    The FPU contains the already-known two 128-bit wide FMAC units. These can execute either one of the new fused multiply-add (FMA) instructions or, alternatively, a floating point add and a mul operation (or other types of operations covered by the internal fpadd and fpmul units). This ability provides both lower energy consumption and higher throughput for the simpler operations. As AMD has already stated, the two 128-bit units are either used in parallel by the two threads running on the integer cores or, in cycles where one core doesn't need the FPU, can both be used by a single thread, increasing its FP throughput. This happens on a per-cycle basis and resembles a form of SMT. The FPU scheduler communicates with the cores so that they can track the state of each instruction belonging to the threads running on them.

    Both the integer and the floating point units need data to work with. This is provided by the two 16 KB L1 data caches. Each core has its own data cache and load store unit (LSU). The load store unit handles all memory requests (loads and stores) of the thread running on the same core and of the shared FPU. It can serve two loads and one store per cycle, each of them up to 128 bit wide. This results in a load bandwidth of 32 B/cycle and a store bandwidth of 16 B/cycle - per core. A big change compared to the LSU of the K10 is the ability to do data and address speculation. So even without knowing the exact address of a memory operation (which isn't known before the mop executes in an AGU), the unit uses access patterns and other hints to speculate whether some data is the same as other data whose address is already known. And finally, the LSU can also execute all memory operations out of order, not only loads. To make all this possible without too much effort, the engineers at AMD added the ability to create checkpoints at any point in time, go back to such a point, and replay the instruction stream in case of a misspeculation.

    To reduce the number of mispredicted branches and the latency of the resulting fetch operations, the branch predictors have been improved. They can predict multiple branches per cycle and can issue prefetches of code bytes that might be needed soon. Together with the trace caches, it is often the case that even after a branch misprediction (which is only known after the branch instruction executes), the correct dispatch packets are already in the trace cache and can be dispatched from there with low latency.

    One big feature, which improves performance a lot, is the ability to clock units at different frequencies (provided by flexible and efficient clock generators), to power off any idle subunit, and to adapt the sizes of caches, TLBs and some buffers and queues to the needs of the executing code. A power controller keeps track of the load and power consumption of each of the many subunits and adapts clocks and units as needed. Furthermore, it increases the throughput and power consumption of heavily loaded units as long as the processor doesn't exceed its power consumption and temperature limits. For example, if the queues and buffers of core 0 are filled and the FPU is idle, the power controller will switch off the FPU (until it is woken up to execute FP code) and increase the clock frequency of core 0. If core 0 doesn't have that many memory operations (less pressure on the cache), the cache might be downsized to 8 KB 2-way by switching off 2 of its 4 ways. This way, the power the processor is allowed to use is directed to where it is needed instead of driving idle units. This is called Application Power Management, as you might have heard in some rumors on the net."
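
    To make the decode/dispatch scheme described above a bit more concrete, here is a toy Python model of a single decode cycle: four subunits, 256-bit AVX ops occupying two of them, and the results grouped into a dispatch packet of up to four double mops. Everything beyond the numbers in the quote (names, the queue format) is my own guesswork, not from the source.

    Code:
    # Toy model of the claimed decode scheme: four decode subunits, where an
    # instruction of <= 128 bit operand width occupies one subunit and a
    # 256-bit AVX instruction occupies two. Purely illustrative.
    SUBUNITS_PER_CYCLE = 4   # four dedicated decode subunits
    PACKET_SIZE = 4          # up to four double mops per dispatch packet

    def decode_cycle(instruction_queue):
        """Consume instructions until the four decode slots are used up.
        Each instruction is a (name, operand_width_bits) tuple."""
        slots_left = SUBUNITS_PER_CYCLE
        double_mops = []
        while instruction_queue and slots_left > 0:
            name, width = instruction_queue[0]
            cost = 2 if width == 256 else 1   # 256-bit ops need two subunits
            if cost > slots_left:
                break                         # doesn't fit this cycle
            instruction_queue.pop(0)
            slots_left -= cost
            double_mops.append((name, width)) # one double mop per decoded op
        # The dispatch unit groups up to four double mops into one packet.
        return double_mops[:PACKET_SIZE]

    queue = [("add", 64), ("vaddps", 256), ("mul", 64), ("xorps", 128)]
    print(decode_cycle(queue))  # the 256-bit op eats two of the four slots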
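
    The trace cache claim - cached dispatch packets letting one core be fed from its trace cache while the decoders serve the other - could look roughly like this; the capacity and the eviction policy are pure assumptions on my part.

    Code:
    # Sketch of the described dual dispatch: on a trace cache hit, the
    # dispatcher sends the cached packet to the core that needs it and, in
    # the same cycle, a freshly decoded packet to the other core.
    class TraceCache:
        def __init__(self, capacity=64):   # capacity is an assumption
            self.packets = {}              # fetch address -> dispatch packet
            self.capacity = capacity

        def lookup(self, addr):
            return self.packets.get(addr)

        def fill(self, addr, packet):
            if len(self.packets) >= self.capacity:
                self.packets.pop(next(iter(self.packets)))  # crude eviction
            self.packets[addr] = packet

    def dispatch_cycle(addr0, addr1, tc0, tc1, decoded_packet):
        """Return (packet for core 0, packet for core 1) for one cycle."""
        hit0, hit1 = tc0.lookup(addr0), tc1.lookup(addr1)
        if hit0 and hit1:
            return hit0, hit1              # both cores fed from trace caches
        if hit0:
            return hit0, decoded_packet    # decoders feed core 1
        if hit1:
            return decoded_packet, hit1    # decoders feed core 0
        return decoded_packet, None        # decoders can feed only one core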
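
    The per-cycle sharing of the two 128-bit FMAC pipes between the module's cores, as I understand the description, would amount to something like this deliberately simplified arbiter.

    Code:
    # Per-cycle FMAC allocation: normally one pipe per core, but if one core
    # has no FP work this cycle, the other core may take both pipes.
    def allocate_fmacs(core0_fp_ops, core1_fp_ops):
        """core*_fp_ops: number of FP mops each core has ready this cycle.
        Returns (pipes for core 0, pipes for core 1); two pipes in total."""
        if core0_fp_ops and core1_fp_ops:
            return 1, 1                      # fair split, one pipe each
        if core0_fp_ops:
            return min(2, core0_fp_ops), 0   # core 1 idle: core 0 takes both
        if core1_fp_ops:
            return 0, min(2, core1_fp_ops)
        return 0, 0                          # FPU idle (candidate for gating)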
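
    The quoted bandwidth figures follow directly from the stated port widths:

    Code:
    # Two 128-bit loads and one 128-bit store per cycle, per core.
    LOADS_PER_CYCLE, STORES_PER_CYCLE, OP_WIDTH_BITS = 2, 1, 128

    load_bw = LOADS_PER_CYCLE * OP_WIDTH_BITS // 8    # 32 bytes/cycle
    store_bw = STORES_PER_CYCLE * OP_WIDTH_BITS // 8  # 16 bytes/cycle
    print(load_bw, store_bw)                          # 32 16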
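
    The checkpoint-and-replay mechanism behind the LSU's data and address speculation would schematically work as below; this is entirely my own sketch of the general technique, not taken from the source.

    Code:
    # Take a checkpoint before a speculative load, execute with the predicted
    # value, then roll back and replay if the prediction turns out wrong.
    import copy

    class Core:
        def __init__(self):
            self.state = {"regs": {}, "store_queue": []}
            self.checkpoints = []

        def take_checkpoint(self):
            self.checkpoints.append(copy.deepcopy(self.state))

        def speculative_load(self, dst, predicted_value, real_value):
            self.take_checkpoint()
            self.state["regs"][dst] = predicted_value
            # ... dependent mops execute using the predicted value ...
            if predicted_value != real_value:          # misspeculation
                self.state = self.checkpoints.pop()    # roll back
                self.state["regs"][dst] = real_value   # replay with real data
            else:
                self.checkpoints.pop()                 # prediction was right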
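
    And the claimed cache downsizing (16 KB 4-way down to 8 KB 2-way under low memory pressure) as a toy model; the pressure metric and the threshold are invented for illustration.

    Code:
    # A 16 KB, 4-way L1 data cache that can gate off two of its ways.
    class AdaptiveCache:
        FULL_WAYS, SETS, LINE = 4, 64, 64   # 4 ways * 64 sets * 64 B = 16 KB

        def __init__(self):
            self.active_ways = self.FULL_WAYS

        def size_kib(self):
            return self.active_ways * self.SETS * self.LINE // 1024

        def adapt(self, mem_ops_per_kinstr):
            # Low pressure: gate off two ways; high pressure: re-enable them.
            self.active_ways = 2 if mem_ops_per_kinstr < 50 else self.FULL_WAYS
            return self.size_kib()

    cache = AdaptiveCache()
    print(cache.adapt(20))    # 8  -> 8 KB, 2-way
    print(cache.adapt(200))   # 16 -> back to 16 KB, 4-way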

    At least the details don't sound like the architecture will be a miss. The guy also told me that the first samples (I don't know if they're already 32 nm) run very well, with really good performance and power characteristics, already outperforming their fastest desktop chips.
    Last edited by iMacmatician; 04-27-2010 at 04:23 AM.
