Thread: Macro/Micro architectural improvements of K8L (K10) over K8

  #11
    Quote Originally Posted by vitaminc
    He should not need to, as DDB is merely posting second-hand information.

    EDIT: if DDB wrote an article about it without posting slides, then yes, Gojdo should credit him. No one but AMD should receive credit for posting slides online.
    To be clear, Dresdenboy did not post slides. He compiled and posted this list yesterday:

    Quote Originally Posted by DDB_WO
    * Comprehensive Upgrades for SSE
    - Dual 128-bit SSE dataflow
    - Up to 4 double-precision FP ops/cycle
    - Dual 128-bit loads per cycle
    - Can perform SSE MOVs in the FP “store” pipe
    - Execute two generic SSE ops + SSE MOV each cycle (+ two 128-bit SSE loads)
    - FP scheduler can hold 36 dedicated 128-bit ops
    - SSE Unaligned Load-Execute mode
    Remove alignment requirements for SSE ld-op instructions
    Eliminate awkward pairs of separate load and compute instructions
    Improve instruction packing and decoding efficiency
    * Advanced branch prediction
    - Dedicated 512-entry Indirect Predictor
    - Doubled return stack size
    - More branch history bits and improved branch hashing
    * 32B instruction fetch
    - Benefits integer code too
    - Reduced split-fetch instruction cases
    * Sideband Stack Optimizer
    - Perform stack adjustments for PUSH/POP operations “on the side”
    - Stack adjustments don’t occupy functional unit bandwidth
    - Breaks serial dependence chains for consecutive PUSH/POPs
    * Out-of-order load execution
    - New technology allows load instructions to bypass:
    Other loads
    Other stores which are known not to alias with the load
    - Significantly mitigates L2 cache latency
    * TLB Optimizations
    - Support for 1G pages
    - 48-bit physical address
    - Larger TLBs key for:
    Virtualized workloads
    Large-footprint databases and transaction processing
    - DTLB:
    Fully-associative 48-way TLB (4K, 2M, 1G)
    Backed by L2 TLBs: 512 x 4K, 128 x 2M
    - ITLB:
    16 x 2M entries
    * Data-dependent divide latency
    * More Fastpath instructions
    - CALL and RET-Imm instructions
    - Data movement between FP & INT
    * Bit Manipulation extensions
    - LZCNT/POPCNT
    * SSE extensions
    - EXTRQ/INSERTQ
    - MOVNTSD/MOVNTSS
    * Independent DRAM controllers
    - Concurrency
    - More DRAM banks reduce page conflicts
    - Longer burst length improves command efficiency
    * Optimized DRAM paging
    - Increase page hits
    - Decrease page conflicts
    * History-based pattern predictor
    * Re-architect NB for higher BW
    - Increase buffer sizes
    - Optimize schedulers
    - Ready to support future DRAM technologies
    * Write bursting
    - Minimize Rd/Wr Turnaround
    * DRAM prefetcher
    - Track positive and negative, unit and non-unit strides
    - Dedicated buffer for prefetched data
    - Aggressively fill idle DRAM cycles
    * Core prefetchers
    - DC Prefetcher fills directly to L1 Cache
    - IC Prefetcher more flexible
    2 outstanding requests to any address
    * Shared L3
    - Victim-cache architecture maximizes efficiency of cache hierarchy
    - Fills from L3 leave likely shared lines in the L3
    - Sharing-aware replacement policy
    Last edited by oldblue; 02-09-2007 at 11:21 AM.
