* Comprehensive Upgrades for SSE
- Dual 128-bit SSE dataflow
- Up to 4 double-precision FP ops/cycle
- Dual 128-bit loads per cycle
- Can perform SSE MOVs in the FP “store” pipe
- Execute two generic SSE ops + SSE MOV each cycle (+ two 128-bit SSE loads)
- FP Scheduler can hold 36 dedicated 128-bit ops
- SSE Unaligned Load-Execute mode
Remove alignment requirements for SSE ld-op instructions
Eliminates awkward pairs of separate load and compute instructions, improving instruction packing and decoding efficiency
* Advanced branch prediction
- Dedicated 512-entry Indirect Predictor
- Doubled return-stack size
- More branch history bits and improved branch hashing
* 32B instruction fetch
- Benefits integer code too
- Reduced split-fetch instruction cases
* Sideband Stack Optimizer
- Perform stack adjustments for PUSH/POP operations “on the side”
- Stack adjustments don’t occupy functional unit bandwidth
- Breaks serial dependence chains for consecutive PUSH/POPs
* Out-of-order load execution
- New technology allows load instructions to bypass:
Other loads
Other stores which are known not to alias with the load
- Significantly mitigates L2 cache latency
* TLB Optimizations
- Support for 1G pages
- 48-bit physical address
- Larger TLBs key for:
Virtualized workloads
Large-footprint databases and transaction processing
- DTLB:
48-entry fully-associative TLB (4K, 2M, 1G pages)
Backed by L2 TLBs: 512 x 4K, 128 x 2M
- ITLB:
16 x 2M entries
* Data-dependent divide latency
* More Fastpath instructions
- CALL and RET-Imm instructions
- Data movement between FP & INT
* Bit Manipulation extensions
- LZCNT/POPCNT
* SSE extensions
- EXTRQ/INSERTQ
- MOVNTSD/MOVNTSS
* Independent DRAM controllers
- Concurrency
- More DRAM banks reduce page conflicts
- Longer burst length improves command efficiency
* Optimized DRAM paging
- Increase page hits
- Decrease page conflicts
* History-based pattern predictor
* Re-architected Northbridge (NB) for higher bandwidth
- Increase buffer sizes
- Optimize schedulers
- Ready to support future DRAM technologies
* Write bursting
- Minimize Rd/Wr Turnaround
* DRAM prefetcher
- Track positive and negative, unit and non-unit strides
- Dedicated buffer for prefetched data
- Aggressively fill idle DRAM cycles
* Core prefetchers
- DC Prefetcher fills directly to L1 Cache
- IC Prefetcher more flexible
2 outstanding requests to any address
* Shared L3
- Victim-cache architecture maximizes efficiency of cache hierarchy
- Fills from L3 leave likely shared lines in the L3
- Sharing-aware replacement policy