Thread: Macro/Micro architectural improvements of K8L (K10) over K8

  #11
    Quote Originally Posted by vitaminc
    He should not need to, as DDB is merely posting second-hand information.

    EDIT: if DDB wrote an article about it without posting slides, then yes, Gojdo should credit him. No one but AMD should receive credit for posting slides online.
    To be clear, Dresdenboy did not post slides. He compiled and posted this list yesterday:

    Quote Originally Posted by DDB_WO
    * Comprehensive Upgrades for SSE
    - Dual 128-bit SSE dataflow
    - Up to 4 double-precision FP ops/cycle
    - Dual 128-bit loads per cycle
    - Can perform SSE MOVs in the FP “store” pipe
    - Execute two generic SSE ops + SSE MOV each cycle (+ two 128-bit SSE loads)
    - FP scheduler can hold 36 dedicated 128-bit ops
    - SSE Unaligned Load-Execute mode
    Remove alignment requirements for SSE ld-op instructions
    Eliminate awkward pairs of separate load and compute instructions
    Improve instruction packing and decoding efficiency
    * Advanced branch prediction
    - Dedicated 512-entry Indirect Predictor
    - Doubled return stack size
    - More branch history bits and improved branch hashing
    * 32B instruction fetch
    - Benefits integer code too
    - Reduced split-fetch instruction cases
    * Sideband Stack Optimizer
    - Perform stack adjustments for PUSH/POP operations “on the side”
    - Stack adjustments don’t occupy functional unit bandwidth
    - Breaks serial dependence chains for consecutive PUSH/POPs
    * Out-of-order load execution
    - New technology allows load instructions to bypass:
    Other loads
    Other stores which are known not to alias with the load
    - Significantly mitigates L2 cache latency
    * TLB Optimizations
    - Support for 1G pages
    - 48-bit physical address
    - Larger TLBs key for:
    Virtualized workloads
    Large-footprint databases and transaction processing
    - DTLB:
    Fully-associative 48-way TLB (4K, 2M, 1G)
    Backed by L2 TLBs: 512 x 4K, 128 x 2M
    - ITLB:
    16 x 2M entries
    * Data-dependent divide latency
    * More Fastpath instructions
    - CALL and RET-Imm instructions
    - Data movement between FP & INT
    * Bit Manipulation extensions
    - LZCNT/POPCNT
    * SSE extensions
    - EXTRQ/INSERTQ
    - MOVNTSD/MOVNTSS
    * Independent DRAM controllers
    - Concurrency
    - More DRAM banks reduce page conflicts
    - Longer burst length improves command efficiency
    * Optimized DRAM paging
    - Increase page hits
    - Decrease page conflicts
    * History-based pattern predictor
    * Re-architect NB for higher BW
    - Increase buffer sizes
    - Optimize schedulers
    - Ready to support future DRAM technologies
    * Write bursting
    - Minimize Rd/Wr Turnaround
    * DRAM prefetcher
    - Track positive and negative, unit and non-unit strides
    - Dedicated buffer for prefetched data
    - Aggressively fill idle DRAM cycles
    * Core prefetchers
    - DC Prefetcher fills directly to L1 Cache
    - IC Prefetcher more flexible
    2 outstanding requests to any address
    * Shared L3
    - Victim-cache architecture maximizes efficiency of cache hierarchy
    - Fills from L3 leave likely shared lines in the L3
    - Sharing-aware replacement policy
    Last edited by oldblue; 02-09-2007 at 11:21 AM.
