Macro/Micro architectural imporvements of K8L(K10) over K8

Printable View

Show 100 post(s) from this thread on one page

02-09-2007, 02:37 AM
gOJDO

Macro/Micro architectural imporvements of K8L(K10) over K8

Quad-core
- Native quad-core design
- Redesigned and improved crossbar(northbridge)
- Improved power management
- New level of cache added, L3 VICTIM
Power management - DICE(Dynamic Independent Core Engagement)
- Supports separate CPU core and memory controller power planes to allow CPU to lower its power state while the memory controller is running full bore
- Enhanced AMD's PowerNow allows individual core frequencies to lower while other cores may be running full bore
Virtualization improvements
- Nested Paging(NP):
* Guest and Host page tables both exist in memory.(The CPU walks both page tables)
* Nested walk can have up to 24 memory acesses! (Hardware caching accelerates the walk)
* "Wire-to-wire" translations are cached in TLBs
* NP eliminates Hypervisor cycles spent managing shadow pages(As much as 75% Hypervisor time)
- Reduced world-switch time by 25%:
* World-switch time: round-trup to Hypervisor and back
Dedicated L1 cache
- 256bit 64kB (32kB instruction/32kB data)
- 2 x 128bit loads/cycle
- lowest latency
Dedicated L2 cache
- 128bit 512kB
- 128bit bus to northbridge
- reduced latency
- eliminates conflicts common in shared caches - better for virtualization
Shared L3 cache
- 128bit 2MB
- Victim-cache architecture maximizes efficiency of cache hierarchy
- Fills from L3 leave likely shared lines in the L3
- Sharing-aware replacement policy
- Expandable
Independent DRAM controllers
- Concurrency
- More DRAM banks reduces page conflicts
- Longer burst length improves command efficiency
- Dual channel unbuffered 1066 support(applies to socket AM2+ and s1207+ QFX only)
- Channel Interleaving
Optimized DRAM paging
- Increase page hits
- Decrease page conflicts
Re-architect northbridge for higher bandwidth
- Increase buffer sizes
- Optimize schedulers
- Ready to support future DRAM technologies
Write bursting
- Minimize Rd/Wr Turnaround
DRAM prefetcher
- Track positive and negative, unit and non-unit strides
- Dedicated buffer for prefetched data
- Aggressively fill idle DRAM cycles
Core prefetchers
- DC Prefetcher fills directly to L1 Cache
- IC Prefetcher more flexible
* 2 outstanding requests to any address
HyperTransport 3
- up to four 16bit cHT links
- up to 5200MT/s per link
- un-ganging mode: each 16bit HT link can be divided in two 8bit virutal links

CPU Core IPC Enhancements:
Advanced branch prediction
- Dedicated 512-entry Indirect Predictor
- Double return stacksize
- More branch history bits and improved branch hashing
History-based pattern predictor
32B instruction fetch
- Benefits integer code too
- Reduced split-fetch instruction cases
Sideband Stack Optimizer
- Perform stack adjustments for PUSH/POP operations “on the side”
- Stack adjustments don’t occupy functional unit bandwidth
- Breaks serial dependence chains for consecutive PUSH/POPs
Out-of-order load execution
- New technology allows load instructions to bypass:
* Other loads
* Other stores which are known not to alias with the load
- Significantly mitigates L2 cache latency
TLB Optimisations
- Support for 1G pages
- 48bit physical address (256TB)
- Larger TLBs key for:
* Virtualized workloads
* Large-footprint databases and
* transaction processing
- DTLB:
* Fully-associative 48-way TLB (4K, 2M, 1G)
* Backed by L2 TLBs: 512 x 4K, 128 x 2M
- ITLB:
* 16 x 2M entries
Data-dependent divide latency
Additional fastpath instructions
– CALL and RET-Imm instructions
– Data movement between FP & INT
Bit Manipulation extensions
- LZCNT/POPCNT
SSE extensions
- EXTRQ/INSERTQ (SSE4A)
- MOVNTSD/MOVNTSS (SSE4A)
- MWAIT/MONITOR (SSE3)
Comprehensive Upgrades for SSE
- Dual 128-bit SSE dataflow
- Up to 4 dual precision FP OPS/cycle
- Dual 128-bit loads per cycle
- New vector code, SSE128
- Can perform SSE MOVs in the FP “store” pipe
- Execute two generic SSE ops + SSE MOV each cycle (+ two 128-bit SSE loads)
- FP Scheduler can hold 36 Dedicated x 128-bit ops
- SSE Unaligned Load-Execute mode:
* Remove alignment requirements for SSE ld-op instructions
* Eliminate awkward pairs of separate load and compute instructions
* To improve instruction packing and decoding efficiency

K8L(K10) large die Shot

Source: AMD slides, and other sources from Internet

P.S. Any additional data or informations will be highly appreciated
02-09-2007, 02:50 AM
NinjaWreck

A source would be nice.
02-09-2007, 02:55 AM
Ubermann

A link would be nice
02-09-2007, 03:04 AM
krille

I thought we had just come to realize that K8L doesn't exist, or rather has been with us for quite some time (K8 Low-Power, that is Turion 64). Instead the AMD dual-cores (X2) were K9 and what we used to refer to as K8L is in reality the K10. (According to Inq it was their own Charlie who made up the K8L name.)

(Thread here.)

Yes, and source, please?

~ Kris
02-09-2007, 03:53 AM
Carfax

Quote:

Originally Posted by LOE

gOJDO - you got no info on L1 and L2 cache interface - I heard it is going to be twice as wide as K8

As far as I know, the K10's L1 and L2 cache interface will be 256bits, but dual split like the K8.

In other words, 2x128. Right now, the K8 has a 128 bit cache bus, which is dual split (2x64).

The C2D just has one big 256bit wide interface.

As for which implementation is better, I don't know.

*Edit* Gojdo just added some more stuff. According to the updated info, the L2 cache is only 128-bits wide! But the L1 cache is 256-bits? WTH?
02-09-2007, 04:02 AM
flippin_waffles

Sounds impressive if true. I have no idea what it all means though. heh
On the other hand, gOJDO might be just trying to make a point that without a source, it doesn't mean it's true.
02-09-2007, 05:15 AM
doompc

L1 cache size remains the same, 64KB for Instructions and 64KB for Data.

Yes, L1 bus is 2x128bit and L2 is 1x128bit.
02-09-2007, 05:28 AM
Carfax

Quote:

Originally Posted by doompc

L1 cache size remains the same, 64KB for Instructions and 64KB for Data.

Yes, L1 bus is 2x128bit and L2 is 1x128bit.

You'd think they would have expanded the bus to full 256 bit so as not to starve the execution engines..

Then again, I'm no engineer and there is probably a good reason for them choosing to stay with a 128-bit data bus on the L2 cache.
02-09-2007, 05:29 AM
gOJDO

Quote:

Originally Posted by Carfax

*Edit* Gojdo just added some more stuff. According to the updated info, the L2 cache is only 128-bits wide! But the L1 cache is 256-bits? WTH?

No info about 256bit L2. Most probably is 128bit. The bus to the crossbar is 128bit, thus the access to the L3 is 128bit also. Because the L3 is going to be used as exclusive, most likely the L2 is 128bit.
Can anyone confirm this?
02-09-2007, 06:34 AM
krille

Quote:

Originally Posted by gOJDO

Source: Internet

LOL!
02-09-2007, 06:50 AM
Shadowmage

Reduced latency L1???? Does that mean it's 2 cycles now? I don't really believe this.

Also, for the D-TLB, do you mean 48-entry?

- DTLB:
* Fully-associative 48-way TLB (4K, 2M, 1G)
02-09-2007, 06:51 AM
largon

AMD Analyst Day slides would seem to point to 128bit data paths for L2.

Quote:

Originally Posted by Shadowmage

Also gODO just took it from Ace's Hardware:

http://www.aceshardware.com/forums/r...5159&forumid=1

Quote:

Originally Posted by Carfax on Ace's

This is something I saw on Xtremesys. The guy who posted this, did not post a source, but if he does, I'll edit this post and include it.

;)
02-09-2007, 07:02 AM
doompc

The path to the northbridge has been widened to 128 bit.
Athlons have 2 individual L1 buses, both has been widened to 128 bit. L1 must have 256 bit bus in order to allow 32Byte Instruction Fetch.

L2 seems to have not changed, since it is exclusive victim cache and the CPU loads data from memory to L1 cache, it's ok here.

No word about L3 cache, but it must be managed by the crossbar, with agressive data prefetch during idle cycles and a large WriteBack bufer.
AMD said that the L3 cache will not be exclusive nor inclusive, beeing a shared victim cache it will be a bit of both.
02-09-2007, 07:03 AM
Shadowmage

Quote:

Originally Posted by largon

;)

That's why I edited that out when I saw that :p
02-09-2007, 07:18 AM
oldblue

Quote:

Originally Posted by gOJDO

Source: Internet

As far as I can tell, most of this is from Dresdenboy's list at SI.

Ironically, DDB posted a link to it at Ace's yesterday, and it was mostly ignored. Then gOJDO posted it here. Then Carfax saw it here and posted it on Ace's again, which started a big discussion.

It's all in the formatting. :)
02-09-2007, 07:47 AM
informal

Oldblue is right,the majority of this info was posted by DDB on SI board,then he linked it to his post at Ace's ,which gojdo copy/pasted here without crediting DDB.
Oh well..
02-09-2007, 07:57 AM
nn_step

you forgot the migration bus for the IMC
02-09-2007, 08:03 AM
Carfax

Quote:

Originally Posted by doompc

The path to the northbridge has been widened to 128 bit.
Athlons have 2 individual L1 buses, both has been widened to 128 bit. L1 must have 256 bit bus in order to allow 32Byte Instruction Fetch.

L2 seems to have not changed, since it is exclusive victim cache and the CPU loads data from memory to L1 cache, it's ok here.

No word about L3 cache, but it must be managed by the crossbar, with agressive data prefetch during idle cycles and a large WriteBack bufer.
AMD said that the L3 cache will not be exclusive nor inclusive, beeing a shared victim cache it will be a bit of both.

Great explanation DoomPC.. You sound like a man who knows what he's talking about :toast:
02-09-2007, 08:07 AM
gOJDO

@oldblue
Most of Dresdenboy's list at SI is from AMD slides:
http://img266.imageshack.us/img266/6...celona1fa6.jpg
http://img255.imageshack.us/img255/7...celona2gm6.jpg
http://img255.imageshack.us/img255/4...celona3gh6.jpg
http://img255.imageshack.us/img255/1...celona4zr0.jpg
http://img253.imageshack.us/img253/7...celona5uu5.jpg
http://img253.imageshack.us/img253/8...celona6gs8.jpg

In the list there is data from other sources also, that I have collected from internet and saved on HD. If it is very important for someone, I can find out the links of the sources and post here. My goal is to make a list of K8L(K10) vs K8 differences and improvements. Thats how people will be able to find all informations about K8L(K10) Macro/micro architecture on one place, instead googling and opening tens of useless documents. So, if anyone can add something to the list, will be appreciated.

Thank you.
02-09-2007, 08:24 AM
oldblue

Quote:

Originally Posted by gOJDO

@oldblue
Most of Dresdenboy's list at SI is from AMD slides:

I'm not questioning the veracity of most of this information, and I think this list is useful. I hope that as people update it, they list their sources. Hyperlinking each entry would help with this. Otherwise it will become difficult to separate internet rumor from AMD-supplied information.

But most of your list was clearly copied and pasted from Dresdenboy's list, and I think he should get some credit for that.
02-09-2007, 08:34 AM
gOJDO

Quote:

Originally Posted by oldblue

I'm not questioning the veracity of most of this information, and I think this list is useful. I hope that as people update it, they list their sources. Hyperlinking each entry would help with this. Otherwise it will become difficult to separate internet rumor from AMD-supplied information.

But most of your list was clearly copied and pasted from Dresdenboy's list, and I think he should get some credit for that.

Yes, I copy/pasted from there, rearranged the list and added other info.
Anyway, 100% of what I copy pasted from Dresdenboy's list is copy pasted from the AMD slides that I already have on my HD. Also, there is some info from metro.cl from chilehardware, hypertransport.org, other AMD slides. So, credits goes to all of them.
02-09-2007, 10:04 AM
Revv23

wow sounds like a monster.

good post.
02-09-2007, 10:31 AM
biohead

yeah gOJDO thanks for the effort :thumbsup: :)
02-09-2007, 10:42 AM
doompc

Thank you, Carfax. I'm still learning all this stuff. But it does look good.
Stephen said K10 is about 10% faster than Core2 in general use, that's what I was specting.

Internaly both look quite similar. 4x DP FLOP / 2x 128 bit SSE per cycle, etc. They are very diferent design but do the same work. Then the IMC will it's job. Point for AMD.
For quad-core parts the monolitic design plus the L3 cache may speed things up significantly. Another point for AMD.
And 2P, 4P or even 8P (fully connected by 8 bit HTT3.0 links) systems will perform so strong, that Sun might go back in it's Intel partnership. :D

But Intel has better manufaturing process, it may rule on desktop systems due to higher overclocking rates, and in 2P servers due to lower power consumption.

gOJDO, can you tell me where to get that presentation ?

Show 100 post(s) from this thread on one page