* Comprehensive Upgrades for SSE
- Dual 128-bit SSE dataflow
- Up to 4 double-precision FP ops/cycle
- Dual 128-bit loads per cycle
- Can perform SSE MOVs in the FP “store” pipe
- Execute two generic SSE ops + SSE MOV each cycle (+ two 128-bit SSE loads)
- FP Scheduler can hold 36 dedicated 128-bit ops
- SSE Unaligned Load-Execute mode
Remove alignment requirements for SSE ld-op instructions
Eliminates awkward pairs of separate load and compute instructions, improving instruction packing and decoding efficiency
* Advanced branch prediction
- Dedicated 512-entry Indirect Predictor
- Doubled return-stack size
- More branch history bits and improved branch hashing
* 32B instruction fetch
- Benefits integer code too
- Reduced split-fetch instruction cases
* Sideband Stack Optimizer
- Perform stack adjustments for PUSH/POP operations “on the side”
- Stack adjustments don’t occupy functional unit bandwidth
- Breaks serial dependence chains for consecutive PUSH/POPs
* Out-of-order load execution
- New technology allows load instructions to bypass:
Other loads
Other stores which are known not to alias with the load
- Significantly mitigates L2 cache latency
* TLB Optimizations
- Support for 1G pages
- 48-bit physical address
- Larger TLBs key for:
Virtualized workloads
Large-footprint databases and transaction processing
- DTLB:
48-entry fully-associative TLB (4K, 2M, 1G pages)
Backed by L2 TLBs: 512 x 4K, 128 x 2M
- ITLB:
16 x 2M entries
* Data-dependent divide latency
* More Fastpath instructions
- CALL and RET-Imm instructions
- Data movement between FP & INT
* Bit Manipulation extensions
- LZCNT/POPCNT
* SSE extensions
- EXTRQ/INSERTQ
- MOVNTSD/MOVNTSS
* Independent DRAM controllers
- Concurrency
- More DRAM banks reduce page conflicts
- Longer burst length improves command efficiency
* Optimized DRAM paging
- Increase page hits
- Decrease page conflicts
* History-based pattern predictor
* Re-architected Northbridge (NB) for higher bandwidth
- Increase buffer sizes
- Optimize schedulers
- Ready to support future DRAM technologies
* Write bursting
- Minimize Rd/Wr Turnaround
* DRAM prefetcher
- Track positive and negative, unit and non-unit strides
- Dedicated buffer for prefetched data
- Aggressively fill idle DRAM cycles
* Core prefetchers
- DC Prefetcher fills directly to L1 Cache
- IC Prefetcher more flexible
2 outstanding requests to any address
* Shared L3
- Victim-cache architecture maximizes efficiency of cache hierarchy
- Fills from L3 leave likely shared lines in the L3
- Sharing-aware replacement policy