
Thread: Macro/Micro architectural improvements of K8L(K10) over K8

  1. #26
    Xtreme Member
    Join Date
    Jul 2006
    Posts
    146
    Quote Originally Posted by vitaminc
    He should not need to, as DDB is merely posting second-hand information.

    EDIT, if DDB wrote an article about it without posting slides, then yes, Gojdo should credit him. No one but AMD should receive credit for posting slides online.
    To be clear, Dresdenboy did not post slides. He compiled and posted this list yesterday:

    Quote Originally Posted by DDB_WO
    * Comprehensive Upgrades for SSE
      - Dual 128-bit SSE dataflow
      - Up to 4 double-precision FP OPS/cycle
      - Dual 128-bit loads per cycle
      - Can perform SSE MOVs in the FP “store” pipe
      - Execute two generic SSE ops + SSE MOV each cycle (+ two 128-bit SSE loads)
      - FP Scheduler can hold 36 Dedicated x 128-bit ops
      - SSE Unaligned Load-Execute mode
          Remove alignment requirements for SSE ld-op instructions
          Eliminate awkward pairs of separate load and compute instructions
          To improve instruction packing and decoding efficiency
    * Advanced branch prediction
      - Dedicated 512-entry Indirect Predictor
      - Double return stack size
      - More branch history bits and improved branch hashing
    * 32B instruction fetch
      - Benefits integer code too
      - Reduced split-fetch instruction cases
    * Sideband Stack Optimizer
      - Perform stack adjustments for PUSH/POP operations “on the side”
      - Stack adjustments don’t occupy functional unit bandwidth
      - Breaks serial dependence chains for consecutive PUSH/POPs
    * Out-of-order load execution
      - New technology allows load instructions to bypass:
          Other loads
          Other stores which are known not to alias with the load
      - Significantly mitigates L2 cache latency
    * TLB Optimisations
      - Support for 1G pages
      - 48bit physical address
      - Larger TLBs key for:
          Virtualized workloads
          Large-footprint databases and transaction processing
      - DTLB:
          Fully-associative 48-way TLB (4K, 2M, 1G)
          Backed by L2 TLBs: 512 x 4K, 128 x 2M
      - ITLB:
          16 x 2M entries
    * Data-dependent divide latency
    * More Fastpath instructions
      - CALL and RET-Imm instructions
      - Data movement between FP & INT
    * Bit Manipulation extensions
      - LZCNT/POPCNT
    * SSE extensions
      - EXTRQ/INSERTQ
      - MOVNTSD/MOVNTSS
    * Independent DRAM controllers
      - Concurrency
      - More DRAM banks reduces page conflicts
      - Longer burst length improves command efficiency
    * Optimized DRAM paging
      - Increase page hits
      - Decrease page conflicts
    * History-based pattern predictor
    * Re-architect NB for higher BW
      - Increase buffer sizes
      - Optimize schedulers
      - Ready to support future DRAM technologies
    * Write bursting
      - Minimize Rd/Wr Turnaround
    * DRAM prefetcher
      - Track positive and negative, unit and non-unit strides
      - Dedicated buffer for prefetched data
      - Aggressively fill idle DRAM cycles
    * Core prefetchers
      - DC Prefetcher fills directly to L1 Cache
      - IC Prefetcher more flexible
          2 outstanding requests to any address
    * Shared L3
      - Victim-cache architecture maximizes efficiency of cache hierarchy
      - Fills from L3 leave likely shared lines in the L3
      - Sharing-aware replacement policy
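
    To put the TLB entries of that list in perspective, here is a rough back-of-the-envelope sketch in Python of TLB "reach" (entries times page size). It only uses the entry counts quoted in the list; the helper name `reach_bytes` is made up for illustration, and it assumes every entry maps a distinct page of the stated size, which is a simplification and not something the slides state.

```python
# Back-of-the-envelope TLB "reach": how much memory one TLB level can map
# without a miss, ASSUMING every entry maps a distinct page.
# Entry counts are taken from the quoted list; the rest is illustrative.

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def reach_bytes(entries: int, page_size: int) -> int:
    """Memory covered when all `entries` hold pages of `page_size` bytes."""
    return entries * page_size


# L1 DTLB: 48 fully associative entries, listed for 4K, 2M and 1G pages.
l1_dtlb_4k = reach_bytes(48, 4 * KIB)
l1_dtlb_2m = reach_bytes(48, 2 * MIB)
l1_dtlb_1g = reach_bytes(48, 1 * GIB)

# L2 TLBs: 512 entries for 4K pages, 128 entries for 2M pages.
l2_tlb_4k = reach_bytes(512, 4 * KIB)
l2_tlb_2m = reach_bytes(128, 2 * MIB)

# ITLB: 16 entries for 2M pages.
itlb_2m = reach_bytes(16, 2 * MIB)

if __name__ == "__main__":
    print(f"L1 DTLB, 4K pages: {l1_dtlb_4k // KIB} KiB")
    print(f"L1 DTLB, 2M pages: {l1_dtlb_2m // MIB} MiB")
    print(f"L1 DTLB, 1G pages: {l1_dtlb_1g // GIB} GiB")
    print(f"L2 TLB,  4K pages: {l2_tlb_4k // MIB} MiB")
    print(f"L2 TLB,  2M pages: {l2_tlb_2m // MIB} MiB")
    print(f"ITLB,    2M pages: {itlb_2m // MIB} MiB")
```

    With 4K pages the L2 TLB covers only 2 MiB, while the same structure's 2M entries cover 256 MiB, which fits the slide's point about large-footprint workloads.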
    Last edited by oldblue; 02-09-2007 at 11:21 AM.

  2. #27
    Xtreme Mentor
    Join Date
    Aug 2005
    Location
    Boston, MA, USA
    Posts
    2,883
    I stand by my prediction.

    Per core at the same clock speed, for real-world integer programs that are not unrealistically memory starved (i.e. no SuperPi, and not using decompression as a benchmark), this will bring 5%. Let's take compilation (of C/C++ programs) as the benchmark when we want to check on this later.

  3. #28
    Banned
    Join Date
    May 2006
    Location
    Skopje, Macedonia
    Posts
    1,716
    Quote Originally Posted by doompc
    gOJDO, can you tell me where to get that presentation?
    I don't remember the site (Korean) and I can't find them by googling, but I'll upload the rest of the slide images:

    [slide images not preserved]

    @oldblue
    As you can see, every word in DDB's list is copy-pasted from the AMD slides, and I have had the slides since long before yesterday. DDB just gave me the idea to make a list (like he did) with all the information about K8L (K10) improvements, but I didn't find any additional info in his list beyond what I already had.
    Last edited by gOJDO; 02-09-2007 at 11:57 AM.

  4. #29
    Xtreme Member
    Join Date
    Jul 2006
    Posts
    146
    Quote Originally Posted by doompc
    gOJDO, can you tell me where to get that presentation?
    Quote Originally Posted by gOJDO
    I don't remember the site (Korean) and I can't find them by googling
    I believe they're from a Japanese site. Hiroshige Goto posted the slides last October.

  5. #30
    Xtreme Enthusiast
    Join Date
    Apr 2006
    Location
    Brasil
    Posts
    534


    Any details about this?
    Dual Independent Single Channel Memory Controllers would be great, but I don't think so...
    Last edited by doompc; 02-09-2007 at 12:26 PM.

  6. #31
    Xtreme Guru
    Join Date
    Jan 2005
    Location
    Tre, Suomi Finland
    Posts
    3,858
    "Independent MCs" means QC parts have one 64-bit controller dedicated to each pair of cores. K8 of course also has two 64-bit controllers, but the difference is that they're in ganged mode.

    2x128-bit naturally can't be done without adding about a hundred extra socket pins.
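
    A toy sketch of the ganged vs. independent ("unganged") distinction, in Python. It assumes 64-byte cache lines and only counts data lines; the function and dictionary names are hypothetical, and real controllers involve many more signals and timing details than this models.

```python
# Toy comparison of DRAM channel configurations.
# ASSUMPTIONS (illustrative only): 64-byte cache lines, and only data
# lines are counted (no strobes, address or command signals).

CACHE_LINE_BYTES = 64


def transfers_per_line(channel_width_bits: int) -> int:
    """Data transfers (beats) needed to move one cache line over a channel."""
    bytes_per_transfer = channel_width_bits // 8
    return CACHE_LINE_BYTES // bytes_per_transfer


def describe(mode: str, channels: int, width_bits: int) -> dict:
    """Summarize one configuration of `channels` channels of `width_bits`."""
    return {
        "mode": mode,
        "concurrent_requests": channels,      # one request per channel
        "burst_length": transfers_per_line(width_bits),
        "total_data_lines": channels * width_bits,
    }


ganged = describe("ganged 1x128", channels=1, width_bits=128)
unganged = describe("independent 2x64", channels=2, width_bits=64)
two_by_128 = describe("hypothetical 2x128", channels=2, width_bits=128)

if __name__ == "__main__":
    for cfg in (ganged, unganged, two_by_128):
        print(cfg)
```

    Under these assumptions the ganged and independent setups use the same 128 data lines, but the independent one can serve two requests at once with a longer burst per line, while a 2x128 setup would need 128 more data lines.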
    You were not supposed to see this.

  7. #32
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    Quote Originally Posted by largon
    "Independent MCs" means QC parts have one 64-bit controller dedicated to each pair of cores. K8 of course also has two 64-bit controllers, but the difference is that they're in ganged mode.

    2x128-bit naturally can't be done without adding about a hundred extra socket pins.
    Actually, there are more than enough pins for that on Socket F, but not enough on Socket AM2.
    Fast computers breed slow, lazy programmers
    The price of reliability is the pursuit of the utmost simplicity. It is a price which the very rich find most hard to pay.
    http://www.lighterra.com/papers/modernmicroprocessors/
    Modern Ram, makes an old overclocker miss BH-5 and the fun it was

  8. #33
    Xtreme Addict
    Join Date
    Jul 2004
    Location
    U.S of freakin' A
    Posts
    1,931
    Quote Originally Posted by brentpresley
    Where this CPU REALLY has potential to shine are multimedia apps that have been optimized for SSE/2/3/4a and have predictable memory access patterns. The L3 cache in that situation, on top of the double-wide SSE units should make this chip untouchable for C2D in encoding/encrypting/etc.

    On the flipside, since K8L/K10 is still a 3-decoder core that doesn't look like it implements anything similar to micro-ops fusion, C2D will probably maintain a slim lead there (2-5% or so in 32-bit code, and that may be reversed in 64-bit code since C2D doesn't implement fusion in 64-bit).

    Just speculation on my part. Flame me if you will.
    C2D has higher theoretical SIMD throughput, but the keyword is theoretical. I think C2D can issue a max of 6 SSE instructions per cycle, but it's very hard to do so.

    It will be very interesting to see which chip has the lead in vectorized apps..

    Like you, I'm thinking that C2D will have an edge, but not much of one..

    What's even more interesting will be Barcelona's gaming performance.

    Games seem to favor integer performance over FP, and C2D is the master of integer so......

    Either way, we the consumers will win!

  9. #34
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Thanks for all the hype, guys

    My original source was these slides:
    http://public.cranfield.ac.uk/SOE_te...don_071206.pdf
    They also contain a lot of other interesting stuff.

    I compiled this list to answer a posting from someone who seems to deny any IPC improvements in K10's core except for some FP-related stuff. Throwing a huge list at him seemed to be a better argument than just saying "It has improvements". This was a good motivation. And don't forget my comment:
    Quote Originally Posted by DDB
    However, regarding IPC.. Have a look at this list and think about the design efforts and if they'd be worth it, if most of the more general changes wouldn't cause IPC improvements of ~1% or more per core modification?
    This thought was my reason for assuming that we'll see significant IPC improvements on any code (INT/FP) over K8.
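
    The "~1% per core modification" argument can be sketched as simple compounding. This is only an illustration in Python: it assumes the individual gains are independent and multiply, which real microarchitectural changes need not do, and the counts of 5, 10 and 20 modifications are hypothetical, not taken from the slides.

```python
# Compounding small per-change IPC gains.
# ASSUMPTION (illustrative): each change gives the same gain and the
# gains multiply independently. Real changes overlap and interact.


def compound_gain(per_change_gain: float, changes: int) -> float:
    """Total relative gain after `changes` multiplicative improvements."""
    return (1.0 + per_change_gain) ** changes - 1.0


if __name__ == "__main__":
    for n in (5, 10, 20):  # hypothetical numbers of effective changes
        total = compound_gain(0.01, n)
        print(f"{n:2d} changes at 1% each -> {total * 100:.1f}% total")
```

    Under these assumptions, ten changes at 1% each give a bit over 10% in total, and twenty give roughly 22%.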

    The L1 sizes are still 64kB as confirmed by AMD:
    Quote Originally Posted by Johan@aceshardware
    At our last phone call, Damon Muzny repeated this at least 3 times that the figure with Data cache might have confused a lot of people: but the L1 is still 64KB D + 64 KB I, just like it was before.
    http://aceshardware.com/forums/read_...8309&forumid=1
    This was discussed months ago, and BTW the die plots have never shown smaller L1 caches.

    The clock gating stuff, now officially confirmed by AMD, first appeared in Yager's heavily criticized article on computerworld. My take on this was:
    Quote Originally Posted by DDB
    First I didn't read the full article (because it sounded like simple PR), but then I saw this sentence quoted above. If that is true, then (and probably together with an Intelish typ. TDP) AMD would really yield more sellable parts and could maybe even offer HE variants right from start.

    I wrote about AMD's patents regarding such fine grained power saving a while ago. Interestingly the 2 inventors (Filippo, Pickett) involved in power saving stuff are well known names on many other patents (like Pat. No. 7,165,167, where you'll find both together with Ben "Barcelona Man" Sander), which fit nicely to Barcelona's new features. So for the interested ones:

    Pat. No. 6,826,704 - Microprocessor employing a performance throttling mechanism for power management
    Pat. No. 6,983,389 - Clock control of functional units in an integrated circuit based on monitoring unit signals to predict inactivity
    Pat. No. 6,976,182 - Apparatus and method for decreasing power consumption in an integrated circuit

    And again, the filing dates of most of these patents (2002/2003) should make clear that Barcelona is not an answer to Intel's new architecture. People who know the design lifecycle of such a highly integrated CPU never thought so.
    (http://www.siliconinvestor.com/readm...msgid=23261707)

    If AMD is really going to use Intel-style TDP numbers (based on measurements while running some "power virus" software), then, since it has these HW-based p-state transitions, clock gating and maybe some form of throttling, this may increase Barcelona's yield: AMD can sell parts that will never break the given TDP but that, without this power-related stuff, would have exceeded the currently used TDP "thermal envelope".

    AMD's patents show a lot of interesting details regarding K10, like clock gating, details of a 128 bit wide operating FPU (Pat. No. 6,944,744), the new memory controller (e.g. patents about reducing page conflicts), zonal monitoring of temperature on the die, HW based power state transitions, thermal throttling etc.

    So it's never bad to have a look at them. BTW, some of the patents by guys, who seem to be K10 architects, are trace cache related.
    Last edited by Dresdenboy; 02-14-2007 at 07:27 AM.

  10. #35
    Banned
    Join Date
    May 2006
    Location
    Skopje, Macedonia
    Posts
    1,716
    @Dresdenboy
    Thanks for your input and info!
    I am trying to put together a good collection of information about K8L (K10):
    http://forumz.tomshardware.com/hardw...501779#1501779

    I would appreciate it if you could add something to the list.

  11. #36
    XS News
    Join Date
    Aug 2004
    Location
    Sweden
    Posts
    2,010
    Dresdenboy.. plz stay here =)

  12. #37
    Xtreme Mentor
    Join Date
    Apr 2005
    Posts
    2,550

  13. #38
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Some more thoughts on IPC:

    What this big list shows are things we'll see coming with K10. But it's difficult to estimate the effect on different kinds of code, especially if the situation on K8 wasn't close to being optimal. For example, the K8 FPU can't do 128-bit SSE loads in the FMUL or FADD unit, while it can do 64-bit loads in any unit. This limited capability turned out to be an important bottleneck for Prime95 code (see http://mersenneforum.org/showthread.php?t=1072 and http://mersenneforum.org/showthread.php?t=2362#15).

    I'm not sure right now whether the FP scheduler can select any MacroOp from any lane to send to the subunits, or whether there is a kind of link to certain subunits, as there is for the INT units. This could mean that the decoders produce triples of MacroOps with only one slot used (e.g. if the ops can only go to FMISC). This would leave a lot of FP scheduler slots unused.

    However, if such limitations (another one is that SSE register moves were only handled by FADD and FMUL) are no longer present, the effect on performance could theoretically reach levels above that of simply widened units and datapaths.

    And we should also not forget that 3 MacroOps in K8 already mean up to 6 µOps (practically up to 3 ALU/FP ops and 2 AGU ops). So what seems to be a max of 3 µOps is actually a max of 5 µOps (even 6 would be possible, but this might not be reachable). This might lead to cases (e.g. complex address ops with more than one dependency, multiple complex ops or RMW-ops) where the decoding rate of K8 and K8L might be on par with or even higher than that of C2D.
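
    The MacroOp arithmetic above can be written out as a small Python sketch. It is only a counting model of what the post describes (each MacroOp carries one ALU/FP op plus, optionally, one AGU op); the function name and the example groups are hypothetical and say nothing about real scheduling.

```python
# Counting model for one K8 decode group, as described in the post:
# up to 3 MacroOps per cycle, each with one ALU/FP op and optionally
# one AGU (address generation) op. Purely illustrative.
from typing import Sequence

MACRO_OPS_PER_CYCLE = 3


def micro_ops(needs_agu: Sequence[bool]) -> int:
    """µOps in one group; each entry says if that MacroOp needs an AGU op."""
    if len(needs_agu) > MACRO_OPS_PER_CYCLE:
        raise ValueError("a decode group holds at most 3 MacroOps")
    exec_ops = len(needs_agu)        # one ALU/FP op per MacroOp
    agu_ops = sum(needs_agu)         # one AGU op per memory operand
    return exec_ops + agu_ops


register_only = micro_ops([False, False, False])   # 3 ALU/FP ops
practical_peak = micro_ops([True, True, False])    # 3 ALU/FP + 2 AGU
theoretical_peak = micro_ops([True, True, True])   # 3 ALU/FP + 3 AGU

if __name__ == "__main__":
    print(register_only, practical_peak, theoretical_peak)
```

    The three example groups give 3, 5 and 6 µOps, matching the "max of 5, even 6 would be possible" figures in the post.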

    Recently I found a free cycle-exact µArchitecture simulator (http://www.ptlsim.org/), which has been used to model modern CPUs (even with stuff like OOO loads). If time permits, I could try to apply the known architectural changes of K10 to their already existing K8-like setup, enable stuff like OOO loads and L3 cache, and see how it fares. This might be a more accurate estimate than anything else based on our personal theories.

  14. #39
    Xtreme Addict
    Join Date
    May 2004
    Posts
    1,755
    Thanks for the nice bits of info DDB

  15. #40
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    Thanks for the post and the link Dresdenboy

  16. #41
    Xtreme Mentor
    Join Date
    Nov 2005
    Location
    Devon
    Posts
    3,437
    Thanks for your thoughts DDB and I can't wait for analysis using PTLSIM...
    Hope you will have time shortly to play with it!
    PS. CeBIT is coming quickly and we might find out some real performance numbers then.
    RiG1: Ryzen 7 1700 @4.0GHz 1.39V, Asus X370 Prime, G.Skill RipJaws 2x8GB 3200MHz CL14 Samsung B-die, TuL Vega 56 Stock, Samsung SS805 100GB SLC SDD (OS Drive) + 512GB Evo 850 SSD (2nd OS Drive) + 3TB Seagate + 1TB Seagate, BeQuiet PowerZone 1000W

    RiG2: HTPC AMD A10-7850K APU, 2x8GB Kingstone HyperX 2400C12, AsRock FM2A88M Extreme4+, 128GB SSD + 640GB Samsung 7200, LG Blu-ray Recorder, Thermaltake BACH, Hiper 4M880 880W PSU

    SmartPhone Samsung Galaxy S7 EDGE
    XBONE paired with 55'' Samsung LED 3D TV

  17. #42
    Xtreme Member
    Join Date
    Jul 2006
    Posts
    146
    Thanks Dresdenboy. You're probably already aware of this, but Agner Fog's optimization manuals include an independent analysis of the K8 microarchitecture and he suggests some bottlenecks. It's not from AMD, so there might be some mistakes, but it might be a good place to start. It's a pretty good read for anyone interested in the subject.

    It would be interesting to see how these bottlenecks are addressed in Barcelona.

  18. #43
    Xtreme Mentor
    Join Date
    Nov 2005
    Location
    Devon
    Posts
    3,437
    Quote Originally Posted by oldblue
    Thanks Dresdenboy. You're probably already aware of this, but Agner Fog's optimization manuals include an independent analysis of the K8 microarchitecture and he suggests some bottlenecks. It's not from AMD, so there might be some mistakes, but it might be a good place to start. It's a pretty good read for anyone interested in the subject.

    It would be interesting to see how these bottlenecks are addressed in Barcelona.

    Great (but loooong) read! Thanks OLDBLUE! It seems that K10 is addressing most of the bottlenecks of the K8 microarchitecture (from what we know now). It is possible that AMD made more mini improvements we don't know about yet.

    Only time will tell

  19. #44
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    Quote Originally Posted by Lightman
    Great (but loooong) read! Thanks OLDBLUE! It seems that K10 is addressing most of the bottlenecks of the K8 microarchitecture (from what we know now). It is possible that AMD made more mini improvements we don't know about yet.

    Only time will tell
    Of course, because it would be illogical to leave problems and not improve.

  20. #45
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by oldblue
    Thanks Dresdenboy. You're probably already aware of this, but Agner Fog's optimization manuals include an independent analysis of the K8 microarchitecture and he suggests some bottlenecks. It's not from AMD, so there might be some mistakes, but it might be a good place to start. It's a pretty good read for anyone interested in the subject.

    It would be interesting to see how these bottlenecks are addressed in Barcelona.
    Thanks. I was aware of these documents and had even had a look at them a day before you posted this link here.

    Unfortunately, these manuals, based on experiments and collected information, were not available when George Woltman and I tried to find the reasons for the performance of Prime95 on K8 in 2003/04. At least we found some kinds of behaviour which fit nicely with what is described in that manual.

    I think that some of the bottlenecks just derived from the special handling of double-decoded SSE ops, which shouldn't happen anymore with dedicated 128-bit µOps. Just think of a MOVAPD, which fills 2 consecutive FMISC slots in two groups of 3 MacroOps. An SSE instruction which follows this MOVAPD might not fill the 4 free slots in these groups optimally, because it's behind in program order. Such situations might be eased in Barcelona. That's where I think Barcelona will win some performance, not only by increasing throughput but also by better handling of the 128-bit ops.

  21. #46
    Xtreme Member
    Join Date
    Jan 2004
    Posts
    243
    Quote Originally Posted by Carfax
    C2D has higher theoretical SIMD throughput, but the keyword is theoretical. I think C2D can issue a max of 6 SSE instructions per cycle, but it's very hard to do so.

    It will be very interesting to see which chip has the lead in vectorized apps..

    Like you, I'm thinking that C2D will have an edge, but not much of one..

    What's even more interesting will be Barcelona's gaming performance.

    Games seem to favor integer performance over FP, and C2D is the master of integer so......

    Either way, we the consumers will win!
    Uhm.. Games prefer integer ops? Are you serious.. All 3D games REQUIRE FP ops.. Do you know how inaccurate the 3D engines would be if they used int ops? I think you should go and check up on your game engine specifics.. This has to be one of the most uninformed posts I have ever read.

  22. #47
    Xtreme Addict
    Join Date
    Jul 2004
    Location
    U.S of freakin' A
    Posts
    1,931
    Quote Originally Posted by Wrench
    Uhm.. Games prefer integer ops? Are you serious.. All 3D games REQUIRE FP ops.. Do you know how inaccurate the 3D engines would be if they used int ops? I think you should go and check up on your game engine specifics.. This has to be one of the most uninformed posts I have ever read.
    Yes, all games require "FP ops," and the GRAPHICS CARD handles the vast majority of those ops

    Before you criticize my posts, you should get yourself a clue..

  23. #48
    Xtreme Member
    Join Date
    Jan 2004
    Posts
    243
    Quote Originally Posted by Carfax
    Yes, all games require "FP ops," and the GRAPHICS CARD handles the vast majority of those ops

    Before you criticize my posts, you should get yourself a clue..

    Dood, I did a fair bit of programming, and setting up all the geometry is still handled by the CPU, as well as physics and AI. Graphics cards handle things like textures, T&L, vertex shaders, and pixel shaders. Most 3D games require floating-point math, unlike 2D games.


    I don't know about you, but I did a bit of OpenGL programming, and you can set up a cubic environment easily with integers, but as soon as you get into tris it's impossible to have a cohesive geometry with integer calculations. Cubes have straight lines, so it's easy enough to figure that out, but how do you figure out, let's say, the edges of a triangle without the Pythagorean theorem? Sure, there might be a case where you get lucky and figure out one side with ints, but that's gonna be damn hard.
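
    A minimal Python sketch of the triangle-edge point, assuming plain truncating integer math with no fixed-point scaling. The function names are made up for illustration, and this is not how any particular game engine works.

```python
# Edge length via the Pythagorean theorem, float vs. plain integer math.
# ASSUMPTION (illustrative): integers are used unscaled and the square
# root is truncated to a whole number.
import math


def edge_length_float(dx: int, dy: int) -> float:
    """Edge length using floating-point math."""
    return math.hypot(dx, dy)


def edge_length_int(dx: int, dy: int) -> int:
    """Edge length using integer-only math (truncating square root)."""
    return math.isqrt(dx * dx + dy * dy)


if __name__ == "__main__":
    for dx, dy in ((3, 4), (1, 1), (2, 3)):
        f = edge_length_float(dx, dy)
        i = edge_length_int(dx, dy)
        print(f"dx={dx} dy={dy}: float={f:.4f} int={i} error={f - i:.4f}")
```

    The 3-4-5 triangle is the "lucky" case where the integer result is exact; for a 1-by-1 right triangle the integer version returns 1 instead of roughly 1.414.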

    This is why the Northwood was a lame duck in most games compared to the A64.. The A64 had a very strong FPU and the Netburst architecture didn't. Now by your reasoning the Netburst should have whomped the A64, since the Netburst architecture had almost equivalent integer functionality to the A64 (the C2D has a damn good FPU and a short pipeline; since it's based on the P3 FPU, which was a damn good FPU compared to Netburst's, it wipes the floor with the A64). I haven't programmed in about a year and a half, but things haven't changed a hell of a lot.
    Last edited by Wrench; 02-21-2007 at 04:22 AM.

  24. #49
    Xtreme Addict
    Join Date
    Jul 2004
    Location
    U.S of freakin' A
    Posts
    1,931
    Quote Originally Posted by Wrench
    Dood, I did a fair bit of programming, and setting up all the geometry is still handled by the CPU, as well as physics and AI. Graphics cards handle things like textures, T&L, vertex shaders, and pixel shaders. Most 3D games require floating-point math, unlike 2D games.
    I'm not saying games don't use floating point math, just that most game code is integer based. The heavy FP grunt work is done by the GPU, and the CPU probably uses FP for things like physics..

    AI code is Integer though as far as I know.

    One guy once told me that the reason why games use INT more than FP is because historically, x86 CPUs have always had a weak FP unit.

    So game developers were forced to use integer rather than FP..

    Quote Originally Posted by Wrench
    This is why the Northwood was a lame duck in most games compared to the A64.. The A64 had a very strong FPU and the Netburst architecture didn't. Now by your reasoning the Netburst should have whomped the A64, since the Netburst architecture had almost equivalent integer functionality to the A64 (the C2D has a damn good FPU and a short pipeline; since it's based on the P3 FPU, which was a damn good FPU compared to Netburst's, it wipes the floor with the A64). I haven't programmed in about a year and a half, but things haven't changed a hell of a lot.
    Actually, the P4 had a very powerful FPU (SSE2), but it was harder to optimize for due to the peculiarities of the Netburst architecture. Its x87 FPU was weak, but Intel did that on purpose so that developers would be forced to use SSE2.

    Anyway, if your theory is correct, then how do you explain Dothan and Yonah beating the K8 in gaming?

    Dothan and Yonah had much weaker FPUs than the K8, yet they still managed to beat the K8 in gaming clock for clock, once the FSB was raised a bit.

    And while the C2D has a strong FPU, it is FAR stronger in INT.

    Just check the Spec scores.. C2D opens a can of whoopass on the K8 in INT..

  25. #50
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    umm P6 architecture has always been strong for Integer math
