Page 4 of 5
Results 76 to 100 of 101

Thread: Dresdenboys' blog: AMD Bulldozer - Patent based research part 2

  1. #76
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Quote Originally Posted by informal View Post
    They have had a more than minimally functional Bobcat for a while now, and you only heard about it a few days ago. And Bobcat is not as big (in every possible sense of the word) as BD will be.
    Agreed, Bobcat is about 1/10th the complexity of BD.

    Not sure about "a while now". In the best-case interpretation of the CC remarks (that they concerned Llano + Bobcat as opposed to Llano desktop + Llano mobile), we know they had internal samples of Bobcat at the time of the CC, from which they are "learning a lot." They did not say how long they have had such samples, at least not in the CC.

  2. #77
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by qcmadness View Post
    But then it could not explain the lower DDR2 memory bandwidth when comparing K10 with K8?
    That could be a number of things. informal mentioned the memory controller, which is a likely cause.

  3. #78
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    Quote Originally Posted by Chumbucket843 View Post
    The concept that ILP is what is going to give Bulldozer its advantage is wrong. ILP has been mined out; you will be wasting transistors trying to improve it. It's perf/mm² and memory bandwidth.
    http://www.bloobble.com/broadband-pr...ns?itemid=2763
    see slide 6.
    Technically both true and false.

    For the standard case you are absolutely correct, but there still exist more than a dozen cases for which a couple of orders of magnitude better performance can be had. [Cryptography, for example]

    As for integer code, you are almost correct; approximately 10-15% better performance is theoretically possible for 64-bit code. [32-bit code has only 1.3% remaining]
    Fast computers breed slow, lazy programmers
    The price of reliability is the pursuit of the utmost simplicity. It is a price which the very rich find most hard to pay.
    http://www.lighterra.com/papers/modernmicroprocessors/
    Modern Ram, makes an old overclocker miss BH-5 and the fun it was

  4. #79
    Xtreme Enthusiast
    Join Date
    Oct 2007
    Location
    Singapore
    Posts
    970
    Quote Originally Posted by qcmadness View Post
    The problem is the design of L3 cache.

    Since L3 cache is slower than dual-channel DDR-3, memory bandwidth is limited to L3 cache speed.
    L3 cache is slower than Dual-DDR3?
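    For reference, the theoretical peak of a dual-channel DDR3 interface follows from the transfer rate and the 64-bit channel width. A minimal sketch of that arithmetic; the DDR3-1333 speed grade is an assumption picked for illustration, and the L3 bandwidth of any particular CPU would have to be measured rather than computed this way.

```python
def peak_dram_bandwidth_gbs(mega_transfers_per_sec, channels, bus_width_bits=64):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8  # 64-bit channel -> 8 bytes
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer * channels / 1e9

# Assumed example: DDR3-1333 (1333 MT/s) in dual-channel mode.
dual_ddr3_1333 = peak_dram_bandwidth_gbs(1333, channels=2)
print(f"Dual-channel DDR3-1333 peak: {dual_ddr3_1333:.1f} GB/s")
```

    This is an upper bound only; sustained bandwidth is lower because of refresh, bank conflicts and controller overhead.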
    Main Rig:
    Processor & Motherboard:AMD Ryzen5 1400 ' Gigabyte B450M-DS3H
    Random Access Memory Module:Adata XPG DDR4 3000 MHz 2x8GB
    Graphic Card:XFX RX 580 4GB
    Power Supply Unit:FSP AURUM 92+ Series PT-650M
    Storage Unit:Crucial MX 500 240GB SATA III SSD
    Processor Heatsink Fan:AMD Wraith Spire RGB
    Chasis:Thermaltake Level 10GTS Black

  5. #80
    Xtreme Enthusiast
    Join Date
    Oct 2007
    Location
    Singapore
    Posts
    970
    Quote Originally Posted by Manicdan View Post
    OK, so because AM2 is dual channel, we had to make AM3 dual channel, and now AM3+ is also going to be only dual channel. That's pretty sucky.

    I understand backwards compatibility is nice and all, but BD being limited because of the socket used by Athlon X2s sounds like it's going to hurt more than help. Those 2 extra channels could be 5-10% more IPC.

    I hope they split BD across 2 sockets, current and quad channel. Let us decide if we want a $300 CPU put into a 2-year-old motherboard, or to buy a new mobo and CPU at the same time for that extra perf.

    Perhaps 2%?
    Extra memory bandwidth doesn't boost IPC a lot, since the L1 and L2 caches are fast enough to feed the processing units.
    Compare i5 and i7 and you will see whether triple channel is a lot faster than dual channel.
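    The reasoning is that most accesses hit in L1 or L2, so a faster memory system only helps the small fraction that misses both. A rough sketch using the standard average-memory-access-time formula; every hit rate and latency below is a made-up illustrative value, and treating extra channels as a lower effective memory latency is itself a simplifying assumption.

```python
def amat(l1_hit, l1_lat, l2_hit, l2_lat, mem_lat):
    """Average memory access time in cycles for a two-level cache hierarchy.

    l1_hit and l2_hit are local hit rates (fraction of the accesses that
    reach that level and hit there).
    """
    l1_miss = 1.0 - l1_hit
    l2_miss = 1.0 - l2_hit
    return l1_lat + l1_miss * (l2_lat + l2_miss * mem_lat)

# Hypothetical numbers: 95% L1 hits, 80% L2 hits, memory at 200 vs 180 cycles.
base = amat(0.95, 3, 0.80, 15, 200)
faster_memory = amat(0.95, 3, 0.80, 15, 180)
gain = (base - faster_memory) / base
print(f"AMAT {base:.2f} -> {faster_memory:.2f} cycles ({gain:.1%} better)")
```

    With these assumed values a 10% faster memory system improves the average access time by only a few percent, which is the shape of the argument above. Bandwidth-bound workloads would behave differently.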

  6. #81
    Xtreme Enthusiast
    Join Date
    Oct 2007
    Location
    Singapore
    Posts
    970
    Quote Originally Posted by xlink View Post
    Looks like I won't be going for AMD this gen unless BD is truly exceptional.

    Shame, I was hoping for a drop-in replacement. Oh well.
    According to your spec, just drop in a 1090T BE in 2011 and it would be fine...

  7. #82
    Xtreme Addict
    Join Date
    May 2004
    Posts
    1,755
    Quote Originally Posted by haylui View Post
    L3 cache is slower than Dual-DDR3?
    Quote Originally Posted by haylui View Post

    Perhaps 2%?
    Extra memory bandwidth doesn't boost IPC a lot, since the L1 and L2 caches are fast enough to feed the processing units.
    Compare i5 and i7 and you will see whether triple channel is a lot faster than dual channel.
    Quote Originally Posted by haylui View Post
    According to your spec, just drop in a 1090T BE in 2011 and it would be fine...
    There is a multi-quote button for a good reason, dude.
    Crosshair IV Formula
    Phenom II X4 955 @ 3.7G
    6950~>6970 @ 900/1300
    4 x 2G Ballistix 1333 CL6
    C300 64G
    Corsair TX 850W
    CM HAF 932

  8. #83
    I am Xtreme
    Join Date
    Dec 2007
    Posts
    7,750
    Quote Originally Posted by haylui View Post

    Perhaps 2%?
    Extra memory bandwidth doesn't boost IPC a lot, since the L1 and L2 caches are fast enough to feed the processing units.
    Compare i5 and i7 and you will see whether triple channel is a lot faster than dual channel.
    Bold the word "could" next time, because some apps love extra RAM speed and some like tight timings. The fact is that 5-10% better perf can easily turn into massive profit margins, depending on how close these are to the competition.

    This is Xtreme Systems, and quad-channel desktop memory is quite extreme, I'd say.

  9. #84
    Xtreme Member
    Join Date
    Nov 2008
    Posts
    117
    I can't access his blog.
    When AMD had 64-bit and Intel had only 32-bit, they tried to tell the world there was no need for 64-bit. Until they got 64-bit.
    When AMD had IMC and Intel had FSB, they told the world "there is plenty of life left in the FSB" (actual quote, and yes, they had *math* to show it had more bandwidth). Until they got an IMC.
    When AMD had dual core and Intel had single core, they told the world that consumers don't need multi core. Until they got dual core.
    When intel was using MCM, they said it was a better solution than native dies. Until they got native dies. (To be fair, we knocked *unconnected* MCM, and still do, we never knocked MCM as a technology, so hold your flames.)
    by John Fruehe

  10. #85
    Xtreme Addict
    Join Date
    Jan 2009
    Posts
    1,445
    Quote Originally Posted by vietthanhpro View Post
    I can't access his blog.
    Really? I just tried and it worked for me.

    Here is a link to the latest entry.

    http://citavia.blog.de/2010/04/22/pr...lated-8429143/

    [MOBO] Asus CrossHair Formula 5 AM3+
    [GPU] ATI 6970 x2 Crossfire 2Gb
    [RAM] G.SKILL Ripjaws X Series 16GB (4 x 4GB) 240-Pin DDR3 1600
    [CPU] AMD FX-8120 @ 4.8 ghz
    [COOLER] XSPC Rasa 750 RS360 WaterCooling
    [OS] Windows 8 x64 Enterprise
    [HDD] OCZ Vertex 3 120GB SSD
    [AUDIO] Logitech S-220 17 Watts 2.1

  11. #86
    Xtreme Enthusiast
    Join Date
    Oct 2007
    Location
    Singapore
    Posts
    970
    Quote Originally Posted by Manicdan View Post
    Bold the word "could" next time, because some apps love extra RAM speed and some like tight timings. The fact is that 5-10% better perf can easily turn into massive profit margins, depending on how close these are to the competition.

    This is Xtreme Systems, and quad-channel desktop memory is quite extreme, I'd say.

    I can't deny the benefit of wider memory bandwidth, but I think AMD addresses that well enough in the server space, on the C32 and G34 sockets.
    As for the timing concern, I don't think manufacturers would tune tight timings for the memory system. They would rather use higher-frequency modules to achieve the same performance as tight-timing modules.
    On the consumer end, not many apps are memory hungry. Performance mostly depends on an efficient cache system, branch prediction, and how many instructions can be fetched per cycle.

  12. #87
    Registered User
    Join Date
    Dec 2008
    Location
    Viet Nam
    Posts
    53
    Quote Originally Posted by vietthanhpro View Post
    I can't access his blog.
    You should use Opera 10 to access this page. Enabling Opera Turbo is required.

    // Thành, are you on Viettel? At my place Viettel can't reach the BD blog either; you have to use Opera Turbo.

  13. #88
    Xtreme Addict
    Join Date
    Jan 2009
    Posts
    1,445
    Quote Originally Posted by amdcian View Post
    You should use Opera 10 to access this page. Enabling Opera Turbo is required.

    // Thành, are you on Viettel? At my place Viettel can't reach the BD blog either; you have to use Opera Turbo.
    What? I am using IE and Firefox, both of which work just fine for accessing the blog.

  14. #89
    Xtreme Member
    Join Date
    Nov 2008
    Posts
    117
    Quote Originally Posted by amdcian View Post
    You should use Opera 10 to access this page. Enabling Opera Turbo is required.

    // Thành, are you on Viettel? At my place Viettel can't reach the BD blog either; you have to use Opera Turbo.
    VNPT, not Viettel. Next month, FPT.

    1)
    Pipe 0 -> multiplier, simple ops (add, subtract, logical)
    Pipe 1 -> AGU-like, barrel shifter, branch (both direct & indirect), simple ops
    Pipe 2 -> ABM, simple ops
    Pipe 3 -> AGU-like, barrel shifter, branch (both types too), simple ops
    Pipes 1 and 3: ALU + AGU, not just AGU?
    2) I can't see an FMISC unit.
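    The pipe assignment listed above can be written down as a small lookup table. A minimal sketch; the op-class names ("mul", "shift", "abm" and so on) are hypothetical labels chosen here, and the mapping itself is only the suggestion under discussion, not anything AMD has confirmed.

```python
# Pipe capabilities as listed in the post (speculative, not confirmed by AMD).
PIPES = {
    0: {"mul", "simple"},
    1: {"agu", "shift", "branch", "simple"},
    2: {"abm", "simple"},
    3: {"agu", "shift", "branch", "simple"},
}

def eligible_pipes(op_class):
    """Return the sorted list of pipes that could execute the given op class."""
    return sorted(pipe for pipe, caps in PIPES.items() if op_class in caps)

for op in ("simple", "mul", "shift", "branch", "abm"):
    print(f"{op:>6} -> pipes {eligible_pipes(op)}")
```

    Simple ops can go to any of the four pipes, while each specialised op class has only one or two candidates, which keeps the per-instruction choice small.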
    Originally Posted by haylui View Post
    L3 cache is slower than Dual-DDR3?


    -----------------
    Source: http://www.freepatentsonline.com/6167503.html
    IDU: Instruction distribution unit
    RRU: Register renaming unit
    IDB: Instruction dispatch buffer
    ISC: Instruction scheduling controller
    RFBC: Register file/bypass circuit
    EU: Execution unit
    TSB: Transfer staging buffer
    Attached thumbnail: Bull module.JPG
    Last edited by vietthanhpro; 04-22-2010 at 10:51 PM.

  15. #90
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by vietthanhpro View Post
    I can't access his blog.
    http://support.mozilla.com/de/forum/1/653217

    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  16. #91
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by vietthanhpro View Post
    1)
    Pipe 0 -> multiplier, simple ops (add, subtract, logical)
    Pipe 1 -> AGU-like, barrel shifter, branch (both direct & indirect), simple ops
    Pipe 2 -> ABM, simple ops
    Pipe 3 -> AGU-like, barrel shifter, branch (both types too), simple ops
    Pipes 1 and 3: ALU + AGU, not just AGU?
    2) I can't see an FMISC unit.
    1) That was a suggestion in a comment by Wireloop. It might work this way, but we don't have any evidence for it. But absence of evidence is not evidence of absence.

    I now have an even more revolutionary model of the execution units in mind. Please be patient for more details.

    2) Do we need an FMISC unit? It might be there, but the things handled by the current FMISC unit could be handled by the other FP units as well.

  17. #92
    Registered User
    Join Date
    Sep 2009
    Posts
    77


    What does "RM" stand for, and what effect will it have?

  18. #93
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by superrugal View Post
    What does "RM" stand for, and what effect will it have?
    That stands for "Resource Monitor", which could track resource utilization, latencies and the like. You could use them for some adaptations.

  19. #94
    Registered User
    Join Date
    Sep 2009
    Posts
    77
    Quote Originally Posted by Dresdenboy View Post
    That stands for "Resource Monitor", which could track resource utilization, latencies and the like. You could use them for some adaptations.
    Thanks for your reply.
    Is "track resource utilization" very important for Bulldozer? Has AMD ever made use of this "Resource Monitor" before?

  20. #95
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by superrugal View Post
    Thanks for your reply.
    Is "track resource utilization" very important for Bulldozer? Has AMD ever made use of this "Resource Monitor" before?
    Well, it's not even certain that it will be used in BD. See an old blog entry of mine for more details.

  21. #96
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    Quote Originally Posted by Dresdenboy View Post
    1) That was a suggestion in a comment by Wireloop. It might work this way, but we don't have any evidence for it. But absence of evidence is not evidence of absence.

    I now have an even more revolutionary model of the execution units in mind. Please be patient for more details.

    2) Do we need an FMISC unit? It might be there, but the things handled by the current FMISC unit could be handled by the other FP units as well.
    1) Lack of disproof is not proof.

    2) You mean FSTORE, which does not appear to be all that important (it's used for FST(P), FLD(CONST) and "miscellaneous" instructions). But the reason for its existence was that it very cheaply enabled a few extra instructions to be executed in each clock cycle. In those edge cases, the addition of FSTORE effectively gave 50% better performance than if the other two FP units had just handled its work. It effectively meant adding a few more traces, tweaking the scheduler, and adding a few transistors to the die.

    What you accidentally stumbled upon is the age-old argument:
    whether the ALUs/FPUs are symmetric and can execute almost any integer instruction, or are asymmetric and slightly more restrictive. Each of the lanes must be nearly identical for distributed schedulers and instruction grouping to work optimally and provide maximum performance. But if you want to save power and die space, asymmetric units with a centralized scheduler will get the job done more efficiently.

    Now, what AMD did with the FPU in K7, and subsequently K8 and K10, was to keep a centralized scheduler for the FPU, because the cost of all the extra transistors was deemed unacceptable.

    But given that transistors are cheaper now and as we approach 22nm, if AMD wanted better performance it could now afford to match the integer design and move the FPU to a distributed scheduler, which would enable them to finally reach the last bit of performance remaining in x86 and a large chunk of x86_64's remaining performance.
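    The "50% in edge cases" figure for a third FP pipe can be reproduced with a toy issue model. This is a sketch under loud assumptions: all ops are independent, each pipe issues one op per cycle, and the two-pipe variant in which either remaining pipe also accepts store-class ops is hypothetical, introduced only for comparison.

```python
from collections import Counter

def cycles_to_drain(op_counts, pipes):
    """Cycles needed to issue all independent ops, one op per pipe per cycle."""
    remaining = Counter(op_counts)
    cycles = 0
    while sum(remaining.values()) > 0:
        issued = 0
        for caps in pipes:
            # Op classes this pipe supports that still have work left.
            candidates = [c for c in sorted(caps) if remaining[c] > 0]
            if candidates:
                # Greedy choice: take from the class with the most work left.
                pick = max(candidates, key=lambda c: remaining[c])
                remaining[pick] -= 1
                issued += 1
        if issued == 0:
            raise ValueError("some op class is not supported by any pipe")
        cycles += 1
    return cycles

# Edge case: an evenly mixed stream of add, mul and store-class ops.
ops = {"add": 100, "mul": 100, "store": 100}
three_pipes = cycles_to_drain(ops, [{"add"}, {"mul"}, {"store"}])
two_pipes = cycles_to_drain(ops, [{"add", "store"}, {"mul", "store"}])
print(three_pipes, two_pipes, two_pipes / three_pipes)
```

    For this evenly mixed stream the three-pipe layout needs 100 cycles and the two-pipe layout 150, a 1.5x difference. Streams with few store-class ops would see far less benefit.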

  22. #97
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by nn_step View Post
    2) You mean FSTORE, which does not appear to be all that important (it's used for FST(P), FLD(CONST) and "miscellaneous" instructions). But the reason for its existence was that it very cheaply enabled a few extra instructions to be executed in each clock cycle. In those edge cases, the addition of FSTORE effectively gave 50% better performance than if the other two FP units had just handled its work. It effectively meant adding a few more traces, tweaking the scheduler, and adding a few transistors to the die.
    ...
    What you accidentally stumbled upon is the age-old argument:
    whether the ALUs/FPUs are symmetric and can execute almost any integer instruction, or are asymmetric and slightly more restrictive. Each of the lanes must be nearly identical for distributed schedulers and instruction grouping to work optimally and provide maximum performance. But if you want to save power and die space, asymmetric units with a centralized scheduler will get the job done more efficiently.

    Now, what AMD did with the FPU in K7, and subsequently K8 and K10, was to keep a centralized scheduler for the FPU, because the cost of all the extra transistors was deemed unacceptable.

    But given that transistors are cheaper now and as we approach 22nm, if AMD wanted better performance it could now afford to match the integer design and move the FPU to a distributed scheduler, which would enable them to finally reach the last bit of performance remaining in x86 and a large chunk of x86_64's remaining performance.
    AMD also used the term FMISC for the FSTORE unit (e.g. in 25112.pdf).

    Several AMD patents already describe the scheduler (included in the exemplary architecture) as having free choice. But since you have to take care of program order, dependencies, flags, speculative state, etc., this might be more difficult with many units to issue to (I think the complexity grows about quadratically). But with only 2 ALUs and only 2 AGUs such things could be resolved easily. Also, Wireloop's suggestion would make things easier, because for many instruction types there are only one or two units to choose from. But these are details for which we have to wait until AMD discloses them.
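    One simple way to see where a roughly quadratic cost could come from is to count result-forwarding paths. This is a sketch of one common simplified model, not something taken from the patents: it assumes every unit's result can be forwarded to every operand input of every unit, and that each unit has two source operands.

```python
def bypass_paths(units, operands_per_unit=2):
    """Forwarding paths if each unit's result can feed every operand input."""
    return units * units * operands_per_unit

for n in (2, 4, 8):
    print(f"{n} units -> {bypass_paths(n)} forwarding paths")
```

    Doubling the number of units quadruples the path count in this model, which is why restricting each instruction type to one or two candidate units is attractive.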

    And I think the BD design is not about cheaper transistors but about more expensive energy consumption (with leakage becoming more important). And thanks to the SMT-like execution on BD's FPU, this unit doesn't need to be of the Speedy Gonzales type and can afford somewhat longer latencies due to more power-efficient scheduling.

  23. #98
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    Quote Originally Posted by Dresdenboy View Post
    AMD also used the term FMISC for the FSTORE unit (e.g. in 25112.pdf).

    Several AMD patents already describe the scheduler (included in the exemplary architecture) as having free choice. But since you have to take care of program order, dependencies, flags, speculative state, etc., this might be more difficult with many units to issue to (I think the complexity grows about quadratically). But with only 2 ALUs and only 2 AGUs such things could be resolved easily. Also, Wireloop's suggestion would make things easier, because for many instruction types there are only one or two units to choose from. But these are details for which we have to wait until AMD discloses them.

    And I think the BD design is not about cheaper transistors but about more expensive energy consumption (with leakage becoming more important). And thanks to the SMT-like execution on BD's FPU, this unit doesn't need to be of the Speedy Gonzales type and can afford somewhat longer latencies due to more power-efficient scheduling.

    Well, given the performance advantages of distributed scheduling, it seems likely that AMD will go that route for Bulldozer. [Added to the fact that it makes adding more units simple and cheap]

    However, I am honestly curious whether for Bobcat AMD is going to go the other way and use a centralized scheduler to reduce transistors.

  24. #99
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by vietthanhpro View Post
    -----------------
    Source: http://www.freepatentsonline.com/6167503.html
    IDU: Instruction distribution unit
    RRU: Register renaming unit
    IDB: Instruction dispatch buffer
    ISC: Instruction scheduling controller
    RFBC: Register file/bypass circuit
    EU: Execution unit
    TSB: Transfer staging buffer
    Yes, Norman Jouppi had and has nice ideas.

    How about David Witt (AMD) in 1998 (year of filing):

    http://www.freepatentsonline.com/6119223.html

  25. #100
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by nn_step View Post
    Well, given the performance advantages of distributed scheduling, it seems likely that AMD will go that route for Bulldozer. [Added to the fact that it makes adding more units simple and cheap]

    However, I am honestly curious whether for Bobcat AMD is going to go the other way and use a centralized scheduler to reduce transistors.
    It looks like we have stronger support for 4 ALUs + 4 AGUs per core now:
    http://citavia.blog.de/2010/04/28/an...again-8474038/

    And here is a teaser for the other stuff mentioned there:
