Page 28 of 29
Results 676 to 700 of 719

Thread: AMD cuts to the core with 'Bulldozer' Opterons

  1. #676
    Xtreme Enthusiast
    Join Date
    Oct 2008
    Posts
    678
    I have to say that AMD's work on the 128-bit FPU seems to be more of a cut-and-paste job; it could be more space-efficient than it is, though I guess you have to accept that given what an impressive last-minute solution Phenom seems to be. Meanwhile, it's hard to believe that Intel doubled their FPU with only a very minor size increase.
    Speaking of AMD's 128-bit FPU work, that is kind of AMD's thing at the moment: everything they are selling is the same old chip from 1999 with lots of add-ons bolted on. Intel has the money to rearrange the entire core every generation, and with that kind of resources it isn't hard to see why they are currently on top. Their cores are twice as big and a bit better tuned for today's manufacturing capabilities.

    I'm glad that Bulldozer is around the corner.

  2. #677
    Xtreme Member
    Join Date
    Aug 2006
    Posts
    114
    I thought Core was largely derived from the original Pentium family, mainly the P3 mixed with a little of the front end of the P4. Sorry, that's probably a little oversimplified, but that's my basic understanding. No processor design is started from scratch; you take what works and build on it... Bulldozer will be the largest step away from the base Athlon architecture in a long time, since the Athlon 64. Hope it turns out as well as the Athlon 64 did *crosses fingers*
    phenom 2 940 stock
    gskill 4gb 1066 ddr2
    2 1.5Tb seagate hds in raid 0
    30gb ocz core series hd for os
    8800gts 640
    xigamatek 850w ps
    water cooling cpu: dtek fuzion 2, swiftech 320, 3 ultra kazes, d5 with detroit top
    custom acrylic case in progress :P

  3. #678
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    Actually the whole Core generation is based on Yonah, which in turn was the latest souped-up P6 microarchitecture. They did invest massively in the Yonah -> Merom change, as far as the core is concerned (50% bigger core logic on the same process). Nehalem added another ~20% on top of that. The Westmere -> SB change looks pale next to the Penryn -> Nehalem change, and anemic compared to the Yonah -> Merom case.

  4. #679
    Xtreme Enthusiast
    Join Date
    Oct 2008
    Posts
    678
    Quote Originally Posted by jakefalcons View Post
    I thought Core was largely derived from the original Pentium family, mainly the P3 mixed with a little of the front end of the P4. Sorry, that's probably a little oversimplified, but that's my basic understanding. No processor design is started from scratch; you take what works and build on it... Bulldozer will be the largest step away from the base Athlon architecture in a long time, since the Athlon 64. Hope it turns out as well as the Athlon 64 did *crosses fingers*
    Based on and bolted on aren't the same thing. Look at the dies from P6 to Dothan, Yonah, Conroe, Penryn and Bloomfield. Ignore the caches and I/O, just look at the cores, and they're totally different almost every time. Meanwhile a Phenom II core looks just like a K7 with an extra FPU.

    And I don't think Bulldozer will be comparable to the K7 -> K8 step. This is more like K6 -> K7, in other words totally different. I don't even think you'll be able to identify a part of the die that is still the same.

    We might actually know by the end of the month.


    JF-AMD? Any ideas on when we can see die shots?

  5. #680
    Xtreme Member
    Join Date
    Aug 2004
    Posts
    210
    Quote Originally Posted by -Boris- View Post
    I have to say that AMD's work on the 128-bit FPU seems to be more of a cut-and-paste job; it could be more space-efficient than it is, though I guess you have to accept that given what an impressive last-minute solution Phenom seems to be. Meanwhile, it's hard to believe that Intel doubled their FPU with only a very minor size increase.
    No, it can't be more space-efficient. If you replace your 4-cylinder car engine with an 8-cylinder engine, it will need double the space, too (if each cylinder is of similar size).

    It is as simple as that. Double power -> double size.

    However, if you just want more horsepower, you can try to build in a turbocharger/intercooler setup instead of the +4 cylinders. That would be the rough equivalent of Sandy's hyper-pipelined FPU ;-)

  6. #681
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by savantu View Post
    Instead of trying to find some weirdo explanations, we could take his words at face value. The words he uses are pretty straightforward :
    I wouldn't be surprised of some intentional misleading done previously for deceiving the competition.
    Looking at Mark's answer again with a little more time, I see that he refers to using 128-bit-wide EUs (point 1 by Eric) over 2 consecutive clock cycles as double pumping (and thus executing a 256-bit SIMD instruction). This is a different kind of double pumping, using base clock cycles as the smallest unit.

    Ok. Let me cite another Intel employee posting in the same thread (who has been quoted here already, IIRC):
    It seems point 1) may have assumed it requires monolithic 256-bit hardware to achieve 1 cycle throughput for 256-bit AVX instructions. That's not true.
    By "misleading" do you mean the first AVX LO/HI chart? I think there simply wasn't enough time between publishing the first chart and the corrected one for it to have any meaningful effect.

    OTOH saving die area and thus leakage (which would otherwise limit performance) doesn't look like a bad choice. So why are you defending a monolithic implementation so heavily?

    @Hans:
    Thanks for the links.
    Last edited by Dresdenboy; 08-11-2010 at 05:52 AM.
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  7. #682
    Xtreme Enthusiast
    Join Date
    Oct 2008
    Posts
    678
    Quote Originally Posted by Opteron146 View Post
    No, it can't be more space-efficient. If you replace your 4-cylinder car engine with an 8-cylinder engine, it will need double the space, too (if each cylinder is of similar size).

    It is as simple as that. Double power -> double size.

    However, if you just want more horsepower, you can try to build in a turbocharger/intercooler setup instead of the +4 cylinders. That would be the rough equivalent of Sandy's hyper-pipelined FPU ;-)
    Well, some parts are shared, not everything is doubled, so AMD has a chunk of silicon sitting next to the second FPU pretty much unused.

    There is a lot of empty space on AMD's chips nowadays.

  8. #683
    Xtreme Member
    Join Date
    Aug 2004
    Posts
    210
    Quote Originally Posted by -Boris- View Post
    Well, some parts are shared, not everything is doubled, so AMD has a chunk of silicon sitting next to the second FPU pretty much unused.
    The stuff that is not doubled is the 3DNow! legacy area; no need to double it, 'cause there is no need to double 3DNow! ;-)

    But the main pipelines, which use up most of the FPU area, are doubled. That's the important fact.

  9. #684
    Xtreme Addict
    Join Date
    Apr 2007
    Location
    canada
    Posts
    1,886
    Good read over the last 3 pages... keep it up, guys.
    WILL CUDDLE FOR FOOD

    Quote Originally Posted by JF-AMD View Post
    Dual proc client systems are like sex in high school. Everyone talks about it but nobody is really doing it.

  10. #685
    Xtreme Enthusiast
    Join Date
    Oct 2008
    Posts
    678
    Quote Originally Posted by Opteron146 View Post
    The stuff that is not doubled is the 3DNow! legacy area; no need to double it, 'cause there is no need to double 3DNow! ;-)
    What? But I want 128-bit 3DNow! Pro++.

    And there is still wasted space; if they had the resources for a complete overhaul of the layout they would save some, and possibly increase performance.

  11. #686
    Xtreme Enthusiast
    Join Date
    Dec 2009
    Posts
    846
    Quote Originally Posted by Opteron146 View Post
    No, it can't be more space-efficient. If you replace your 4-cylinder car engine with an 8-cylinder engine, it will need double the space, too (if each cylinder is of similar size).

    It is as simple as that. Double power -> double size.

    However, if you just want more horsepower, you can try to build in a turbocharger/intercooler setup instead of the +4 cylinders. That would be the rough equivalent of Sandy's hyper-pipelined FPU ;-)
    Wow, that is pretty much a description of the whole Bulldozer architecture. No need to double everything, only the things that matter.
    While I work for AMD, my posts are my own opinions.

    http://blogs.amd.com/work/author/jfruehe/

  12. #687
    Xtreme Enthusiast
    Join Date
    Dec 2009
    Posts
    846
    Quote Originally Posted by -Boris- View Post

    JF-AMD? Any ideas on when we can see die shots?
    Die shots will not be at Hot Chips; we aren't disclosing those just yet.
    While I work for AMD, my posts are my own opinions.

    http://blogs.amd.com/work/author/jfruehe/

  13. #688
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by Opteron146 View Post
    However, if you just want more horsepower, you can try to build in a turbocharger/intercooler setup instead of the +4 cylinders. That would be the rough equivalent of Sandy's hyper-pipelined FPU ;-)
    Hmm.. with all these design options there are also several ways to implement a decode/dispatch unit for Bulldozer, which might not only decode up to 4 x86 ops per cycle but - if some requirements are met - even up to 8 ops per cycle. I'm thinking of the fast/slow path concept and also double pumping here. A fast path (for simple/double decode ops) could be used twice per base clock cycle if it's that fast.
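    The decode speculation above can be put into a back-of-the-envelope model. Everything here is a hypothetical illustration of the fast/slow-path idea (the 4 decode slots and a fast path usable twice per base clock mirror the post's speculation), not a description of real Bulldozer hardware:

    ```python
    # Speculative model of a fast/slow-path decoder: slots that hit the
    # double-pumped fast path in a base cycle deliver two ops instead of one.

    def decoded_ops_per_cycle(slots, fastpath_fraction):
        # fastpath_fraction: share of decode slots simple enough for the
        # double-pumped fast path in a given base clock cycle.
        return slots * (1.0 + fastpath_fraction)

    print(decoded_ops_per_cycle(4, 0.0))  # 4.0 ops/cycle: slow path only
    print(decoded_ops_per_cycle(4, 1.0))  # 8.0 ops/cycle: all fast path
    ```

    In between, a mix of simple and complex instructions lands somewhere between 4 and 8 ops per base cycle.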
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  14. #689
    Xtreme Addict
    Join Date
    Apr 2007
    Location
    canada
    Posts
    1,886
    Quote Originally Posted by JF-AMD View Post
    Die shots will not be at Hot Chips; we aren't disclosing those just yet.


    I presume die shots would be available closer to the actual release date, am I right?
    WILL CUDDLE FOR FOOD

    Quote Originally Posted by JF-AMD View Post
    Dual proc client systems are like sex in high school. Everyone talks about it but nobody is really doing it.

  15. #690
    Xtreme Enthusiast
    Join Date
    Dec 2009
    Posts
    846
    Quote Originally Posted by Sn0wm@n View Post
    I presume die shots would be available closer to the actual release date, am I right?
    yes
    While I work for AMD, my posts are my own opinions.

    http://blogs.amd.com/work/author/jfruehe/

  16. #691
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by Dresdenboy View Post
    OTOH saving die area and thus leakage (which would limit performance otherwise) doesn't look to be a bad choice. So why are you defending a monolithic implementation so heavily?
    Die area does not equate to leakage; transistors do.

    Faster transistors also increase Ioff exponentially.

  17. #692
    Xtreme Addict
    Join Date
    Apr 2007
    Location
    canada
    Posts
    1,886
    Can't wait for that particular moment... so much anticipation has been building over the last 3 years.
    WILL CUDDLE FOR FOOD

    Quote Originally Posted by JF-AMD View Post
    Dual proc client systems are like sex in high school. Everyone talks about it but nobody is really doing it.

  18. #693
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by Chumbucket843 View Post
    Die area does not equate to leakage; transistors do.

    Faster transistors also increase Ioff exponentially.
    I admit I used a simplification. So the transistors populating said die area will leak less.

    But why do you mention faster transistors? Pipelining per se is about cutting work into smaller pieces, not doing it with faster transistors. Otherwise I probably wouldn't need to further pipeline the circuit in question.

    Example: an FP multiplier has a latency of 1000 ps, or 1 ns. So if I use it in one pipeline stage (for simplicity we leave out operand catching etc.) I could clock it at 1 GHz, being able to feed it once per 1 ns and get a result at the same rate. Latency is just one cycle.

    With 2 pipeline stages, some additional latch overhead and some inefficiency due to cutting the multiplier into two roughly equally fast pieces, the overall latency could become 1100 ps. But I could clock it at 1.8 GHz with two stages of 550 ps each, feed the multiplier at that rate and get results at that rate. Latency would be 2 cycles. Another slight disadvantage would be that there could be up to 2 multiplications in flight at any time vs. one in the 1-cycle version. Two muls mean more power consumption. OTOH I don't increase energy per instruction.

    That's the principle.
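    The worked example above can be sketched in a few lines; this is a toy model using only the post's own numbers (1000 ps of combinational delay, ~100 ps of latch overhead per extra cut), while real designs have more overheads:

    ```python
    # Toy model of the pipelining trade-off: more stages shorten the
    # per-stage critical path (raising the clock) at the cost of some
    # latch overhead added to the end-to-end latency.

    def pipeline_stats(total_delay_ps, stages, overhead_ps_per_cut=100):
        # Each extra cut adds latch overhead to the total latency.
        latency_ps = total_delay_ps + overhead_ps_per_cut * (stages - 1)
        stage_ps = latency_ps / stages   # critical path per stage
        clock_ghz = 1000.0 / stage_ps    # 1000 ps in a nanosecond
        return clock_ghz, latency_ps

    print(pipeline_stats(1000, 1))  # 1.0 GHz, 1000 ps total latency
    print(pipeline_stats(1000, 2))  # ~1.82 GHz, 1100 ps total latency
    ```

    Throughput stays at one result per cycle in both cases; the two-stage version simply has more, faster cycles.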
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  19. #694
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by Hans de Vries View Post
    It's no coincidence that the architects behind the very long SIMD words
    (256 bit, 512 bit and longer) are Doug Carmean and Eric Sprangle who joined
    Intel from Ross technologies.

    These are exactly the Hyperpipelining specialists at Intel:

    (1) They co-authored the original hyperpipelining paper:
    Increasing Processor Performance by Implementing Deeper Pipelines

    (2) They led the original ~60-stage hyperpipelined Nehalem project.
    http://www.theinquirer.net/inquirer/...em-slated-2005

    (3) They initiated the Larrabee project. One of the main ideas behind
    Larrabee is to achieve a theoretical maximum number of FLOPs on a
    certain die with a limited number of transistors. A fourfold hyperpipelined
    128-bit unit running at 4.8 GHz can produce 512-bit results at 1.2 GHz
    using only 25% (+ a bit) of the transistors of a non-hyperpipelined unit.
    ftp://download.intel.com/technology/...abee_paper.pdf
    http://www.drdobbs.com/high-performa...ting/216402188
    Your post is pure speculation. You really don't know who actually worked on the architecture of SB. Also, even if we assume those were the same people who worked on NetBurst, you still don't have even a little bit of info on how AVX was really implemented. I can remind you that "real" 128-bit SSE was first implemented by the Haifa team, which currently works on SB.


    The SIMD units are the easiest (of all units) to hyperpipeline. All instructions
    which could cause problems for hyperpipelining have been systematically
    left out of the AVX and LNI specifications. (for instance data shuffles
    crossing 128 bit boundaries)
    Your assumptions are not necessarily true. While a general multiplication algorithm looks relatively easy to serialize, that is not necessarily the case in real life, since multiple heuristics may be added to a HW algorithm to make it use less power/space, make it faster, etc. I heard it was a big challenge for Intel to implement the fast radix-16 divider in Penryn. Also, while I don't know how much space an FP multiplier consumes, I may assume that hyperpipelining could consume more space (for example, to save intermediate results in a multiplication loop) than implementing an additional multiplier. Anyway, even in NetBurst Intel implemented "double pumping" only for some integer ops and decided not to implement a double-pumped ALU for complex integer operations such as divide/multiply.
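    For what it's worth, the transistor arithmetic in point (3) of the quoted post can be checked directly. This sketch only reproduces the quoted numbers (128 bits, 4.8 GHz, fourfold hyperpipelining) and ignores the "+ a bit" of pipelining overhead the quote mentions; it takes no side in the feasibility debate:

    ```python
    # A narrow unit clocked N times faster can emulate an N-times-wider
    # SIMD unit at the base clock, using roughly 1/N of the datapath
    # transistors (pipelining overhead ignored in this sketch).

    def hyperpipelined_equivalent(narrow_bits, fast_clock_ghz, factor):
        wide_bits = narrow_bits * factor           # effective SIMD width
        result_rate_ghz = fast_clock_ghz / factor  # wide results per second
        transistor_share = 1.0 / factor            # cost vs. a full-width unit
        return wide_bits, result_rate_ghz, transistor_share

    # Fourfold hyperpipelined 128-bit unit at 4.8 GHz:
    print(hyperpipelined_equivalent(128, 4.8, 4))  # (512, 1.2, 0.25)
    ```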

  20. #695
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by kl0012 View Post
    Your post is pure speculation. You really don't know who actually worked on the architecture of SB. Also, even if we assume those were the same people who worked on NetBurst, you still don't have even a little bit of info on how AVX was really implemented. I can remind you that "real" 128-bit SSE was first implemented by the Haifa team, which currently works on SB.
    True. Sandy Bridge was designed by the Haifa team.

    Nehalem was Oregon, and so is Haswell.


    Your assumptions are not necessarily true. While a general multiplication algorithm looks relatively easy to serialize, that is not necessarily the case in real life, since multiple heuristics may be added to a HW algorithm to make it use less power/space, make it faster, etc. I heard it was a big challenge for Intel to implement the fast radix-16 divider in Penryn. Also, while I don't know how much space an FP multiplier consumes, I may assume that hyperpipelining could consume more space (for example, to save intermediate results in a multiplication loop) than implementing an additional multiplier. Anyway, even in NetBurst Intel implemented "double pumping" only for some integer ops and decided not to implement a double-pumped ALU for complex integer operations such as divide/multiply.
    Intel double pumped the ALUs and turned them into "fireballs" by their own admission. Double pumping the FPU sounds even crazier.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  21. #696
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Dresdenboy View Post
    I admit I used a simplification. So the transistors populating said die area will leak less.

    But why do you mention faster transistors? Pipelining per se is about cutting work into smaller pieces, not doing it with faster transistors. Otherwise I probably wouldn't need to further pipeline the circuit in question.

    Example: an FP multiplier has a latency of 1000 ps, or 1 ns. So if I use it in one pipeline stage (for simplicity we leave out operand catching etc.) I could clock it at 1 GHz, being able to feed it once per 1 ns and get a result at the same rate. Latency is just one cycle.

    With 2 pipeline stages, some additional latch overhead and some inefficiency due to cutting the multiplier into two roughly equally fast pieces, the overall latency could become 1100 ps. But I could clock it at 1.8 GHz with two stages of 550 ps each, feed the multiplier at that rate and get results at that rate. Latency would be 2 cycles. Another slight disadvantage would be that there could be up to 2 multiplications in flight at any time vs. one in the 1-cycle version. Two muls mean more power consumption. OTOH I don't increase energy per instruction.

    That's the principle.
    You contradict yourself in your own argument.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  22. #697
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by kl0012 View Post
    Your post is pure speculation. You really don't know who actually worked on the architecture of SB. Also, even if we assume those were the same people who worked on NetBurst, you still don't have even a little bit of info on how AVX was really implemented. I can remind you that "real" 128-bit SSE was first implemented by the Haifa team, which currently works on SB.

    Your assumptions are not necessarily true. While a general multiplication algorithm looks relatively easy to serialize, that is not necessarily the case in real life, since multiple heuristics may be added to a HW algorithm to make it use less power/space, make it faster, etc. I heard it was a big challenge for Intel to implement the fast radix-16 divider in Penryn. Also, while I don't know how much space an FP multiplier consumes, I may assume that hyperpipelining could consume more space (for example, to save intermediate results in a multiplication loop) than implementing an additional multiplier. Anyway, even in NetBurst Intel implemented "double pumping" only for some integer ops and decided not to implement a double-pumped ALU for complex integer operations such as divide/multiply.
    These observations about hyperpipelining are really beyond any
    reasonable doubt.

    I have designed IEEE-compatible floating-point units myself, not only the usual
    multiply/add ones but also much more complicated ones, like a fully pipelined
    FP complex-function unit which can output each cycle the result of any of a
    square root, reciprocal, exponent, logarithm, sine/cosine, arcsine/arccosine,
    while having any mix of these in the pipeline simultaneously.

    So I think you may trust me on this...


    Regards, Hans

  23. #698
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Well, a month is left until the autumn IDF, when things will get clearer on SB.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  24. #699
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by Hans de Vries View Post
    These observations about hyperpipelining are really beyond any
    reasonable doubt.

    I have designed IEEE-compatible floating-point units myself, not only the usual
    multiply/add ones but also much more complicated ones, like a fully pipelined
    FP complex-function unit which can output each cycle the result of any of a
    square root, reciprocal, exponent, logarithm, sine/cosine, arcsine/arccosine,
    while having any mix of these in the pipeline simultaneously.

    So I think you may trust me on this...


    Regards, Hans
    While I really believe you designed all that stuff (I designed some functional blocks myself, including an adder and a multiplier, during a university VLSI course), it is still a bit cheeky to think that no one else can do it better (or at least in a different way) than you.
    I also trust Intel's engineers about the difficulty of designing efficient execution units.
    http://www.intel.com/technology/itj/...er/8-radix.htm

  25. #700
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by Dresdenboy View Post
    I admit I used a simplification. So the transistors populating said die area will leak less.

    But why do you mention faster transistors? Pipelining per se is about cutting work into smaller pieces, not doing it with faster transistors. Otherwise I probably wouldn't need to further pipeline the circuit in question.
    Using die area as an estimate for static power can be very misleading. Leakage increases linearly with transistor count and exponentially with drive current.
    Example: an FP multiplier has a latency of 1000 ps, or 1 ns. So if I use it in one pipeline stage (for simplicity we leave out operand catching etc.) I could clock it at 1 GHz, being able to feed it once per 1 ns and get a result at the same rate. Latency is just one cycle.

    With 2 pipeline stages, some additional latch overhead and some inefficiency due to cutting the multiplier into two roughly equally fast pieces, the overall latency could become 1100 ps. But I could clock it at 1.8 GHz with two stages of 550 ps each, feed the multiplier at that rate and get results at that rate. Latency would be 2 cycles. Another slight disadvantage would be that there could be up to 2 multiplications in flight at any time vs. one in the 1-cycle version. Two muls mean more power consumption. OTOH I don't increase energy per instruction.

    That's the principle.
    I understand the concept of pipelining.

    If speed is critical you might want to use a pulsed latch.

