Thread: AMD cuts to the core with 'Bulldozer' Opterons

  1. #701
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by kl0012 View Post
    While I really believe you designed all that stuff (I myself designed some functional blocks, including an adder and a multiplier, during a university VLSI course), it is still a bit presumptuous to think that no one else can do it better (or at least in a different way) than you.
    I also trust Intel's engineers about the difficulty of designing efficient execution units.
    http://www.intel.com/technology/itj/...er/8-radix.htm
    If a design is "feedback free" then hyperpipelining is a fully automated
    process. Synopsys Design Compiler has had such features for years.
    The software has to find the locations to which the signals have
    propagated after a 1/2 cycle (or 1/N cycle in general) and then place
    the flipflops there which will constitute the intermediate pipeline stages.

    Intel (that is, the process guys) has probably developed its own
    software specifically adapted to Intel's own process physics model.


    "Feedback free" is when the logic doesn't need to know what it just
    calculated in the previous cycle. The P4 ALU was not feedback free
    because it needs results from the previous cycle:

    non-hyperpipelined:

    ----> C = A + B;
    ----
    ----> E = C + D;
    ----

    hyperpipelined:

    ----> C = A + B;
    ----> E = C + D;
    ---->

    The problem is that the result C is not yet known after a 1/2 cycle,
    so you have to design logic which works with the intermediate result
    instead. There exist tricks which do so. In general these tricks are
    different for different functions and sometimes there are no tricks.

    Hyperpipelining for circuits which are not "feedback free" is a science
    in its own right. The risk is that all these tricks blow up the size of
    the circuits as well as their power consumption.
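
    The difference between the two cases can be put into numbers with a minimal sketch. It assumes an idealised adder cut into equal stages with no flip-flop overhead; the function names and the choice of 8 operations are made up for illustration and do not describe any real design.

```python
def ticks_to_finish(num_ops: int, stages: int, dependent: bool) -> int:
    """Fast-clock ticks until the last of num_ops additions has its result.

    stages    -- number of pipeline stages the adder is cut into
    dependent -- True if every addition consumes the previous result
                 (the "feedback" case, e.g. C = A + B; E = C + D)
    """
    if num_ops == 0:
        return 0
    if dependent:
        # Each addition has to wait for the full latency of its predecessor.
        return num_ops * stages
    # Feedback free: a new addition enters the pipe on every tick.
    return stages + (num_ops - 1)


def total_time(num_ops: int, stages: int, dependent: bool) -> float:
    """Time in units of the original (unpipelined) clock period.

    Assumption: cutting the logic into `stages` equal pieces shortens the
    clock period to 1/stages of the original.
    """
    return ticks_to_finish(num_ops, stages, dependent) * (1.0 / stages)


if __name__ == "__main__":
    for dependent in (False, True):
        for stages in (1, 2):
            print(dependent, stages, total_time(8, stages, dependent))
```

    With these assumptions, eight independent additions take 8 original clock periods unpipelined and 4.5 with two stages, while eight dependent additions take 8 periods either way. The dependent chain only gets faster if the logic can start on an intermediate result, which is where the tricks come in.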


    Regards, Hans

  2. #702
    I am Xtreme FlanK3r's Avatar
    Join Date
    May 2008
    Location
    Czech republic
    Posts
    6,823
    Is there any idea how many transistors Bulldozer will have?
    ROG Power PCs - Intel and AMD
    CPUs:i9-7900X, i9-9900K, i7-6950X, i7-5960X, i7-8086K, i7-8700K, 4x i7-7700K, i3-7350K, 2x i7-6700K, i5-6600K, R7-2700X, 4x R5 2600X, R5 2400G, R3 1200, R7-1800X, R7-1700X, 3x AMD FX-9590, 1x AMD FX-9370, 4x AMD FX-8350,1x AMD FX-8320,1x AMD FX-8300, 2x AMD FX-6300,2x AMD FX-4300, 3x AMD FX-8150, 2x AMD FX-8120 125 and 95W, AMD X2 555 BE, AMD x4 965 BE C2 and C3, AMD X4 970 BE, AMD x4 975 BE, AMD x4 980 BE, AMD X6 1090T BE, AMD X6 1100T BE, A10-7870K, Athlon 845, Athlon 860K,AMD A10-7850K, AMD A10-6800K, A8-6600K, 2x AMD A10-5800K, AMD A10-5600K, AMD A8-3850, AMD A8-3870K, 2x AMD A64 3000+, AMD 64+ X2 4600+ EE, Intel i7-980X, Intel i7-2600K, Intel i7-3770K,2x i7-4770K, Intel i7-3930KAMD Cinebench R10 challenge AMD Cinebench R15 thread Intel Cinebench R15 thread

  3. #703
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by Chumbucket843 View Post
    using die area as an estimation for static power can be very misleading. leakage increases linearly with transistors and exponentially with drive current.
    Ok, then let me rephrase it: double pumping uses fewer transistors than a monolithic implementation.

    Quote Originally Posted by Chumbucket843 View Post
    i understand the concept of pipelining.

    if speed is critical you might want to use a pulsed latch.
    No prob. The example was also targeted at other possibly interested readers. Unfortunately it didn't work in one case.

    Quote Originally Posted by savantu View Post
    You contradict yourself in your own argument.
    Wrong. In a pipeline like those discussed here there is a lot of logic consisting of millions of transistors.

    In one pipeline stage there are several layers of many transistors creating the desired output of e.g. an adder. The longest possible delay path is the well known critical path.

    Now you can have many fast or fewer slow transistors in such a circuit at the same clock frequency. The slower ones won't do as much per cycle and might need another pipe stage to get their work done. In my example I said that I would cut the multiplier circuit and do half the work per pipeline stage. There is no need to use faster and more power hungry transistors. Yet you could apply a faster clock frequency. It's as simple as that.
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  4. #704
    Xtreme Member
    Join Date
    Aug 2004
    Posts
    210
    Quote Originally Posted by -Boris- View Post
    What? But I want 128Bit 3DNow! Pro++.

    And there is still wasted space, if they had the resources for a complete overhaul on the layout they would save some, and possibly increase performance.
    The wasted space is used up now for Llano, maybe SSE4.1, maybe even AVX, perhaps 3DNow!++.

  5. #705
    Xtreme Member
    Join Date
    Aug 2004
    Posts
    210
    Quote Originally Posted by JF-AMD View Post
    Wow, that is pretty much a description of the whole bulldozer architecture. No need to double everything, only the things that matter.
    Haha, but that was initially meant to explain Sandy's FPU ;-)

    But perhaps that's a nice idea for some presentation slides:
    double xy unit because ... <add great performance gain>
    No doubled xz unit, because ... <add minor gains>.
    .
    .

    At least if there are several doubled units ;-)

    In any case, everybody is looking forward to Hot Chips.

    One additional question to the die shots topic:
    May we see at least a partial die shot, i.e. of one BD module? As in the case of Llano, where there are detailed die shots of the "K10.6" core, but not of the whole Llano die including the GPU.

    thanks

    Opteron

  6. #706
    Xtreme Cruncher
    Join Date
    Apr 2005
    Location
    TX, USA
    Posts
    898
    Quote Originally Posted by savantu View Post
    You contradict yourself in your own argument.
    A little late on this since I'm a slow writer at times, but I thought I'd still throw in my 2 cents.

    He doesn't; you have the contexts mixed up
    (this is a topic that crosses the logic/physical design and process boundaries).

    "Faster transistors" refers to using different physical sizing/doping to achieve a lower propagation delay (through higher drive strength, etc.), which is independent of the pipeline clock speed. The reverse dependency does hold, however: the pipeline clock speed depends on the speed of the transistors. The clock period is bounded below by the worst-case delay path through the logic transistors plus the setup time of the pipeline latches (without going into latch hold time and clock jitter/offset).

    In his hypothetical scenario, he validly describes a situation where the actual speed (process parameters) of the logic-transistors does not have to be modified, while the clock-speed of the pipeline/unit as a whole can be raised.
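
    A minimal sketch of that timing bound, assuming the simplified model above (critical-path logic delay plus latch setup time, ignoring hold time and jitter). The delay values of 400 ps, 200 ps and 50 ps are made-up numbers for illustration, not measurements of any real circuit.

```python
def max_clock_ghz(stage_logic_delays_ps, setup_ps):
    """Highest clock frequency in GHz for a pipeline with the given stages.

    stage_logic_delays_ps -- worst-case logic delay of each stage, in ps
    setup_ps              -- setup time of the pipeline latches, in ps
    The slowest stage sets the clock period for the whole pipeline.
    """
    critical_ps = max(stage_logic_delays_ps) + setup_ps
    return 1000.0 / critical_ps  # 1000 ps period corresponds to 1 GHz


if __name__ == "__main__":
    # Same transistors, same total logic delay: one stage vs. two half stages.
    print(max_clock_ghz([400], setup_ps=50))       # single stage
    print(max_clock_ghz([200, 200], setup_ps=50))  # cut into two stages
```

    With these numbers the single stage tops out at about 2.22 GHz and the split version at 4.0 GHz. The transistors are unchanged, and the gain is less than 2x only because the latch setup time is paid once per stage.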
    Last edited by rcofell; 08-11-2010 at 01:16 PM. Reason: Revise revise revise.



  7. #707
    Xtreme Enthusiast
    Join Date
    Dec 2009
    Posts
    846
    Quote Originally Posted by Opteron146 View Post
    One additional question to the die shots topic:
    May we see at least a partial die shot, i.e. of one BD module?
    Nope, there are no die shots in the presentation.
    While I work for AMD, my posts are my own opinions.

    http://blogs.amd.com/work/author/jfruehe/

  8. #708
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by Hans de Vries View Post
    If a design is "feedback free" then hyperpipelining is a fully automated
    process. Synopsys Design Compiler has had such features for years.
    The software has to find the locations to which the signals have
    propagated after a 1/2 cycle (or 1/N cycle in general) and then place
    the flipflops there which will constitute the intermediate pipeline stages.

    Intel (that is, the process guys) has probably developed its own
    software specifically adapted to Intel's own process physics model.


    "Feedback free" is when the logic doesn't need to know what it just
    calculated in the previous cycle. The P4 ALU was not feedback free
    because it needs results from the previous cycle:

    non-hyperpipelined:

    ----> C = A + B;
    ----
    ----> E = C + D;
    ----

    hyperpipelined:

    ----> C = A + B;
    ----> E = C + D;
    ---->

    The problem is that the result C is not yet known after a 1/2 cycle,
    so you have to design logic which works with the intermediate result
    instead. There exist tricks which do so. In general these tricks are
    different for different functions and sometimes there are no tricks.

    Hyperpipelining for circuits which are not "feedback free" is a science
    in its own right. The risk is that all these tricks blow up the size of
    the circuits as well as their power consumption.



    Regards, Hans
    I specifically bolded the most important part of your post. This is why I mentioned multiplication/division.

  9. #709
    Xtreme Addict
    Join Date
    Nov 2007
    Location
    Illinois
    Posts
    2,095
    Quote Originally Posted by kl0012 View Post
    I specifically bolded the most important part of your post. This is why I mentioned multiplication/division.
    Do you actually understand what he is saying? It seems as if you're retreating behind your assumption that only intel employees understand what's going on.
    E7200 @ 3.4 ; 7870 GHz 2 GB
    Intel's atom is a terrible chip.

  10. #710
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by cegras View Post
    Do you actually understand what he is saying? It seems as if you're retreating behind your assumption that only intel employees understand what's going on.
    Yes, I fully understand what he said. But when designing an efficient (from a space/power perspective) multiplier, you probably can't avoid an iterative part of the multiplier tree, which is hard to pipeline.
    Last edited by kl0012; 08-11-2010 at 10:47 PM.

  11. #711
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by kl0012 View Post
    Yes, I fully understand what he said. But when designing an efficient (from a space/power perspective) multiplier, you probably can't avoid an iterative part of the multiplier tree, which is hard to pipeline.
    FP multipliers have been fully pipelined for many years now...

    The CRAY-1 had fully pipelined 64 bit FP multipliers in 1975
    http://en.wikipedia.org/wiki/Cray-1

    The Intel i860 had an on chip fully pipelined 64 bit FP multiplier in 1989
    http://en.wikipedia.org/wiki/Intel_i860

    Xilinx Virtex-7 family will have an FPGA with almost 4000 pipelined
    multiplier blocks each with a size of 25x18 bit with which you can build
    all kinds of multipliers in any way you want.
    http://www.xilinx.com/support/docume...s_Overview.pdf


    Regards, Hans

  12. #712
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by Dresdenboy View Post
    The first version of the chart (said to be wrong in 1)) contained "AVX LO" and "AVX HI" units, also drawn at the same width as the 128 bit units. Maybe they're even not using double pumping but other techniques like wavefronts (less likely).
    "Wavefronts" is actually called "wave pipelining" (my brain mixed this up a bit; "wavefront" is used in some GPUs). With it, there isn't even a need to add latches to get the pipelined behaviour:
    http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  13. #713
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by Hans de Vries View Post
    FP multipliers have been fully pipelined for many years now...

    The CRAY-1 had fully pipelined 64 bit FP multipliers in 1975
    http://en.wikipedia.org/wiki/Cray-1

    The Intel i860 had an on chip fully pipelined 64 bit FP multiplier in 1989
    http://en.wikipedia.org/wiki/Intel_i860

    Xilinx Virtex-7 family will have an FPGA with almost 4000 pipelined
    multiplier blocks each with a size of 25x18 bit with which you can build
    all kinds of multipliers in any way you want.
    http://www.xilinx.com/support/docume...s_Overview.pdf

    Regards, Hans
    That's about the actual implementation of the multiplier (as I said above). You may choose a parallel implementation, and then you have no problem splitting it into further stages. But such an implementation consumes much more space and power. On the other hand, you may choose an iterative implementation and still get a fully pipelined multiplier if you can implement an iterative stage in one cycle, but this specific stage cannot be pipelined further (in case you want to raise the frequency). I know nothing about the actual implementation of multipliers in Intel's CPUs, but my point is that "pipelineability" depends on specific design decisions.
    Last edited by kl0012; 08-12-2010 at 01:12 AM.

  14. #714
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by kl0012 View Post
    That's about the actual implementation of the multiplier (as I said above). You may choose a parallel implementation, and then you have no problem splitting it into further stages. But such an implementation consumes much more space and power. On the other hand, you may choose an iterative implementation and still get a fully pipelined multiplier if you can implement an iterative stage in one cycle, but this specific stage cannot be pipelined further (in case you want to raise the frequency). I know nothing about the actual implementation of multipliers in Intel's CPUs, but my point is that "pipelineability" depends on specific design decisions.
    Intel's multipliers, as well as those of almost everybody else, are variations of
    the Booth-encoded Wallace tree multiplier.

    These are generally pipelined except for the small ones which operate
    within a single cycle.
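
    For readers who have not seen one, here is a simplified word-level model of such a multiplier: radix-4 Booth recoding to halve the number of partial products, followed by layers of carry-save adders that reduce them to two rows. It is only a sketch of the general technique on unsigned operands, with made-up function names, and says nothing about how any particular CPU lays it out.

```python
def booth_radix4_digits(b: int, bits: int) -> list:
    """Recode unsigned b (< 2**bits) into radix-4 Booth digits in {-2..2}."""
    digits = []
    shifted = b << 1  # implicit 0 below the least significant bit
    # One extra group covers the zero "sign" bit of an unsigned operand.
    for i in range(bits // 2 + 1):
        window = (shifted >> (2 * i)) & 0b111  # bits b[2i+1], b[2i], b[2i-1]
        digits.append((window & 1) + ((window >> 1) & 1) - 2 * (window >> 2))
    return digits


def carry_save_add(x: int, y: int, z: int, mask: int):
    """3:2 compressor: x + y + z == s + c (mod mask + 1), no carry ripple."""
    s = x ^ y ^ z
    c = (((x & y) | (x & z) | (y & z)) << 1) & mask
    return s, c


def wallace_reduce(rows, mask: int) -> list:
    """Reduce partial-product rows to at most two, three rows at a time."""
    rows = list(rows)
    while len(rows) > 2:
        groups = len(rows) // 3
        nxt = []
        for g in range(groups):
            x, y, z = rows[3 * g:3 * g + 3]
            nxt.extend(carry_save_add(x, y, z, mask))
        nxt.extend(rows[3 * groups:])  # rows left over pass to the next layer
        rows = nxt
    return rows


def booth_wallace_multiply(a: int, b: int, bits: int) -> int:
    """Multiply two unsigned `bits`-wide values; result is 2*bits wide."""
    mask = (1 << (2 * bits)) - 1
    # Negative Booth multiples become two's complement after masking.
    rows = [((d * a) << (2 * i)) & mask
            for i, d in enumerate(booth_radix4_digits(b, bits))]
    rows = wallace_reduce(rows, mask)
    return sum(rows) & mask  # final carry-propagate addition


if __name__ == "__main__":
    print(booth_wallace_multiply(200, 123, 8) == 200 * 123)
```

    Each reduction layer depends only on the layer before it, never on its own output, so the tree is "feedback free" in the sense used earlier in the thread and registers can in principle be placed between layers.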


    Regards, Hans

  15. #715
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by Hans de Vries View Post
    Intel's multipliers, as well as those of almost everybody else, are variations of
    the Booth-encoded Wallace tree multiplier.

    These are generally pipelined except for the small ones which operate
    within a single cycle.


    Regards, Hans
    I don't see how your post contradicts mine. You call it a Wallace tree, I call it a multiplier tree. Anyway, Wallace tree arrays may have different sizes: if I remember correctly, the Pentium 4 multiplication array is smaller than 64x64, so it needs additional iterations to execute DP and EP multiplications. BTW, in all multipliers the Wallace tree is a stage by itself, and it is not usual to break it into additional stages.
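
    To illustrate what "additional iterations" means, here is a minimal sketch of a multiplier that reuses an array narrower than the operands over several passes. The 64x16 array shape is a made-up example and is not a claim about the Pentium 4's actual dimensions; the function name is hypothetical as well.

```python
def iterative_multiply(a: int, b: int, width: int = 64, array_cols: int = 16):
    """Multiply two unsigned `width`-bit values with a width x array_cols array.

    Returns (product, passes). Each pass sends one array_cols-wide slice of
    the multiplier through the array and adds the result to an accumulator.
    """
    slice_mask = (1 << array_cols) - 1
    acc = 0
    passes = 0
    for shift in range(0, width, array_cols):
        chunk = (b >> shift) & slice_mask  # multiplier bits handled this pass
        # One trip through the small array; acc is fed back into the next
        # pass, which is the loop that makes this stage hard to cut further.
        acc += (a * chunk) << shift
        passes += 1
    return acc, passes


if __name__ == "__main__":
    product, passes = iterative_multiply(0xDEADBEEFCAFEF00D, 0x0123456789ABCDEF)
    print(hex(product), passes)
```

    A full 64x64 array would finish in one pass at the cost of roughly four times the array area under this assumption, which is the space/power versus pipelineability trade-off being argued about.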
    Last edited by kl0012; 08-12-2010 at 07:37 AM.

  16. #716
    Xtreme Enthusiast
    Join Date
    Feb 2009
    Posts
    800
    Quote Originally Posted by JF-AMD View Post
    yes
    That's a promise!

    There weren't any die shots of your Evergreen graphics chips (although looking back, you guys have always posted die shots for processors).

  17. #717
    Xtreme Enthusiast
    Join Date
    Dec 2009
    Posts
    846
    Well, Bulldozer is my product and I am the one in charge of disclosures. We have a cadence for when we release info and things like die shots, cache sizes, clock speeds, pricing, launch dates, benchmarks, etc. are all things that help the other guys, so we have no desire to disclose them too early.

    As for GPU, I am not even in the same country as that part of the company, so I really can't comment on them.
    While I work for AMD, my posts are my own opinions.

    http://blogs.amd.com/work/author/jfruehe/

  18. #718
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Quote Originally Posted by JF-AMD View Post
    Well, Bulldozer is my product and I am the one in charge of disclosures. We have a cadence for when we release info and things like die shots, cache sizes, clock speeds, pricing, launch dates, benchmarks, etc. are all things that help the other guys, so we have no desire to disclose them too early.
    I'm sensing no die shot until the Nov 9 Analyst Day, at which point the launch window will be narrowed to "H2 2011" and cache size will be announced (as the die will give it away), no pricing, clock speeds or benchmarks until 2011.

  19. #719
    Xtreme Enthusiast
    Join Date
    Oct 2007
    Location
    Singapore
    Posts
    970
    Quote Originally Posted by terrace215 View Post
    I'm sensing no die shot until the Nov 9 Analyst Day, at which point the launch window will be narrowed to "H2 2011" and cache size will be announced (as the die will give it away), no pricing, clock speeds or benchmarks until 2011.

    ok noted....thanks for the info
    Main Rig:
    Processor & Motherboard:AMD Ryzen5 1400 ' Gigabyte B450M-DS3H
    Random Access Memory Module:Adata XPG DDR4 3000 MHz 2x8GB
    Graphic Card:XFX RX 580 4GB
    Power Supply Unit:FSP AURUM 92+ Series PT-650M
    Storage Unit:Crucial MX 500 240GB SATA III SSD
    Processor Heatsink Fan:AMD Wraith Spire RGB
    Chasis:Thermaltake Level 10GTS Black
