
Thread: AMD Tapes Out First "Bulldozer" Microprocessors.

  1. #126
    Xtreme Enthusiast
    Join Date
    Oct 2008
    Posts
    678
    Quote Originally Posted by savantu View Post


    Where did they mention 12.5%?
    Another thing to keep in mind is that those percentage numbers vary greatly depending on what you are calculating against.

    50% extra core space could be 12% extra die space. There are lots of things on a die other than cores.

  2. #127
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by savantu View Post
    Who said that?
    At most, they were the first to successfully implement it in a commercial large volume product.
    That's correct. But you originally wrote:
    Quote Originally Posted by savantu
    Maybe instead of amateur sources and interpretations, we should look into real technical articles, written by the people who invented these technologies and published at conferences and in tech journals.
    and posted a paper made by Intel.

    Quote Originally Posted by savantu View Post
    There is no misunderstanding. They are active at the same time. Even with the threads being decoded and retired in alternating cycles, it doesn't say anything about the parallel execution. Since we're talking, after all, about a shared resource, it is logical that some arbitration exists at different levels.

    Says who? You're basically narrowing the definition of SMT so you can build an argument on a corner case which would go against the new definition. How do you want to have a discussion when you're changing the definition of things so it fits your stance?
    Talk about a logical fallacy the size of Everest.
    I wrote: "Since SMT actually is about simultaneously issuing instructions of multiple threads to a set of EUs to make better use of them, [...]"

    And it's not only me who points to the simultaneous execution of instructions from multiple threads on a set of EUs:
    Simultaneous multi-threading is the ability of a single physical processor to simultaneously dispatch instructions from more than one hardware thread context at the same time. Simultaneous multi-threading is designed to produce performance benefits in commercial environments and for workloads that have a high Cycles Per Instruction (CPI) count.
    http://publib.boulder.ibm.com/infoce...2/iphb2smt.htm

    Maybe you use a different definition of SMT? Could you please post it, so that I can see if there is a difference?

    Quote Originally Posted by savantu View Post
    Both cache hit rates and branch predictors are in the mid-90% range for modern CPUs. Yet they struggle to get above 1.5 IPC. The reason is simple: ILP is truly missing.
    Most code is written in a sequential manner and data dependencies are the norm.
    But we left the times of accumulator architectures behind us. There are 16 GPRs and 16 XMM/YMM registers. You can do a lot of calculations using these that are independent of each other over a span of a few instructions. This is what OoO execution can exploit.

    The low branch or cache miss rates don't mean that they only have a small effect on IPC. For branch prediction:
    Let's assume:
    14 cycles of branch misprediction penalty
    avg. IPC of 3 (with 4 decode/issue etc.) as long as there is no misprediction - CPI = 1/IPC = 0.33
    BP hit rate = 95%
    percentage of branch insts: 20%

    CPI = 0.33 + 0.2*0.05*14 = 0.47 or an avg IPC of 2.1

    Now let's add the cache:
    L1D$ hit rate: 95%
    miss latency: 9 cycles (L2 hit)
    percentage of mem accesses: 30%

    CPI = 0.33 + 0.3*0.05*9 = 0.465 or an avg IPC of 2.15

    Now both effects together:
    CPI = 0.33 + 0.2*0.05*14 + 0.3*0.05*9 = 0.605 or an avg IPC of 1.65

    I still left out L3, mem and other effects. But this should be enough to see the effect of a few percent of miss rates.
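    If you want to play with these assumptions, the same back-of-the-envelope model fits in a few lines of Python (the rates and penalties below are just the assumed numbers from above, not measurements):

        # Simple stall-accounting model: CPI = base CPI + per-instruction stall cycles
        def effective_ipc(base_cpi=0.33,
                          branch_frac=0.20, bp_miss=0.05, bp_penalty=14,
                          mem_frac=0.30, l1_miss=0.05, l1_penalty=9):
            cpi = base_cpi
            cpi += branch_frac * bp_miss * bp_penalty   # branch misprediction stalls
            cpi += mem_frac * l1_miss * l1_penalty      # L1D misses that hit in L2
            return 1.0 / cpi

        print(round(effective_ipc(), 2))                # ~1.65, matching the combined case above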

    Another reason for more EUs than the average use requires is if you have a checkpointing and replaying architecture. Sometimes speculation goes wrong and it's good to be able to replay instructions quickly.

    Quote Originally Posted by savantu View Post
    I am sure future CPUs will include run-ahead and advanced replay uarchs. But it is possible that OoO execution will hinder such approaches. The first CPU with scout threads (run-ahead execution), Sun Rock, was a major failure performance-wise.
    I don't think that OoO execution would be a hindrance. It's the actual execution of code, after all. But it stalls sometimes, which leaves room for more advanced techniques. And even if not, lower-priority handling of runahead execution (e.g. executing future mem ops as prefetches) would work.

    Sun Rock added more to the design than just scout threads, and there is not just one way to implement runahead execution. It's like saying: "Car A will be a failure because it uses 4 wheels and a motor like car B, which failed." There are other statements:
    An efficient runahead execution processor employing these techniques executes only 6.2% more instructions than a conventional out-of-order execution processor but achieves 22.1% higher Instructions Per Cycle (IPC) performance.
    From http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  3. #128
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Dresdenboy View Post
    That's correct. But you originally wrote:
    and posted a paper made by Intel.
    The two weren't connected; I should have been more explicit.

    Quote Originally Posted by Dresdenboy View Post
    I wrote: "Since SMT actually is about simultaneously issuing instructions of multiple threads to a set of EUs to make better use of them, [...]"

    And it's not only me who points to the simultaneous execution of instructions from multiple threads on a set of EUs:

    http://publib.boulder.ibm.com/infoce...2/iphb2smt.htm

    Maybe you use a different definition of SMT? Could you please post it, so that I can see if there is a difference?
    Now we're in perfect agreement. My comment was directed at the emphasis on the limitation of the decode stage in the P4.

    The definition that I would use is :

    Simultaneous multithreading (SMT) is a variation on multithreading that uses the resources of a multiple-issue, dynamically-scheduled processor to exploit TLP at the same time it exploits ILP. The key insight that motivates SMT is that modern multiple-issue processors often have more functional unit parallelism available than a single thread can effectively use. Furthermore, with register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to the dependences among them; the resolution of the dependences can be handled by the dynamic scheduling capability.
    Computer Architecture: A Quantitative Approach, 3rd Edition, by John L. Hennessy and David A. Patterson.


    Quote Originally Posted by Dresdenboy View Post
    But we left the times of accumulator architectures behind us. There are 16 GPRs and 16 XMM/YMM registers. You can do a lot of calculations using these that are independent of each other over a span of a few instructions. This is what OoO execution can exploit.

    The low branch or cache miss rates don't mean that they only have a small effect on IPC. For branch prediction:
    Let's assume:
    14 cycles of branch misprediction penalty
    avg. IPC of 3 (with 4 decode/issue etc.) as long as there is no misprediction - CPI = 1/IPC = 0.33
    BP hit rate = 95%
    percentage of branch insts: 20%

    CPI = 0.33 + 0.2*0.05*14 = 0.47 or an avg IPC of 2.1

    Now let's add the cache:
    L1D$ hit rate: 95%
    miss latency: 9 cycles (L2 hit)
    percentage of mem accesses: 30%

    CPI = 0.33 + 0.3*0.05*9 = 0.465 or an avg IPC of 2.15

    Now both effects together:
    CPI = 0.33 + 0.2*0.05*14 + 0.3*0.05*9 = 0.605 or an avg IPC of 1.65

    I still left out L3, mem and other effects. But this should be enough to see the effect of a few percent of miss rates.

    Another reason for more EUs than the average use requires is if you have a checkpointing and replaying architecture. Sometimes speculation goes wrong and it's good to be able to replay instructions quickly.

    By your examples, IPC is 1.5-2 at best out of a theoretical maximum of 4. SMT could really help in such a scenario.

    Quote Originally Posted by Dresdenboy View Post
    I don't think that OoO execution would be a hindrance. It's the actual execution of code, after all. But it stalls sometimes, which leaves room for more advanced techniques. And even if not, lower-priority handling of runahead execution (e.g. executing future mem ops as prefetches) would work.

    Sun Rock added more to the design than just scout threads, and there is not just one way to implement runahead execution. It's like saying: "Car A will be a failure because it uses 4 wheels and a motor like car B, which failed." There are other statements:

    From http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
    I know that Rock added far more, it was a quick example.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  4. #129
    I am Xtreme
    Join Date
    Dec 2007
    Posts
    7,750
    I think we can agree that SMT offers more than a 1% IPC gain per 1% increase in core size,
    and I think we can also agree that some people are very negative about CMT, even though it's not even out yet.

  5. #130
    Xtreme Mentor
    Join Date
    Jul 2008
    Location
    Shimla , India
    Posts
    2,631
    I would like to point out that Hyper-Threading does not really run two threads in parallel; it actually means that the OS is free to schedule two threads and run either one when the situation is right for whichever thread.

    E.g.:

    T1 and T2 running on CPU
    T1 and T2 scheduled to run by the OS on an HTT-enabled CPU
    T1 has the first run
    T1 uses R1, R2 AND R3
    R1, R2 and R3 locked by T1 since T1 is doing work
    T1 requests time on the CPU, leading to a timeout
    T2 gets CPU time and the resources it needs, like R1, R2 OR R3

    This is how HTT runs, simple as that.
    Coming Soon

  6. #131
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by savantu View Post
    Both cache hit rates and branch predictors are in the mid-90% range for modern CPUs. Yet they struggle to get above 1.5 IPC. The reason is simple: ILP is truly missing.
    Most code is written in a sequential manner and data dependencies are the norm.
    This is simply not true. ILP is limited to about 5 IPC on average. The majority of ILP is limited by branch/jump prediction.

    Just because instructions are sequential doesn't mean RAW dependencies exist.

  7. #132
    Banned
    Join Date
    Sep 2009
    Posts
    97
    I'm pretty sure Intel is already working on their bulldozer.

  8. #133
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    Quote Originally Posted by LesGrossman View Post
    I'm pretty sure Intel is already working on their bulldozer.
    If you mean their next big *new* design, then yes. It's called Haswell.

  9. #134
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Quote Originally Posted by informal View Post
    If you mean their next big *new* design, then yes. It's called Haswell.
    Sandy Bridge (Socket R) will be more than *new* enough for BD.

  10. #135
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    Quote Originally Posted by terrace215 View Post
    Sandy Bridge (Socket R) will be more than *new* enough for BD.
    We'll have to wait and see what kind of numbers both pull. With 16 BD cores, AMD is pretty much covered in the server segment. Also, I have no doubt SB will be fast.

  11. #136
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Chumbucket843 View Post
    This is simply not true. ILP is limited to about 5 IPC on average. The majority of ILP is limited by branch/jump prediction.

    Just because instructions are sequential doesn't mean RAW dependencies exist.
    Are you claiming current CPUs get 5 IPC on average?

    Somebody should tell the CPU designers that all the unbelievable lengths they went to in extracting ILP were in vain.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  12. #137
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Quote Originally Posted by informal View Post
    With 16 BD cores, AMD is pretty much covered in the server segment.
    The number of cores doesn't tell you much of anything, in itself.

    If it did, AMD would simply cut clocks more, make a "2x Magny Cours" part, and be "covered in the server segment".

  13. #138
    Xtreme Enthusiast
    Join Date
    Oct 2007
    Location
    Hong Kong
    Posts
    526
    Quote Originally Posted by terrace215 View Post
    Sandy Bridge (Socket R) will be more than *new* enough for BD.
    Adding memory channels is more than new?

  14. #139
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    Quote Originally Posted by terrace215 View Post
    The number of cores doesn't tell you much of anything, in itself.

    If it did, AMD would simply cut clocks more, make a "2x Magny Cours" part, and be "covered in the server segment".
    Except they can't scale that hypothetical monster past 2P if they do it your way (just one of the many cons if they do this). This way, we have 16 *improved* cores, with new power-gating features (a new turbo too), AVX support, and a core design that is fully modular, scalable, and die-area efficient. This design will also serve as the basis for a future server Fusion part, since the big shared FPU can be replaced by a future hybrid GPU part, as AMD themselves have stated.

  15. #140
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Quote Originally Posted by ajaidev View Post
    I would like to point out that Hyper-Threading does not really run two threads in parallel; it actually means that the OS is free to schedule two threads and run either one when the situation is right for whichever thread.

    E.g.:

    T1 and T2 running on CPU
    T1 and T2 scheduled to run by the OS on an HTT-enabled CPU
    T1 has the first run
    T1 uses R1, R2 AND R3
    R1, R2 and R3 locked by T1 since T1 is doing work
    T1 requests time on the CPU, leading to a timeout
    T2 gets CPU time and the resources it needs, like R1, R2 OR R3

    This is how HTT runs, simple as that.
    Erm, no, it isn't.

    At any point in time, T1 may be running in some execution units while AT THE SAME TIME, T2 is running in the others.

    See virtually any of Intel's papers on Nehalem Hyperthreading...

    http://software.intel.com/en-us/arti...ng-technology/

    Note the little diagram of the 4-wide Nehalem core.

    And this:

    The execution pipeline of processors based on Intel® Core™ microarchitecture is four instructions wide, meaning that it can execute up to four micro-operations per clock cycle. As shown in Figure 3, however, the software thread being executed often does not have four instructions eligible for simultaneous execution. Common reasons for fewer than four instructions per clock being retired include dependencies between the output of one instruction and the input of another, as well as latency waiting for data to be fetched from cache or memory.

    Intel HT Technology improves performance through increased instruction level parallelism by having two threads with independent instruction streams, eliminating data dependencies between threads and increasing utilization of the available execution units. This effect typically increases the number of instructions executed in a given amount of time within a core, as shown in Figure 3. The impact of this greater efficiency is experienced by users as higher throughput (since more work gets completed per clock cycle) and higher performance per watt (since fewer idle execution units consume power without contributing to performance). In addition, when one thread has a cache miss, branch mispredict, or any other pipeline stall, the other thread continues processing instructions at nearly the same rate as a single thread running on the core. Intel HT Technology augments other advanced architectural features, higher clock speeds, and additional cores with a capability that is relatively inexpensive in terms of space on the silicon and production cost.
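    To make that concrete, here is a deliberately oversimplified toy model (my own illustration with assumed numbers, not anything from Intel's article): a 4-wide core and threads that can each only find about 2 independent uops per cycle. A single thread leaves half the issue slots empty; a second thread's independent stream fills them:

        # Toy issue-slot model (illustrative assumptions, not real Nehalem scheduling)
        ISSUE_WIDTH = 4        # uops the core can issue per cycle
        PER_THREAD_ILP = 2     # independent uops a single thread can typically supply

        def uops_per_cycle(active_threads):
            # total throughput is capped by the core's issue width
            return min(ISSUE_WIDTH, active_threads * PER_THREAD_ILP)

        print(uops_per_cycle(1))   # 2 -> half the slots sit idle
        print(uops_per_cycle(2))   # 4 -> SMT fills the otherwise idle slots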

  16. #141
    Registered User
    Join Date
    Jul 2010
    Posts
    11
    Quote Originally Posted by Manicdan View Post
    Why would CPUs be built in such a way that they are never used to maximum capacity?
    Why are highways built in such a way that 80% of the time, half of their capacity is unused? You could halve the number of lanes in my local interstate highway, and for 140 hours a week, it would be perfectly fine. For the other ~30 hours though, you'll be very happy to have those 3rd and 4th lanes available for use! Engineers design to a maximum throughput number, but in the majority of circumstances there just won't be enough traffic to use 100% capacity.

    If algorithms were perfectly parallel, memory accesses were perfectly predictable, and compilers were perfect at scheduling every integer and floating point pipeline to have one computation per cycle, then it would be possible to build a CPU with perfect utilization. But we don't live in such a world, and never will. For the record, x86-64 processors tend to hover around an IPC of 0.8-1.5 on optimized benchmark code.

    IPC chart for selected SPEC 2006 components: http://www.marss86.org/index.php/File:Ipc_chart.png
    The above chart shows the accuracy of the MARSS simulator by comparing IPCs (instructions committed per cycle) obtained from simulation against the IPCs realized in executing the same benchmark programs on two real implementations, an AMD Athlon X2 and an Intel Core 2 Duo, for SPEC 2006 benchmarks. These IPC values are for user-space execution only and do not include any kernel simulation, because the kernel execution paths are different on the MARSS VM and on the test machines used to gather the above statistics. All the benchmarks are run from start to completion for the simulated runs and the runs on the real machines. We have used the Linux kernel's performance counters to get the IPCs realized on the actual hardware.
    Last edited by intangir; 07-20-2010 at 10:33 AM.

  17. #142
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by savantu View Post
    Are you claiming current CPUs get 5 IPC on average?

    Somebody should tell the CPU designers that all the unbelievable lengths they went to in extracting ILP were in vain.
    Nope. I am claiming that ILP is limited to around 5 IPC for most applications, assuming realistic branch prediction.

  18. #143
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Chumbucket843 View Post
    Nope. I am claiming that ILP is limited to around 5 IPC for most applications, assuming realistic branch prediction.
    Let me see if I get it right: you want to say that between two branches there are, on average, 5 instructions. If so, I agree, and this is consistent with what I've read in the literature.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  19. #144
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by qcmadness View Post
    Adding memory channels is more than new?
    ?? Is SB getting more memory channels? :P

    Anyway, there are enough innovations in it to make it a tock. And if some are confirmed, like the L3 latency, it will rock (not in the Sun Rock sense).
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  20. #145
    Xtreme Addict
    Join Date
    Jul 2007
    Posts
    1,488
    After thinking about it for a bit I realize that this IPC and ILP discussion might be taking the wrong direction in relation to BD. I'll elaborate more later but first a question: what's the average IPC of Nehalem?

  21. #146
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    ^^ I think I might do some experiments on that. Nothing fancy though.

    Quote Originally Posted by savantu View Post
    Let me see if I get it right: you want to say that between two branches there are, on average, 5 instructions. If so, I agree, and this is consistent with what I've read in the literature.
    I think you mean basic blocks, and no, that's not what I am saying. To reach 5 IPC you must predict the control flow instructions so that you can start working on the next block; otherwise your pipeline will be filled with NOPs. Control flow must be predicted correctly for ILP to reach 5 IPC. Increasing ILP within a block is a task for dynamic execution.
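    To put rough numbers on that (toy arithmetic only, reusing the 14-cycle redirect penalty and the ~5 instructions per block assumed earlier in the thread, on a 4-wide machine):

        # Toy model: blocks of ~5 instructions ending in a branch, 4-wide issue,
        # 14-cycle redirect penalty on every mispredicted branch (assumed numbers)
        def ipc(block_len=5, width=4, penalty=14, predict_rate=1.0):
            cycles_per_block = block_len / width + (1 - predict_rate) * penalty
            return block_len / cycles_per_block

        for rate in (1.0, 0.95, 0.90):
            print(rate, round(ipc(predict_rate=rate), 2))   # 4.0, 2.56, 1.89

    Even a few percent of mispredictions keeps you well away from the 4-5 IPC ceiling, which is the point about control flow above.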

  22. #147
    Xtreme Mentor
    Join Date
    May 2008
    Location
    cleveland ohio
    Posts
    2,879
    Can we get a die shot at least? Llano is the only 32nm die we've seen,
    no Ontario or Bulldozer yet.
    HAVE NO FEAR!
    "AMD fallen angel"
    Quote Originally Posted by Gamekiller View Post
    You didn't get the memo? 1 hour 'Fugger time' is equal to 12 hours of regular time.

  23. #148
    I am Xtreme
    Join Date
    Dec 2007
    Posts
    7,750
    Quote Originally Posted by intangir View Post
    Why are highways built in such a way that 80% of the time, half of their capacity is unused? You could halve the number of lanes in my local interstate highway, and for 140 hours a week, it would be perfectly fine. For the other ~30 hours though, you'll be very happy to have those 3rd and 4th lanes available for use! Engineers design to a maximum throughput number, but in the majority of circumstances there just won't be enough traffic to use 100% capacity.

    If algorithms were perfectly parallel, memory accesses were perfectly predictable, and compilers were perfect at scheduling every integer and floating point pipeline to have one computation per cycle, then it would be possible to build a CPU with perfect utilization. But we don't live in such a world, and never will. For the record, x86-64 processors tend to hover around an IPC of 0.8-1.5 on optimized benchmark code.

    IPC chart for selected SPEC 2006 components: http://www.marss86.org/index.php/File:Ipc_chart.png
    You missed the point. I understand "some of the time" or "not very often", but saying "never" is an absolute that was way too limiting. No one builds something for "never", but they do build things for expected bottlenecks.

  24. #149
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Solus Corvus View Post
    After thinking about it for a bit I realize that this IPC and ILP discussion might be taking the wrong direction in relation to BD. I'll elaborate more later but first a question: what's the average IPC of Nehalem?
    A bit higher than Core, and it depends on the app. David Kanter tested it on physics calculations and the best result was 2 IPC.

    http://www.realworldtech.com/page.cf...0510142143&p=3

    According to the author, anything above 1 is respectable. Chumbucket843 talks nonsense with his 5 IPC. Not even Itanium reaches that, and it's based on static compiling and the EPIC (Explicitly Parallel Instruction Computing) architecture. EPIC was designed to solve the ILP problem by doing optimizations at compile time and inserting hints into the code so the CPU knows what to do next.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  25. #150
    Registered User
    Join Date
    Jul 2010
    Posts
    11
    Quote Originally Posted by Particle View Post
    You don't appear to understand what is really going on yourself. There is only one execution unit. You can't have two threads with instructions that compete for the same resources executing on the same clock cycle in the same execution unit. That's the end of the story. HT is, as we've been claiming all along, just a way to maximize the utilization of the core's resources by scheduling work where there would normally be none being done (misses and whatnot). It does not magically let you execute two threads at the same time the way two real cores do.
    Nope! You seem to be assuming there is only one execution unit per core. When a microarchitect says execution unit, he means an ALU or FPU. Each core of any out-of-order x86 CPU contains multiple integer execution units and multiple floating-point execution units. For example, the first out-of-order x86 chip, the Pentium Pro, started with two ALUs and one FPU. While each individual execution unit can only be executing one instruction at a time, there are multiple per core, and a Nehalem core can easily schedule two instructions from two different threads on two of its execution units in the same cycle.

    The single-core Athlon XP had nine execution units.

    Huynh, Jack (2003). "The AMD Athlon XP Processor with 512KB L2 Cache": http://courses.ece.uiuc.edu/ece512/Papers/Athlon.pdf

    At the heart of QuantiSpeed architecture is a fully pipelined,
    nine-issue, superscalar processor core. The AMD Athlon XP processor
    provides a wider execution bandwidth of nine execution pipes when
    compared with competitive x86 processors with up to six execution
    pipes. The nine execution engines are comprised of three address
    calculation units, three integer units, and three floating-point units.
    One Nehalem core has a set of nine execution units, much the same as Core 2.

    1x ALU Shift
    1x ALU LEA
    1x ALU Shift Branch
    1x SSE Shuffle ALU
    1x SSE Mul
    1x SSE Shuffle ALU
    1x 128-bit FMUL FDIV
    1x 128-bit FADD
    1x 128-bit FP Shuffle

    Kanter, David (2008) "Inside Nehalem: Intel's Future Processor and System": http://www.realworldtech.com/page.cf...0208182719&p=6
    As with Core 2, the register alias table (RAT) points each architectural register into either the Re-Order Buffer (ROB) or the Retirement Register File (RRF) and holds the most recent speculative state (whereas the RRF holds the most recent non-speculative and committed state). The RAT can rename up to 4 uops each cycle, giving each one a destination register in the ROB. The renamed instructions then read their source operands and issue into the unified Reservation Station (RS), which is used by all instruction types.

    Any instructions in the RS which have all their operands ready are dispatched to the execution units, which are largely unchanged from the Core 2 and unaffected by SMT, except for an increase in utilization.
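    A crude way to picture that dispatch step (just a sketch with made-up entries, not Nehalem's actual logic): the RS is one shared pool, and anything whose operands are ready can go to a free unit in the same cycle, regardless of which thread it came from:

        # Minimal reservation-station sketch: entries are (op, thread, operands_ready)
        rs_entries = [("add", "T0", True), ("mul", "T1", True), ("load", "T0", False)]
        free_units = 3

        # dispatch any ready entry to a free execution unit, thread-agnostic
        dispatched = [e for e in rs_entries if e[2]][:free_units]
        print(dispatched)   # ops from both T0 and T1 dispatch in the same cycle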
