Page 5 of 7 FirstFirst ... 234567 LastLast
Results 101 to 125 of 175

Thread: AMD Tapes Out First "Bulldozer" Microprocessors.

  1. #101
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Manicdan View Post
    ..
    so are cpu designers idiots who overbuild chips knowing its a waste of money and resources? or is it that SMT only gets 15% of a bonus because chips are almost always being fully utilized?

    Not at all. No single thread will ever use all your integer and floating point resources at the same time. You typically have low usage, then spikes when you're running some loops which fit in the caches and so on. Everything works against you, diminishing utilization: code, caches, IMC, RAM, I/O, etc. You can have special cases of hand-tuned software like Linpack where you can achieve almost 100% utilization (they typically get somewhere around 9x% of what's theoretically possible).
    Most real-life code is spaghetti code with data dependencies, branches, etc., where even a CPU with an excellent front end like Nehalem barely manages an IPC of 1.5-2 out of a theoretical 4.
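    As an aside, the IPC ceiling described here is easy to see with a toy issue model (my own illustration, not anything from the posters): a core can only issue instructions whose inputs are already computed, so a serial dependency chain pins IPC at 1 regardless of machine width, while independent instructions saturate the full width.

```python
# Toy issue model (illustrative only, not a real CPU simulator).
# deps[i] = index of the instruction that instruction i depends on, or None.

def ipc(deps, width=4):
    """Cycles needed to issue all instructions on a `width`-wide core."""
    done = set()
    cycles = 0
    while len(done) < len(deps):
        # an instruction is ready once its producer has already issued
        ready = [i for i in range(len(deps))
                 if i not in done and (deps[i] is None or deps[i] in done)]
        for i in ready[:width]:      # issue at most `width` per cycle
            done.add(i)
        cycles += 1
    return len(deps) / cycles

independent = [None] * 8             # 8 instructions with no dependencies
chain = [None] + list(range(7))      # each depends on the previous one

print(ipc(independent))  # 4.0 -> machine width is the limit
print(ipc(chain))        # 1.0 -> the dependency chain is the limit
```

    Real code sits between these two extremes, which is why a 4-wide front end often averages well under 2 IPC.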
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  2. #102
    Banned
    Join Date
    Jan 2010
    Posts
    263
    Quote Originally Posted by savantu View Post
    Now this is an interesting point : what if being dependent on shared resources isn't a liability, but actually a desirable feature ?

    SMT allows me to increase the utilization of an underutilized core. CMT duplicates the core or part of it, thus duplicating the lack of utilization also.

    Example: take a 4-issue-wide core with 4 ALUs. Let's say that most of the time only 1 or 2 of those units are used.
    - With SMT, we have 2 threads running in parallel on that core, the second thread being dispatched to the idle units. Thus we now have 3 or even all of the units in use.
    - With CMT, I add another cluster of 4 ALUs for a total of 8. I have 2 threads, but I also have 2x as many resources available, and each thread uses 1 or 2 ALUs most of the time. Thus, out of 8 in the module, I'm constantly using 3-4 units.
    I get your point completely; that's why I said the key to all the confusion is workload. Once you start running very efficient code (like Linpack) that fills the pipeline, you render SMT useless (even hurting performance in some cases, as it probably steals resources that are needed for more efficient execution of a thread). So we effectively go back to the fundamental reason behind SMT, which is what you already highlighted.

    In this regard, CMT is not necessarily a direct competitor of SMT because it is more about raw processing power (while maximizing die space due to the design) than eliminating thread-level waste, which is the real strength of SMT. SMT is not a substitute for a real core.
    Last edited by OhNoes!; 07-19-2010 at 09:33 AM.

  3. #103
    Xtreme Addict
    Join Date
    Jul 2007
    Posts
    1,488
    Quote Originally Posted by Movieman View Post
    Just a thought but does it really matter the path that the two companies have taken?
    What should matter is the effectiveness of the choice they made.
    IE: Take a $1000.00 intel chip and a $1000.00 AMD chip and see which one does the work you need done better.
    Maybe thats too black and white for you smart guys here but to me thats all that counts..
    The rest is just a way to kill time typing in a forum..
    ( Puts on flamesuit)
    As a (currently occasional) programmer and unrepentant geek, I find it useful and fascinating to know about the underlying architecture my code will be running on. Also, we won't know the price and performance for a fact until it's released. But speculating about it beforehand is not only an interesting way to pass the time but could also lead one to make wiser investing decisions.

    You are very right though. The path taken doesn't matter to most people. What matters is the performance in the apps you use and the price.

    Quote Originally Posted by savantu View Post
    Example: take a 4-issue-wide core with 4 ALUs. Let's say that most of the time only 1 or 2 of those units are used.
    - With SMT, we have 2 threads running in parallel on that core, the second thread being dispatched to the idle units. Thus we now have 3 or even all of the units in use.
    - With CMT, I add another cluster of 4 ALUs for a total of 8. I have 2 threads, but I also have 2x as many resources available, and each thread uses 1 or 2 ALUs most of the time. Thus, out of 8 in the module, I'm constantly using 3-4 units.
    I don't think this is a very representative example. If it were, then adding SMT to a processor would yield nearly as much gain as adding a second core. But even if your example were the average case, I would still prefer an extra core over SMT, because there will still be lots of non-average cases where one thread achieves high utilization and starves the other thread, especially in high-performance cases where the code is more likely to be optimized.

    Savantu makes a fair point though. Higher core utilization is a desirable feature. 8 real cores with high utilization would be better than 8 cores with lower utilization, all else being equal. I don't see how CMT and a tech like SMT are necessarily mutually exclusive.

    Mr Fruehe, would you be able to comment on whether there is any tech, novel or otherwise, in Bulldozer designed to maximize the utilization of a single core? I see AMD talking a lot about exploiting explicit parallelism (multiple threads, multiple programs), but I hope they haven't forgotten about single-thread performance, because that helps both the single-threaded and multithreaded cases.
    Last edited by Solus Corvus; 07-19-2010 at 09:31 AM.

  4. #104
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by savantu View Post
    What's your point, somehow I am missing it ? What do execution units have to do with executing threads simultaneously ?

    Maybe instead of amateur sources and interpretations, we should look into real technical articles, written by the people who invented these technologies and published at conferences and in tech journals.

    I've attached a diagram of the Netburst execution core to show the simultaneous execution of 2 threads; you can find it in this paper
    ftp://download.intel.com/technology/...technology.pdf
    Oh, I didn't know that Intel invented SMT. Several people cite sources dating back to the early 90s, like these: http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf

    Back to my posting: I linked to an Intel slide in an article, which didn't allow direct linking. Fixed that now.

    My point is: there is some misunderstanding about when and where the two threads in an SMT core are active. I wanted to show that during each cycle, both threads can be executed on the available execution units. In other units, like the decode or retirement units of the Netburst architecture (see the paper you linked), the threads are decoded/retired in alternating cycles.

    Since SMT is actually about simultaneously issuing instructions of multiple threads to a set of EUs to make better use of them, the image I posted should serve well to illustrate that. OTOH, the sharing of some of the other resources (like the decoder, cache, RF) as shown by the Netburst pipeline image is not about avoiding underutilization but causes overutilization (unless one thread stalls due to a cache miss or so).
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  5. #105
    Xtreme Enthusiast
    Join Date
    Oct 2007
    Location
    Hong Kong
    Posts
    526
    Quote Originally Posted by OhNoes! View Post
    I get your point completely; that's why I said the key to all the confusion is workload. Once you start running very efficient code (like Linpack) that fills the pipeline, you render SMT useless (even hurting performance in some cases, as it probably steals resources that are needed for more efficient execution of a thread). So we effectively go back to the fundamental reason behind SMT, which is what you already highlighted.

    In this regard, CMT is not necessarily a direct competitor of SMT because it is more about raw processing power (while maximizing die space due to the design) than eliminating thread-level waste, which is the real strength of SMT. SMT is not a substitute for a real core.
    SMT is generally good for server workloads but not for workstation ones.

  6. #106
    Xtreme Enthusiast
    Join Date
    Dec 2009
    Posts
    846
    Quote Originally Posted by richierich View Post
    Newegg:

    AMD Opteron 6172 12-core 2.1GHz $1009

    Intel Xeon X5550 Quad-Core 2.66GHz $1016.49
    Intel Xeon X5650 Hexa-Core 2.66GHz $1024.71

    Is this what you mean, MM?
    Best X5650 is 346 (int) and 241 (FP)
    Best 6172 is 375 (int) and 309 (FP)

    That makes it 8-28% faster. At the same price point.
    While I work for AMD, my posts are my own opinions.

    http://blogs.amd.com/work/author/jfruehe/

  7. #107
    Xtreme Enthusiast
    Join Date
    Dec 2009
    Posts
    846
    Quote Originally Posted by qcmadness View Post
    SMT is generally good for server workloads but not for workstation ones.
    I would disagree. It is better in a desktop workload than a server workload. Not familiar enough with workstations to say where they land.
    While I work for AMD, my posts are my own opinions.

    http://blogs.amd.com/work/author/jfruehe/

  8. #108
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by savantu View Post
    Now this is an interesting point : what if being dependent on shared resources isn't a liability, but actually a desirable feature ?

    SMT allows me to increase the utilization of an underutilized core. CMT duplicates the core or part of it, thus duplicating the lack of utilization also.

    Example: take a 4-issue-wide core with 4 ALUs. Let's say that most of the time only 1 or 2 of those units are used.
    - With SMT, we have 2 threads running in parallel on that core, the second thread being dispatched to the idle units. Thus we now have 3 or even all of the units in use.
    - With CMT, I add another cluster of 4 ALUs for a total of 8. I have 2 threads, but I also have 2x as many resources available, and each thread uses 1 or 2 ALUs most of the time. Thus, out of 8 in the module, I'm constantly using 3-4 units.
    The low IPCs known for many workloads are not caused by missing ILP, but mainly by missing data from caches or external memory and by branch mispredictions. So if there is no cache miss or mispredicted branch, avg. IPC might well be around 3 at that time.

    But why not try using the free execution resources with the thread itself? If, for example, the thread stalls due to a cache miss, it could speculatively continue to run the instructions which load data, as a kind of prefetch. After the cache miss is served, the thread could execute normally, but with the data it needs next already in place.

    Another reason for more EUs than average use requires is a checkpointing and replaying architecture. Sometimes speculation goes wrong, and it's good to be able to replay instructions quickly.
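    The run-ahead idea above can be sketched with a toy model (my own illustration; the latency numbers are arbitrary): while the first miss is outstanding, the scout pass discovers later miss addresses and prefetches them, so the thread hits in the cache when it re-executes those loads.

```python
# Toy run-ahead model (illustrative assumptions, not a real simulator).
MISS_LATENCY = 100  # cycles for a cache miss; hits take 1 cycle

def run(trace, runahead):
    """trace: list of (address, hit?) loads; returns total cycles."""
    cycles, warmed = 0, set()
    for i, (addr, hit) in enumerate(trace):
        if hit or addr in warmed:
            cycles += 1
        else:
            cycles += MISS_LATENCY
            if runahead:
                # while the miss is outstanding, scout ahead and
                # prefetch the addresses of upcoming misses
                for a, h in trace[i + 1:]:
                    if not h:
                        warmed.add(a)
    return cycles

trace = [("A", False), ("B", True), ("C", False), ("D", False)]
print(run(trace, runahead=False))  # 301: every miss pays full latency
print(run(trace, runahead=True))   # 103: later misses were prefetched
```

    The overlap of miss latencies is the whole benefit; as savantu notes below, getting this right in real hardware (e.g. Sun's Rock) proved very hard.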
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  9. #109
    Banned
    Join Date
    Jun 2008
    Location
    Mi
    Posts
    1,063
    Quote Originally Posted by savantu View Post
    Because it is damn hard to achieve high utilization in just about every domain. Look at your body; when do you use it at max capacity ? The arms for example ? I really doubt you use more than 30-40% of the capacity of the right arm and 10-20% of the left one ( if you're right handed ).

    But, I suppose, you don't think in this way : " why does the human body have 2 arms if they are never used to maximum capacity ? ".

    It's not a matter of utilization, but of capability.


    You seem to look at all of this in a vacuum. Why are all your remarks predicated on the belief that Intel's design is better? Yet you don't fundamentally understand the physical aspects of... more?

    Efficiency, cost, and wattage are all demographics and marketing of the product, but what is and isn't on a chip is straightforward (architecture).

    Secondly, the CPU doesn't know what is coming next; it doesn't know what software is coursing through its veins, or even care.. it just executes.

    So, if a particular CPU can handle more of what you are trying to do with it, then it's better.. so which architecture you have doesn't matter, as long as it suits your needs. (ie: movieman's point)

    What I don't get is that you are arguing about what is actually happening inside the chips.. and saying it isn't what is actually happening.

    Here:
    • 4 lanes of highway & 1 toll booth is not good for heavy traffic, but efficient on manpower..
    • 4 lanes & 4 toll booths is great, but excessive if you have light traffic..

    You pick^... do you travel during rush hour, or not?

  10. #110
    Xtreme Enthusiast
    Join Date
    Feb 2009
    Posts
    800
    Quote Originally Posted by savantu View Post
    SMT allows me to increase the utilization of an underutilized core. CMT duplicates the core or part of it, thus duplicating the lack of utilization also.
    CMT duplicates part of the core, not the whole core. Why call it CMT if it duplicated the whole core? Both SMT and CMT take up die space. However, both are also means of exploiting all available die space for performance. Simply put, SMT and CMT increase performance per area. A 50% (theoretical) die-space increase to yield twice the performance? Who wouldn't want that? Hell, I don't think AMD would be selling 8 cores to desktops later if it wasn't for this tech.

    Well, do you understand why AMD wants CMT? If you're still stubborn, I guess you like Intel's methods way better.

    Plus, your scenario seems to indicate that SMT gives twice the performance. We all know this isn't true. And we all know adding another core gives almost twice the performance (or more than twice, if it somehow removes a bottleneck somewhere) for well-threaded software. The lack of utilization is also automatically fixed by the units the two cores in a module share. Who the hell cares about lack of utilization anyway, when adding another core is very cheap with CMT?

    Oh and, Bulldozer does offer better single-threaded performance. JF has already confirmed that. By how much, we do not know.

    @Xoulz: I don't get your analogy. Perhaps you should refer to Particle's sig

    I want to put more technical detail in this post to cover the holes in the argument, but it would make the post messy.

    AMD probably won't spend years on Bulldozer if CMT isn't any better than SMT in terms of increasing performance per area.

  11. #111
    I am Xtreme
    Join Date
    Jul 2007
    Location
    Austria
    Posts
    5,485
    Quote Originally Posted by Xoulz View Post
    Secondly, the CPU doesn't know what is coming next; it doesn't know what software is coursing through its veins, or even care.. it just executes.
    A word for you: prefetching. Look into it.

  12. #112
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    CPUs have been speculating since the '80s, and we have had dynamic execution since the '70s. Today we have extremely advanced algorithms for speculation: branch prediction, prefetching, caches, TLBs, etc.

  13. #113
    Xtreme Addict
    Join Date
    Jan 2009
    Posts
    1,445
    Quote Originally Posted by Particle View Post
    To be brief, I'll bulletize my bones to pick in this thread.

    • SMT - Listed as threads instead of cores for a reason. Intel's implementation means you'll only ever have [core count] threads active/executing even though you have [core count] * 2 "threads". Let's not lose sight of that. I don't know why Sav is going off on a tangent about this.
    • CMT - Genuinely has [core count] threads active/executing at any given moment. Each module contains two integer units capable of chewing on instructions at the same exact moment. It's two cores per module.
    • Threads - When it comes to comparing thread counts, realize that Intel's SMT-enabled chip thread counts aren't the same thing as AMD's CMT thread counts. In the case of AMD, all those threads are actually executing in parallel. In Intel's case they are not.


    Maybe that will help clear up some confusion.

    thanks.
    [MOBO] Asus CrossHair Formula 5 AM3+
    [GPU] ATI 6970 x2 Crossfire 2Gb
    [RAM] G.SKILL Ripjaws X Series 16GB (4 x 4GB) 240-Pin DDR3 1600
    [CPU] AMD FX-8120 @ 4.8 ghz
    [COOLER] XSPC Rasa 750 RS360 WaterCooling
    [OS] Windows 8 x64 Enterprise
    [HDD] OCZ Vertex 3 120GB SSD
    [AUDIO] Logitech S-220 17 Watts 2.1

  14. #114
    Xtreme Member
    Join Date
    Apr 2010
    Posts
    137
    Quote Originally Posted by JF-AMD View Post
    Let me put it in simple terms:

    With actual cores, throughput generally goes up ~90% when you go from 1 core to 2 cores.

    With SMT, throughput generally goes up ~14% for int and ~20% for FP (from SPEC.org, on Intel-based submissions).

    SMT may double the number of threads, but it does not double the number of pipelines. You can only fit so many executions per cycle based on the pipelines. SMT might give you better utilization, but you are still limited on pipelines.

    Doubling the number of cores will double the number of pipelines and allow for more simultaneous execution. That is the key to this whole discussion. Everyone can argue about how many angels can dance on the head of a pin, but in reality, having more cores means you have a larger dancefloor.
    The advantage of SMT is that it doesn't need a lot of extra chip area, while an extra core takes a lot of chip area.
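    JF-AMD's scaling figures and this area argument can be combined into a rough perf-per-area comparison. The ~5% SMT area cost below is my assumption for illustration, not a number from the thread:

```python
# Rough perf-per-area arithmetic on the figures quoted above
# (~14% int uplift for SMT, ~90% uplift for a full extra core).
# The 5% SMT area cost is an assumed, illustrative figure.

def gain_per_area(throughput_gain, area_cost):
    """Throughput gained per unit of extra die area spent."""
    return throughput_gain / area_cost

smt_int   = gain_per_area(0.14, 0.05)   # ~14% int uplift for ~5% area
full_core = gain_per_area(0.90, 1.00)   # ~90% uplift for ~100% core area

print(f"SMT int:    {smt_int:.1f}x gain per unit area")
print(f"Extra core: {full_core:.1f}x gain per unit area")
```

    On these (assumed) numbers SMT wins on efficiency per transistor while the extra core wins on absolute throughput, which is essentially the whole SMT-vs-CMT debate in this thread.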

  15. #115
    Xtreme Addict
    Join Date
    Jun 2002
    Location
    Ontario, Canada
    Posts
    1,782
    Quote Originally Posted by BoredByLife View Post
    The advantage of SMT is that it doesn't need a lot of extra chip area. While an extra core will take a lot of chip area.
    With the ever-shrinking nodes that CPUs are built on, I'll take the extra die space and a real core. Not arguing either way over SMT vs CMT, just saying I'll take a real core any day.
    Last edited by freeloader; 07-19-2010 at 05:18 PM.
    As quoted by LowRun......"So, we are one week past AMD's worst case scenario for BD's availability but they don't feel like communicating about the delay, I suppose AMD must be removed from the reliable sources list for AMD's products launch dates"

  16. #116
    Xtreme Addict
    Join Date
    Apr 2007
    Location
    canada
    Posts
    1,886
    smt has its flaws and cmt too ... it depends on your task and daily use to figure out what you need
    WILL CUDDLE FOR FOOD

    Quote Originally Posted by JF-AMD View Post
    Dual proc client systems are like sex in high school. Everyone talks about it but nobody is really doing it.

  17. #117
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by savantu View Post
    Not at all. No single thread will ever use all your integer and floating point resources at the same time. You typically have low usage, than spikes when you're running some loops which fit in the caches and so on. Everything works against you, diminishing utilization : code, caches, IMC, RAM, I/O,etc. You can have special cases of hand tuned software like Linpack where you can achieve almost 100% utilization ( they typically get somewhere around 9x% of what's theoretically possible ).
    Most of the real life code is spaghetti code with data dependencies, branches,etc where even a CPU with excellent front end like Nehalem barely manages an IPC of 1.5-2 out of a theoretical 4.
    where did you get this misinformation on code?

    linpack is not "hand coded". it's written in c.

    most code is very simple and as a whole it becomes complex. in a procedural language you would simplify the code into functions. the majority of loops are simple and highly predictable. even nested loops can be predicted very accurately.

    data dependencies vary a lot with what code you are running. a finite state machine is 100% dependent on its current state. matrix multiplication has virtually no dependencies.

    and as for nehalem, i dont think they would have made it 4-issue if that didnt do anything for general code. i dont have vtune but i think a lot of software can benefit from it. the theoretical max is 5 decoded instructions but under heavy restrictions.

  18. #118
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by BoredByLife View Post
    The advantage of SMT is that it doesn't need a lot of extra chip area. While an extra core will take a lot of chip area.
    This has been discussed before. Both the AMD engineers asked by John and Andy Glew said that the additional cluster (now core) adds about 12.5% to the core itself. If you look at today's chips, in most cases the cores take less than half of the area. Thus the increased chip size amounts to ~5%. It's not the same type of core you have in mind; it's just a bunch of duplicated execution resources (scheduler, ROB, registers, etc.) and one additional 16k L1 cache.
    Last edited by Dresdenboy; 07-19-2010 at 10:00 PM.
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  19. #119
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Dresdenboy View Post
    Oh, I didn't know that Intel invented SMT. Several people cite sources dating back to the early 90s, like these: http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
    Who said that?
    At most, they were the first to successfully implement it in a commercial large volume product.

    Back to my posting: I linked to an Intel slide in an article, which didn't allow direct linking. Fixed that now.

    My point is: There is some misunderstanding about when and where those two threads in a SMT core are active. I wanted to show, that during each cycle, both threads could be executed on the available execution units. In other units, like the decode or retirement units of the Netburst architecture (see the paper linked by you) , the threads are being decoded/retired in alternating cycles.
    There is no misunderstanding. They are active at the same time. Even with the threads being decoded and retired in alternating cycles, that doesn't say anything about the parallel execution. Since we're talking, after all, about a shared resource, it is logical that some arbitration exists at different levels.
    Since SMT actually is about simultaneously issueing instructions of multiple threads to a set of EUs to make better use of them,
    Says who? You're basically narrowing the definition of SMT so you can build an argument on a corner case that would go against the new definition. How do you want to have a discussion when you keep changing the definition of things so it fits your stance?
    Talk about a logical fallacy the size of Everest.


    Quote Originally Posted by Dresdenboy View Post
    The low IPCs known for many workloads are not caused by missing ILP, but mainly by missing data from caches or external memory and by branch mispredictions. So if there is no cache miss or mispredicted branch, avg. IPC might well be around 3 at that time.
    Both cache hit rates and branch prediction accuracy are in the mid-90% range for modern CPUs. Yet they struggle to get above 1.5 IPC. The reason is simple: ILP is truly missing.
    Most code is written in a sequential manner, and data dependencies are the norm.
    But why don't you try using the free execution resources with the thread itself? If for example the thread stalls due to a cache miss, it could speculatively continue to run those instructions, which load data, as kind of prefetch instructions. After serving the cache miss the thread could execute normally, but having the further needed data in place.
    Another reason for more EUs than average use requires is a checkpointing and replaying architecture. Sometimes speculation goes wrong, and it's good to be able to replay instructions quickly.

    I am sure future CPUs will include run-ahead and advanced replay microarchitectures. But it is possible that OoO execution will hinder such approaches. The first CPU with scout threads (run-ahead execution), Sun's Rock, was a major failure performance-wise.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  20. #120
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Xoulz View Post
    It's not a matter of utilization, but of capability.
    Expand on this please so we can better understand you.

    You seem to look at all of this in a vacuum. Why are all your remarks predicated on the fact you believe Intel's design is better? Yet, don't fundamentally understand physical aspects of... more?
    It looks like you're not understanding my remarks very well. I am simply trying to clarify what SMT is and what it is not. People think that in an SMT core only 1 thread is running at any given time. That is false. You have 2 or even 4 (Power7) threads executing simultaneously.

    Whether it is better than CMT or not is another discussion. My view is that SMT is complementary to CMT. It isn't like we need to choose between SMT and more cores/execution units. The optimal solution is more cores/execution units with SMT (or any other form of multithreading).



    Quote Originally Posted by blindbox View Post
    CMT duplicates part of the core, not the whole core. Why call CMT when it duplicates the whole core?
    In AMD terminology, a cluster = core. JF clearly states that a few posts above.
    Both SMT and CMT takes up die space. However, both are also means to exploit all available die-space for performance. Simply put, SMT and CMT increases performance per area. 50% (theoretical) die-space increase to yield twice the performance? Who wouldn't want that? Hell, I don't think AMD will be selling 8-cores to desktops later if it wasn't for this tech.
    Don't assume the corner case to be representative of all situations. Generalization is bad in this scenario.
    If 1 core = 1x in performance, how can 1.5 cores = 2x in performance? Maybe performance in this context means integer performance, where you basically doubled the resources. No wonder in that case.
    Well, do you understand why AMD wants CMT? If you're still stubborn, I guess you like Intel's methods way better.
    I understand perfectly well why AMD wants CMT. I am not sure however if you understand its merits and those of SMT for example.
    Plus, your scenario seems to indicate that SMT gives twice the performance. We all know this isn't true.
    Nice red herring there. "Argumentum ad populum", or appeal to the majority, is a logical fallacy. Here's the answer: what you "know" is wrong.

    Per core performance of Power 7 (single thread/base core = 1.0):

    8 cores active/chip

    single 1.0x
    SMT2 ~1.5x
    SMT4 ~1.8x

    4 cores active/chip (turbo)

    single ~1.15x
    SMT2 ~1.7x
    SMT4 ~2.05x

    SMT truly offers some bang for the buck ( or die space if you want ).

    https://www.e-techservices.com/publi...lanetPower.pdf
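    The diminishing returns in the Power7 table above are easy to make explicit (the figures are from the quoted slide; the per-thread split is my arithmetic):

```python
# Marginal gain per extra SMT thread, from the Power7 figures quoted above
# (per-core throughput, single-thread baseline = 1.0, 8 cores active).

power7 = {1: 1.00, 2: 1.50, 4: 1.80}  # threads per core -> relative throughput

gain_2nd_thread = power7[2] / power7[1] - 1        # +50% for the 2nd thread
gain_3rd_4th = (power7[4] / power7[2] - 1) / 2     # ~+10% each for threads 3-4

print(f"2nd thread: +{gain_2nd_thread:.0%}, threads 3-4: +{gain_3rd_4th:.0%} each")
```

    Each additional thread is worth far less than the one before it, which is the usual shape of SMT scaling: big win for the second thread, small wins after that.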

    And we all know adding another core gives almost twice (or more than twice if it somehow removes a bottleneck somewhere) the performance for well-threaded softwares.
    Yeah, like we all saw when dual cores first appeared, and later with quad cores. Here's a clue for you: scaling drops rapidly as you increase core count. You can run some exotic benchmarks to show almost perfect scaling, but in the real world the marginal contribution of each new core drops rapidly as the core count increases.

    Besides, I like your argument about superlinear scaling. Maybe you should also think about a perpetuum mobile device; it would solve our oil problems.
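    The dropping marginal contribution falls straight out of Amdahl's law; this sketch assumes a 10% serial fraction purely for illustration:

```python
# Amdahl's law: with a serial fraction s, speedup on n cores is
# 1 / (s + (1 - s) / n). The 10% serial fraction below is an
# assumed, illustrative value, not a measurement.

def speedup(n, serial_fraction):
    return 1.0 / (serial_fraction + (1 - serial_fraction) / n)

s = 0.10
for n in (1, 2, 4, 8):
    print(n, round(speedup(n, s), 2))  # 1.0, 1.82, 3.08, 4.71
```

    Going from 1 to 2 cores buys ~0.8x extra, while going from 4 to 8 buys only ~0.4x per core: exactly the "marginal contribution drops" effect described above.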
    The lack of utilizations are also automatically fixed by the shared units two cores in a module share. Who the hell cares about lack of utilization anyway, when adding another core is very cheap with CMT?
    Which one is it then? Is it fixed, or does nobody care about it?
    No, it's not fixed. You simply replicate the problem.
    Oh and, Bulldozer does offer better single-threaded performance. JF has already confirmed that. By how much, we do not know.
    It definitely should, since it has more execution units than the K10 generation of cores.
    @Xoulz: I don't get your analogy. Perhaps you should refer to Particle's sig
    I didn't get it either, but maybe he will expand on it.

    AMD probably won't spend years on Bulldozer if CMT isn't any better than SMT in terms of increasing performance per area.
    SMT is harder to get right than CMT. It also takes less die space. If die capacity is a problem and your volume is high enough to warrant the extra development costs, SMT offers a nice benefit.
    OTOH, CMT is good in any situation.

    Quote Originally Posted by Chumbucket843 View Post
    where did you get this misinformation on code?

    linpack is not "hand coded". it's written in c.
    I said "hand tuned". The kernels and math libraries are assembler-optimized.
    most code is very simple and as a whole it becomes complex. in a procedural language you would simplify the code into functions. the majority of loops are simple and highly predictable. even nested loops can be predicted very accurately.

    data dependencies vary a lot with what code you are running. a finite state machine is 100% dependent on it's current state. matrix multiplication has virtually no dependencies.

    and as for nehalem, i dont think they would have made it 4 issue if that didnt do anything for general code. i dont have vtune but it think a lot of software can benefit from it. the theoretical max is 5 decoded instructions but under heavy restrictions.
    All very true.

    Sorry for the lengthy post.
    Last edited by savantu; 07-19-2010 at 11:18 PM.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  21. #121
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Dresdenboy View Post
    This has been discussed before. Both the AMD engineers asked by John and Andy Glew said, that the additional cluster (now core) adds about 12.5% to the core itself.


    Where did they mention 12.5 % ?
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  22. #122
    Registered User
    Join Date
    Sep 2009
    Posts
    77
    Sun Rock - The one and only ancestor of bulldozer. 4 clusters per module, 32 threads.

    Last edited by superrugal; 07-19-2010 at 11:40 PM.

  23. #123
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by savantu View Post
    Where did they mention 12.5 % ?
    Chuck Moore's slide couldn't even have had a BD floorplan as a base for estimations at that time. And even then we wouldn't know whether the L2 (1-2 MB) has been counted in.

    Quote Originally Posted by JF-AMD
    Guys, the engineers have done the math. The additional circuitry increases the total floor space of a module by about 12%.

    Four incremental cores adds ~5% total real estate to the whole die.
    from http://www.amdzone.com/phpbb3/viewto...rt=100#p173927

    The 12.5% figure has been mentioned earlier, but was said to be too exact.

    Quote Originally Posted by Andy Glew
    The multicluster multithreaded core replicates only the out-of-order core, roughly 1/8 the core on some chips. Thus, 2 clusters cost 12.5% area, and hence 12.5% leakage; round up to 15% to account for extra routing.
    From his Multistar architecture description which can be found here:
    http://andyglew.blogspot.com/2009/12...threading.html

    Here the number is an estimation.

    Once I did a calculation of adding the necessary resources for CMT based on the unit sizes of Llano's core, with some resizing due to changed functionality. Based on that, my estimate is ~11% for the core. Adding the uncore would double the die size and lead to an additional ~5.5% for the whole die.

    But the same argument goes for SMT. With increasing uncore areas the area effect of SMT becomes really small - clearly less than 5% for the whole die.
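    The die-area bookkeeping in this post and the quotes above is just two multiplications (JF-AMD's ~12% per module, plus the rough estimate that cores are about half the die):

```python
# Whole-die cost of the extra CMT cluster, from the figures quoted above.
# core_area_fraction is the rough "cores are about half the die" estimate.

module_increase = 0.12       # extra cluster vs. one module (JF-AMD figure)
core_area_fraction = 0.5     # cores as a share of the total die (estimate)

die_increase = module_increase * core_area_fraction
print(f"{die_increase:.0%}")  # ~6%, in line with the quoted ~5%
```

    The same dilution applies to SMT, which is why both techniques look cheap once the growing uncore is counted in.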

    More on our other discussion later.
    Last edited by Dresdenboy; 07-20-2010 at 02:06 AM.
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  24. #124
    Xtreme Addict
    Join Date
    Jan 2008
    Posts
    1,176
    Quote Originally Posted by savantu View Post


    Where did they mention 12.5 % ?
    half a decade ago...

  25. #125
    Xtreme Enthusiast
    Join Date
    Feb 2009
    Posts
    800
    Quote Originally Posted by Jowy Atreides View Post
    half a decade ago...
    Lol.

    12.5% increase in size for an extra core in a module.. daaamnnn. Thanks for the latest info, Dresdenboy. I don't see how this can fail.

    savantu, real numbers instead please? Like, encoding?
