Page 5 of 7 FirstFirst ... 234567 LastLast
Results 101 to 125 of 175

Thread: AMD Tapes Out First "Bulldozer" Microprocessors.

  1. #101
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Manicdan View Post
    ..
    so are cpu designers idiots who overbuild chips knowing its a waste of money and resources? or is it that SMT only gets 15% of a bonus because chips are almost always being fully utilized?

    Not at all. No single thread will ever use all your integer and floating point resources at the same time. You typically have low usage, then spikes when you're running some loops which fit in the caches and so on. Everything works against you, diminishing utilization: code, caches, IMC, RAM, I/O, etc. You can have special cases of hand-tuned software like Linpack where you can achieve almost 100% utilization (they typically get somewhere around 9x% of what's theoretically possible).
    Most real-life code is spaghetti code with data dependencies, branches, etc., where even a CPU with an excellent front end like Nehalem barely manages an IPC of 1.5-2 out of a theoretical 4.
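    As an aside, the IPC ceiling described here is easy to see with a toy issue model (my own illustration, not anything from the posters): a core can only issue instructions whose inputs are already computed, so a serial dependency chain pins IPC at 1 regardless of machine width, while independent instructions saturate the full width.

```python
# Toy issue model (illustrative only, not a real CPU simulator).
# deps[i] = index of the instruction that instruction i depends on, or None.

def ipc(deps, width=4):
    """Cycles needed to issue all instructions on a `width`-wide core."""
    done = set()
    cycles = 0
    while len(done) < len(deps):
        # an instruction is ready once its producer has already issued
        ready = [i for i in range(len(deps))
                 if i not in done and (deps[i] is None or deps[i] in done)]
        for i in ready[:width]:      # issue at most `width` per cycle
            done.add(i)
        cycles += 1
    return len(deps) / cycles

independent = [None] * 8             # 8 instructions with no dependencies
chain = [None] + list(range(7))      # each depends on the previous one

print(ipc(independent))  # 4.0 -> machine width is the limit
print(ipc(chain))        # 1.0 -> the dependency chain is the limit
```

    Real code sits between these two extremes, which is why a 4-wide front end often averages well under 2 IPC.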
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  2. #102
    Banned
    Join Date
    Jan 2010
    Posts
    263
    Quote Originally Posted by savantu View Post
    Now this is an interesting point : what if being dependent on shared resources isn't a liability, but actually a desirable feature ?

    SMT allows me to increase the utilization of an underutilized core. CMT duplicates the core or part of it, thus duplicating the lack of utilization also.

    Example: take a 4-issue-wide core with 4 ALUs. Let's say that most of the time only 1 or 2 of those units are used.
    - With SMT, we have 2 threads running in parallel on that core, the second thread being dispatched to the idle units. Thus we now have 3 or even all of the units in use.
    - With CMT, I add another cluster of 4 ALUs for a total of 8. I have 2 threads, but I also have 2x as many resources available, and each thread uses 1 or 2 ALUs most of the time. Thus, out of 8 in the module, I'm constantly using 3-4 units.
    I get your point completely; that's why I said the key to all the confusion is workload. Once you start running very efficient code (like Linpack) that fills the pipeline, you render SMT useless (even hurting performance in some cases, as it probably steals resources that are needed for more efficient execution of a thread). So we effectively go back to the fundamental reason behind SMT, which is what you already highlighted.

    In this regard, CMT is not necessarily a direct competitor of SMT because it is more about raw processing power (while maximizing die space due to the design) than eliminating thread-level waste, which is the real strength of SMT. SMT is not a substitute for a real core.
    Last edited by OhNoes!; 07-19-2010 at 09:33 AM.

  3. #103
    Xtreme Addict
    Join Date
    Jul 2007
    Posts
    1,488
    Quote Originally Posted by Movieman View Post
    Just a thought but does it really matter the path that the two companies have taken?
    What should matter is the effectiveness of the choice they made.
    IE: Take a $1000.00 intel chip and a $1000.00 AMD chip and see which one does the work you need done better.
    Maybe thats too black and white for you smart guys here but to me thats all that counts..
    The rest is just a way to kill time typing in a forum..
    ( Puts on flamesuit)
    As a (currently occasional) programmer and unrepentant geek, I find it useful and fascinating to know about the underlying architecture my code will be running on. Also, we won't know the price and performance for a fact until it's released. But speculating about it beforehand is not only an interesting way to pass the time but could also lead one to make wiser investing decisions.

    You are very right though. The path taken doesn't matter to most people. What matters is the performance in the apps you use and the price.

    Quote Originally Posted by savantu View Post
    Example: take a 4-issue-wide core with 4 ALUs. Let's say that most of the time only 1 or 2 of those units are used.
    - With SMT, we have 2 threads running in parallel on that core, the second thread being dispatched to the idle units. Thus we now have 3 or even all of the units in use.
    - With CMT, I add another cluster of 4 ALUs for a total of 8. I have 2 threads, but I also have 2x as many resources available, and each thread uses 1 or 2 ALUs most of the time. Thus, out of 8 in the module, I'm constantly using 3-4 units.
    I don't think this is a very representative example. If it were, then adding SMT to a processor would yield nearly as much gain as adding a second core. But even if your example were the average case, I would still prefer an extra core over SMT, because there will still be lots of non-average cases where one thread achieves high utilization and starves the other thread, especially in high-performance cases where the code is more likely to be optimized.

    Savantu makes a fair point though. Higher core utilization is a desirable feature. 8 real cores with high utilization would be better than 8 cores with lower utilization, all else being equal. I don't see how CMT and a tech like SMT are necessarily mutually exclusive.

    Mr Fruehe, would you be able to comment on whether there is any tech, novel or otherwise, in Bulldozer designed to maximize the utilization of a single core? I see AMD talking a lot about exploiting explicit parallelism (multiple threads, multiple programs), but I hope they haven't forgotten about single-thread performance, because that helps both the single-threaded and multithreaded cases.
    Last edited by Solus Corvus; 07-19-2010 at 09:31 AM.

  4. #104
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by savantu View Post
    What's your point, somehow I am missing it ? What do execution units have to do with executing threads simultaneously ?

    Maybe instead of amateur sources and interpretations, we should look into real technical articles, written by the people who invented these technologies and published at conferences and in tech journals.

    I've attached a diagram of the Netburst execution core to show the simultaneous execution of 2 threads; you can find it in this paper
    ftp://download.intel.com/technology/...technology.pdf
    Oh, I didn't know that Intel invented SMT. Several people cite sources dating back to the early 90s, like these: http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf

    Back to my posting: I linked to an Intel slide in an article, which didn't allow direct linking. Fixed that now.

    My point is: there is some misunderstanding about when and where the two threads in an SMT core are active. I wanted to show that during each cycle, both threads can be executed on the available execution units. In other units, like the decode or retirement units of the Netburst architecture (see the paper you linked), the threads are decoded/retired in alternating cycles.

    Since SMT is actually about simultaneously issuing instructions of multiple threads to a set of EUs to make better use of them, the image I posted should serve well to illustrate that. OTOH, the sharing of some of the other resources (like the decoder, cache, RF) as shown by the Netburst pipeline image is not about avoiding underutilization but causes overutilization (unless one thread stalls due to a cache miss or so).
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  5. #105
    Xtreme Enthusiast
    Join Date
    Oct 2007
    Location
    Hong Kong
    Posts
    526
    Quote Originally Posted by OhNoes! View Post
    I get your point completely; that's why I said the key to all the confusion is workload. Once you start running very efficient code (like Linpack) that fills the pipeline, you render SMT useless (even hurting performance in some cases, as it probably steals resources that are needed for more efficient execution of a thread). So we effectively go back to the fundamental reason behind SMT, which is what you already highlighted.

    In this regard, CMT is not necessarily a direct competitor of SMT because it is more about raw processing power (while maximizing die space due to the design) than eliminating thread-level waste, which is the real strength of SMT. SMT is not a substitute for a real core.
    SMT is generally good for server workloads but not for workstation ones.

  6. #106
    Xtreme Enthusiast
    Join Date
    Dec 2009
    Posts
    846
    Quote Originally Posted by richierich View Post
    Newegg:

    AMD Opteron 6172 12-core 2.1GHz $1009

    Intel Xeon X5550 Quad-Core 2.66GHz $1016.49
    Intel Xeon X5650 Hexa-Core 2.66GHz $1024.71

    Is this what you mean, MM?
    Best X5650 is 346 (int) and 241 (FP)
    Best 6172 is 375 (int) and 309 (FP)

    That makes it 8-28% faster. At the same price point.
    While I work for AMD, my posts are my own opinions.

    http://blogs.amd.com/work/author/jfruehe/

  7. #107
    Xtreme Enthusiast
    Join Date
    Dec 2009
    Posts
    846
    Quote Originally Posted by qcmadness View Post
    SMT is generally good for server workloads but not for workstation ones.
    I would disagree. It is better in a desktop workload than a server workload. Not familiar enough with workstations to say where they land.
    While I work for AMD, my posts are my own opinions.

    http://blogs.amd.com/work/author/jfruehe/

  8. #108
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by savantu View Post
    Now this is an interesting point : what if being dependent on shared resources isn't a liability, but actually a desirable feature ?

    SMT allows me to increase the utilization of an underutilized core. CMT duplicates the core or part of it, thus duplicating the lack of utilization also.

    Example: take a 4-issue-wide core with 4 ALUs. Let's say that most of the time only 1 or 2 of those units are used.
    - With SMT, we have 2 threads running in parallel on that core, the second thread being dispatched to the idle units. Thus we now have 3 or even all of the units in use.
    - With CMT, I add another cluster of 4 ALUs for a total of 8. I have 2 threads, but I also have 2x as many resources available, and each thread uses 1 or 2 ALUs most of the time. Thus, out of 8 in the module, I'm constantly using 3-4 units.
    The low IPCs known for many workloads are not caused by missing ILP, but mainly by missing data from caches or external memory and by branch mispredictions. So if there is no cache miss or mispredicted branch, avg. IPC might well be around 3 at that time.

    But why not try using the free execution resources with the thread itself? If, for example, the thread stalls due to a cache miss, it could speculatively continue to run the instructions which load data, as a kind of prefetch. After the cache miss is served, the thread could execute normally, but with the data it needs next already in place.

    Another reason for more EUs than average use requires is a checkpointing and replaying architecture. Sometimes speculation goes wrong, and it's good to be able to replay instructions quickly.
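    The run-ahead idea above can be sketched with a toy model (my own illustration; the latency numbers are arbitrary): while the first miss is outstanding, the scout pass discovers later miss addresses and prefetches them, so the thread hits in the cache when it re-executes those loads.

```python
# Toy run-ahead model (illustrative assumptions, not a real simulator).
MISS_LATENCY = 100  # cycles for a cache miss; hits take 1 cycle

def run(trace, runahead):
    """trace: list of (address, hit?) loads; returns total cycles."""
    cycles, warmed = 0, set()
    for i, (addr, hit) in enumerate(trace):
        if hit or addr in warmed:
            cycles += 1
        else:
            cycles += MISS_LATENCY
            if runahead:
                # while the miss is outstanding, scout ahead and
                # prefetch the addresses of upcoming misses
                for a, h in trace[i + 1:]:
                    if not h:
                        warmed.add(a)
    return cycles

trace = [("A", False), ("B", True), ("C", False), ("D", False)]
print(run(trace, runahead=False))  # 301: every miss pays full latency
print(run(trace, runahead=True))   # 103: later misses were prefetched
```

    The overlap of miss latencies is the whole benefit; as savantu notes below, getting this right in real hardware (e.g. Sun's Rock) proved very hard.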
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  9. #109
    Banned
    Join Date
    Jun 2008
    Location
    Mi
    Posts
    1,063
    Quote Originally Posted by savantu View Post
    Because it is damn hard to achieve high utilization in just about every domain. Look at your body; when do you use it at max capacity ? The arms for example ? I really doubt you use more than 30-40% of the capacity of the right arm and 10-20% of the left one ( if you're right handed ).

    But, I suppose, you don't think in this way : " why does the human body have 2 arms if they are never used to maximum capacity ? ".

    It's not a matter of utilization, but of capability.


    You seem to look at all of this in a vacuum. Why are all your remarks predicated on the belief that Intel's design is better? Yet you don't fundamentally understand the physical aspects of... more?

    Efficiency, cost, and wattage are all demographics and marketing of the product, but what is and isn't on a chip is straightforward (architecture).

    Secondly, the CPU doesn't know what is coming next; it doesn't know what software is coursing through its veins, or even care.. it just executes.

    So, if a particular CPU can handle more of what you are trying to do with it, then it's better.. so which architecture you have doesn't matter, as long as it suits your needs. (ie: movieman's point)

    What I don't get is that you are arguing about what is actually happening inside the chips.. and saying it isn't what is actually happening.

    Here:
    • 4 lanes of highway & 1 toll booth is not good for heavy traffic, but efficient on manpower..
    • 4 lanes & 4 toll booths is great, but excessive if you have light traffic..

    You pick^... do you travel during rush hour, or not?

  10. #110
    Xtreme Enthusiast
    Join Date
    Feb 2009
    Posts
    800
    Quote Originally Posted by savantu View Post
    SMT allows me to increase the utilization of an underutilized core. CMT duplicates the core or part of it, thus duplicating the lack of utilization also.
    CMT duplicates part of the core, not the whole core. Why call it CMT if it duplicated the whole core? Both SMT and CMT take up die space. However, both are also means of exploiting all available die space for performance. Simply put, SMT and CMT increase performance per area. A 50% (theoretical) die-space increase to yield twice the performance? Who wouldn't want that? Hell, I don't think AMD would be selling 8 cores to desktops later if it wasn't for this tech.

    Well, do you understand why AMD wants CMT? If you're still stubborn, I guess you like Intel's methods way better.

    Plus, your scenario seems to indicate that SMT gives twice the performance. We all know this isn't true. And we all know adding another core gives almost twice the performance (or more than twice, if it somehow removes a bottleneck somewhere) for well-threaded software. The lack of utilization is also automatically fixed by the units the two cores in a module share. Who the hell cares about lack of utilization anyway, when adding another core is very cheap with CMT?

    Oh and, Bulldozer does offer better single-threaded performance. JF has already confirmed that. By how much, we do not know.

    @Xoulz: I don't get your analogy. Perhaps you should refer to Particle's sig

    I want to put more technical detail in this post to cover the holes in the argument, but it would make the post messy.

    AMD probably won't spend years on Bulldozer if CMT isn't any better than SMT in terms of increasing performance per area.

  11. #111
    I am Xtreme
    Join Date
    Jul 2007
    Location
    Austria
    Posts
    5,485
    Quote Originally Posted by Xoulz View Post
    Secondly, the CPU doesn't know what is coming next; it doesn't know what software is coursing through its veins, or even care.. it just executes.
    A word for you: prefetching. Look into it.

  12. #112
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    CPUs have been speculating since the '80s, and we have had dynamic execution since the '70s. Today we have extremely advanced algorithms for speculation: branch prediction, prefetching, caches, TLBs, etc.

  13. #113
    Xtreme Addict
    Join Date
    Jan 2009
    Posts
    1,445
    Quote Originally Posted by Particle View Post
    To be brief, I'll bulletize my bones to pick in this thread.

    • SMT - Listed as threads instead of cores for a reason. Intel's implementation means you'll only ever have [core count] threads active/executing even though you have [core count] * 2 "threads". Let's not lose sight of that. I don't know why Sav is going off on a tangent about this.
    • CMT - Genuinely has [core count] threads active/executing at any given moment. Each module contains two integer units capable of chewing on instructions at the same exact moment. It's two cores per module.
    • Threads - When it comes to comparing thread counts, realize that Intel's SMT-enabled chip thread counts aren't the same thing as AMD's CMT thread counts. In the case of AMD, all those threads are actually executing in parallel. In Intel's case they are not.


    Maybe that will help clear up some confusion.

    thanks.
    [MOBO] Asus CrossHair Formula 5 AM3+
    [GPU] ATI 6970 x2 Crossfire 2Gb
    [RAM] G.SKILL Ripjaws X Series 16GB (4 x 4GB) 240-Pin DDR3 1600
    [CPU] AMD FX-8120 @ 4.8 ghz
    [COOLER] XSPC Rasa 750 RS360 WaterCooling
    [OS] Windows 8 x64 Enterprise
    [HDD] OCZ Vertex 3 120GB SSD
    [AUDIO] Logitech S-220 17 Watts 2.1

  14. #114
    Xtreme Member
    Join Date
    Apr 2010
    Posts
    137
    Quote Originally Posted by JF-AMD View Post
    Let me put it in simple terms:

    With actual cores, throughput generally goes up ~90% when you go from 1 core to 2 cores.

    With SMT, throughput generally goes up ~14% for int and ~20% for FP (from SPEC.org, on Intel-based submissions).

    SMT may double the number of threads, but it does not double the number of pipelines. You can only fit so many executions per cycle based on the pipelines. SMT might give you better utilization, but you are still limited on pipelines.

    Doubling the number of cores will double the number of pipelines and allow for more simultaneous execution. That is the key to this whole discussion. Everyone can argue about how many angels can dance on the head of a pin, but in reality, having more cores means you have a larger dancefloor.
    The advantage of SMT is that it doesn't need a lot of extra chip area, while an extra core takes a lot of chip area.
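    JF-AMD's scaling figures and this area argument can be combined into a rough perf-per-area comparison. The ~5% SMT area cost below is my assumption for illustration, not a number from the thread:

```python
# Rough perf-per-area arithmetic on the figures quoted above
# (~14% int uplift for SMT, ~90% uplift for a full extra core).
# The 5% SMT area cost is an assumed, illustrative figure.

def gain_per_area(throughput_gain, area_cost):
    """Throughput gained per unit of extra die area spent."""
    return throughput_gain / area_cost

smt_int   = gain_per_area(0.14, 0.05)   # ~14% int uplift for ~5% area
full_core = gain_per_area(0.90, 1.00)   # ~90% uplift for ~100% core area

print(f"SMT int:    {smt_int:.1f}x gain per unit area")
print(f"Extra core: {full_core:.1f}x gain per unit area")
```

    On these (assumed) numbers SMT wins on efficiency per transistor while the extra core wins on absolute throughput, which is essentially the whole SMT-vs-CMT debate in this thread.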

  15. #115
    Xtreme Addict
    Join Date
    Jun 2002
    Location
    Ontario, Canada
    Posts
    1,782
    Quote Originally Posted by BoredByLife View Post
    The advantage of SMT is that it doesn't need a lot of extra chip area. While an extra core will take a lot of chip area.
    With the ever-shrinking nodes that CPUs are built on, I'll take the extra die space and a real core. Not arguing either way over SMT vs CMT, just saying I'll take a real core any day.
    Last edited by freeloader; 07-19-2010 at 05:18 PM.
    As quoted by LowRun......"So, we are one week past AMD's worst case scenario for BD's availability but they don't feel like communicating about the delay, I suppose AMD must be removed from the reliable sources list for AMD's products launch dates"

  16. #116
    Xtreme Addict
    Join Date
    Apr 2007
    Location
    canada
    Posts
    1,886
    smt has its flaws and cmt too ... it depends on your task and daily use to figure out what you need
    WILL CUDDLE FOR FOOD

    Quote Originally Posted by JF-AMD View Post
    Dual proc client systems are like sex in high school. Everyone talks about it but nobody is really doing it.

  17. #117
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by savantu View Post
    Not at all. No single thread will ever use all your integer and floating point resources at the same time. You typically have low usage, than spikes when you're running some loops which fit in the caches and so on. Everything works against you, diminishing utilization : code, caches, IMC, RAM, I/O,etc. You can have special cases of hand tuned software like Linpack where you can achieve almost 100% utilization ( they typically get somewhere around 9x% of what's theoretically possible ).
    Most of the real life code is spaghetti code with data dependencies, branches,etc where even a CPU with excellent front end like Nehalem barely manages an IPC of 1.5-2 out of a theoretical 4.
    where did you get this misinformation on code?

    linpack is not "hand coded". it's written in c.

    most code is very simple and as a whole it becomes complex. in a procedural language you would simplify the code into functions. the majority of loops are simple and highly predictable. even nested loops can be predicted very accurately.

    data dependencies vary a lot with what code you are running. a finite state machine is 100% dependent on its current state. matrix multiplication has virtually no dependencies.

    and as for nehalem, i dont think they would have made it 4-issue if that didnt do anything for general code. i dont have vtune but i think a lot of software can benefit from it. the theoretical max is 5 decoded instructions but under heavy restrictions.

  18. #118
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by BoredByLife View Post
    The advantage of SMT is that it doesn't need a lot of extra chip area. While an extra core will take a lot of chip area.
    This has been discussed before. Both the AMD engineers asked by John and Andy Glew said that the additional cluster (now core) adds about 12.5% to the core itself. If you look at today's chips, in most cases the cores take less than half of the area. Thus the increased chip size amounts to ~5%. It's not the same type of core you have in mind; it's just a bunch of duplicated execution resources (scheduler, ROB, registers, etc.) and one additional 16k L1 cache.
    Last edited by Dresdenboy; 07-19-2010 at 10:00 PM.
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  19. #119
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Dresdenboy View Post
    Oh, I didn't know that Intel invented SMT. Several people cite sources dating back to the early 90s, like these: http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
    Who said that?
    At most, they were the first to successfully implement it in a commercial large volume product.

    Back to my posting: I linked to an Intel slide in an article, which didn't allow direct linking. Fixed that now.

    My point is: There is some misunderstanding about when and where those two threads in a SMT core are active. I wanted to show, that during each cycle, both threads could be executed on the available execution units. In other units, like the decode or retirement units of the Netburst architecture (see the paper linked by you) , the threads are being decoded/retired in alternating cycles.
    There is no misunderstanding. They are active at the same time. Even with the threads being decoded and retired in alternating cycles, that doesn't say anything about the parallel execution. Since we're talking, after all, about a shared resource, it is logical that some arbitration exists at different levels.
    Since SMT actually is about simultaneously issueing instructions of multiple threads to a set of EUs to make better use of them,
    Says who? You're basically narrowing the definition of SMT so you can build an argument on a corner case that would go against the new definition. How do you want to have a discussion when you keep changing the definition of things so it fits your stance?
    Talk about a logical fallacy the size of Everest.


    Quote Originally Posted by Dresdenboy View Post
    The low IPCs known for many workloads are not caused by missing ILP, but mainly by missing data from caches or external memory and by branch mispredictions. So if there is no cache miss or mispredicted branch, avg. IPC might well be around 3 at that time.
    Both cache hit rates and branch prediction accuracy are in the mid-90% range for modern CPUs. Yet they struggle to get above 1.5 IPC. The reason is simple: ILP is truly missing.
    Most code is written in a sequential manner, and data dependencies are the norm.
    But why don't you try using the free execution resources with the thread itself? If for example the thread stalls due to a cache miss, it could speculatively continue to run those instructions, which load data, as kind of prefetch instructions. After serving the cache miss the thread could execute normally, but having the further needed data in place.
    Another reason for more EUs than average use requires is a checkpointing and replaying architecture. Sometimes speculation goes wrong, and it's good to be able to replay instructions quickly.

    I am sure future CPUs will include run-ahead and advanced replay microarchitectures. But it is possible that OoO execution will hinder such approaches. The first CPU with scout threads (run-ahead execution), Sun's Rock, was a major failure performance-wise.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  20. #120
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Xoulz View Post
    It's not a matter of utilization, but of capability.
    Expand on this please so we can better understand you.

    You seem to look at all of this in a vacuum. Why are all your remarks predicated on the fact you believe Intel's design is better? Yet, don't fundamentally understand physical aspects of... more?
    It looks like you're not understanding my remarks very well. I am simply trying to clarify what SMT is and what it is not. People think that in an SMT core only 1 thread is running at any given time. That is false. You have 2 or even 4 (Power7) threads executing simultaneously.

    Whether it is better than CMT or not is another discussion. My view is that SMT is complementary to CMT. It isn't like we need to choose between SMT and more cores/execution units. The optimal solution is more cores/execution units with SMT (or any other form of multithreading).



    Quote Originally Posted by blindbox View Post
    CMT duplicates part of the core, not the whole core. Why call CMT when it duplicates the whole core?
    In AMD terminology, a cluster = core. JF clearly states that a few posts above.
    Both SMT and CMT takes up die space. However, both are also means to exploit all available die-space for performance. Simply put, SMT and CMT increases performance per area. 50% (theoretical) die-space increase to yield twice the performance? Who wouldn't want that? Hell, I don't think AMD will be selling 8-cores to desktops later if it wasn't for this tech.
    Don't assume the corner case to be representative of all situations. Generalization is bad in this scenario.
    If 1 core = 1x in performance, how can 1.5 cores = 2x in performance? Maybe performance in this context means integer performance, where you basically doubled the resources. No wonder in that case.
    Well, do you understand why AMD wants CMT? If you're still stubborn, I guess you like Intel's methods way better.
    I understand perfectly well why AMD wants CMT. I am not sure however if you understand its merits and those of SMT for example.
    Plus, your scenario seems to indicate that SMT gives twice the performance. We all know this isn't true.
    Nice red herring there. "Argumentum ad populum", or appeal to the majority, is a logical fallacy. Here's the answer: what you "know" is wrong.

    Per core performance of Power 7 (single thread/base core = 1.0):

    8 cores active/chip

    single 1.0x
    SMT2 ~1.5x
    SMT4 ~1.8x

    4 cores active/chip (turbo)

    single ~1.15x
    SMT2 ~1.7x
    SMT4 ~2.05x

    SMT truly offers some bang for the buck ( or die space if you want ).

    https://www.e-techservices.com/publi...lanetPower.pdf
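    The diminishing returns in the Power7 table above are easy to make explicit (the figures are from the quoted slide; the per-thread split is my arithmetic):

```python
# Marginal gain per extra SMT thread, from the Power7 figures quoted above
# (per-core throughput, single-thread baseline = 1.0, 8 cores active).

power7 = {1: 1.00, 2: 1.50, 4: 1.80}  # threads per core -> relative throughput

gain_2nd_thread = power7[2] / power7[1] - 1        # +50% for the 2nd thread
gain_3rd_4th = (power7[4] / power7[2] - 1) / 2     # ~+10% each for threads 3-4

print(f"2nd thread: +{gain_2nd_thread:.0%}, threads 3-4: +{gain_3rd_4th:.0%} each")
```

    Each additional thread is worth far less than the one before it, which is the usual shape of SMT scaling: big win for the second thread, small wins after that.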

    And we all know adding another core gives almost twice (or more than twice if it somehow removes a bottleneck somewhere) the performance for well-threaded softwares.
    Yeah, like we all saw when dual cores first appeared, and later with quad cores. Here's a clue for you: scaling drops rapidly as you increase core count. You can run some exotic benchmarks to show almost perfect scaling, but in the real world the marginal contribution of each new core drops rapidly as the core count increases.

    Besides, I like your argument about superlinear scaling. Maybe you should also think about a perpetuum mobile device; it would solve our oil problems.
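    The dropping marginal contribution falls straight out of Amdahl's law; this sketch assumes a 10% serial fraction purely for illustration:

```python
# Amdahl's law: with a serial fraction s, speedup on n cores is
# 1 / (s + (1 - s) / n). The 10% serial fraction below is an
# assumed, illustrative value, not a measurement.

def speedup(n, serial_fraction):
    return 1.0 / (serial_fraction + (1 - serial_fraction) / n)

s = 0.10
for n in (1, 2, 4, 8):
    print(n, round(speedup(n, s), 2))  # 1.0, 1.82, 3.08, 4.71
```

    Going from 1 to 2 cores buys ~0.8x extra, while going from 4 to 8 buys only ~0.4x per core: exactly the "marginal contribution drops" effect described above.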
    The lack of utilizations are also automatically fixed by the shared units two cores in a module share. Who the hell cares about lack of utilization anyway, when adding another core is very cheap with CMT?
    Which one is it then? Is it fixed, or does nobody care about it?
    No, it's not fixed. You simply replicate the problem.
    Oh and, Bulldozer does offer better single-threaded performance. JF has already confirmed that. By how much, we do not know.
    It definitely should, since it has more execution units than the K10 generation of cores.
    @Xoulz: I don't get your analogy. Perhaps you should refer to Particle's sig
    I didn't get it either, but maybe he will expand on it.

    AMD probably won't spend years on Bulldozer if CMT isn't any better than SMT in terms of increasing performance per area.
    SMT is harder to get right than CMT. It also takes less die space. If die capacity is a problem and your volume is high enough to warrant the extra development costs, SMT offers a nice benefit.
    OTOH, CMT is good in any situation.

    Quote Originally Posted by Chumbucket843 View Post
    where did you get this misinformation on code?

    linpack is not "hand coded". it's written in c.
    I said "hand tuned". The kernels and math libraries are assembler-optimized.
    most code is very simple and as a whole it becomes complex. in a procedural language you would simplify the code into functions. the majority of loops are simple and highly predictable. even nested loops can be predicted very accurately.

    data dependencies vary a lot with what code you are running. a finite state machine is 100% dependent on it's current state. matrix multiplication has virtually no dependencies.

    and as for nehalem, i dont think they would have made it 4 issue if that didnt do anything for general code. i dont have vtune but it think a lot of software can benefit from it. the theoretical max is 5 decoded instructions but under heavy restrictions.
    All very true.

    Sorry for the lengthy post.
    Last edited by savantu; 07-19-2010 at 11:18 PM.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  21. #121
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Dresdenboy View Post
    This has been discussed before. Both the AMD engineers asked by John and Andy Glew said, that the additional cluster (now core) adds about 12.5% to the core itself.


    Where did they mention 12.5 % ?
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  22. #122
    Registered User
    Join Date
    Sep 2009
    Posts
    77
    Sun Rock - The one and only ancestor of bulldozer. 4 clusters per module, 32 threads.

    Last edited by superrugal; 07-19-2010 at 11:40 PM.

  23. #123
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by savantu View Post
    Where did they mention 12.5 % ?
    Chuck Moore's slide couldn't even have had a BD floorplan as a base for estimations at that time. And even then we wouldn't know whether the L2 (1-2 MB) has been counted in.

    Quote Originally Posted by JF-AMD
    Guys, the engineers have done the math. The additional circuitry increases the total floor space of a module by about 12%.

    Four incremental cores adds ~5% total real estate to the whole die.
    from http://www.amdzone.com/phpbb3/viewto...rt=100#p173927

    The 12.5% figure has been mentioned earlier, but was said to be too exact.

    Quote Originally Posted by Andy Glew
    The multicluster multithreaded core replicates only the out-of-order core, roughly 1/8 the core on some chips. Thus, 2 clusters cost 12.5% area, and hence 12.5% leakage; round up to 15% to account for extra routing.
    From his Multistar architecture description which can be found here:
    http://andyglew.blogspot.com/2009/12...threading.html

    Here the number is an estimation.

    Once I did a calculation of adding the necessary resources for CMT based on the unit sizes of Llano's core, with some resizing due to changed functionality. Based on that, my estimate is ~11% for the core. Adding the uncore would double the die size and lead to an additional ~5.5% for the whole die.

    But the same argument goes for SMT. With increasing uncore areas the area effect of SMT becomes really small - clearly less than 5% for the whole die.
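    The die-area bookkeeping in this post and the quotes above is just two multiplications (JF-AMD's ~12% per module, plus the rough estimate that cores are about half the die):

```python
# Whole-die cost of the extra CMT cluster, from the figures quoted above.
# core_area_fraction is the rough "cores are about half the die" estimate.

module_increase = 0.12       # extra cluster vs. one module (JF-AMD figure)
core_area_fraction = 0.5     # cores as a share of the total die (estimate)

die_increase = module_increase * core_area_fraction
print(f"{die_increase:.0%}")  # ~6%, in line with the quoted ~5%
```

    The same dilution applies to SMT, which is why both techniques look cheap once the growing uncore is counted in.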

    More on our other discussion later.
    Last edited by Dresdenboy; 07-20-2010 at 02:06 AM.
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  24. #124
    Xtreme Addict
    Join Date
    Jan 2008
    Posts
    1,176
    Quote Originally Posted by savantu View Post


    Where did they mention 12.5 % ?
    half a decade ago...

  25. #125
    Xtreme Enthusiast
    Join Date
    Feb 2009
    Posts
    800
    Quote Originally Posted by Jowy Atreides View Post
    half a decade ago...
    Lol.

    12.5% increase in size for an extra core in a module.. daaamnnn. Thanks for the latest info, Dresdenboy. I don't see how this can fail.

    savantu, real numbers instead please? Like, encoding?
