Actually, you don't need to program specifically for HT; you need to program with parallelism in mind. In the multicore era, as long as your program is multithreaded it will take advantage of HT. A multiprocessor-aware program from 1995, designed to run on a Pentium Pro, will take advantage of HT (or current multicore CPUs) without changes.
HT in the P4 pushed desktop apps to go multithreaded and anticipated the multicore era.
Well, let's start with multiprocessing and multithreading then.
Multiprocessing - using 2 or more central processing units; originally this meant servers with multiple CPUs, but now you have multiprocessing at the socket level thanks to multicore CPUs. For this to be used efficiently you need programs that use multiple threads, or you need to run multiple single-threaded programs in parallel. Software that can take advantage of multiprocessing configurations has been around since the '50s.
Multithreading (as in SMT or HT) refers to a CPU's ability to process instructions from multiple threads simultaneously. In both single- and multi-processor configurations it was soon discovered that a single thread isn't able to fully utilize the resources of the hardware; there is a limited amount of instruction-level parallelism that can be extracted from any given thread. A multithreaded core therefore executes instructions from multiple threads at the same time. Those threads can be either from the same program or from different single-threaded programs (I'm talking here about SMT, or pure simultaneous multithreading, not derivatives like SoEMT). A multithreaded core appears to the software as a multiprocessing system (that's why a single real core shows up as multiple logical cores in Task Manager).
In fact, a Nehalem CPU is basically a multiprocessing, multithreaded system at the socket level.
If you connect those two points you'll understand why HyperThreading helps any software designed for multiprocessor use, and even desktop sessions where you are multitasking (running programs in parallel). That's why many Pentium 4 users felt their systems were snappier or more responsive when multitasking: the CPU didn't have to flush the pipeline and switch to the new thread, the threads were running in parallel.
*Obviously I'm greatly simplifying everything here, but I hope you get the point.
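To make that point concrete, here is a minimal C++ sketch (an editorial illustration, not code from any of the posts): a multiprocessor-aware program in the 1995 spirit simply asks the OS how many logical processors it sees and spawns one worker per processor. HT logical cores show up in that count just like real cores do, so the same code picks them up without changes. The do_work function and the thread-count handling are placeholders, not anyone's actual code.

```cpp
// Minimal sketch: a multiprocessor-aware program that never needs to
// know whether the "processors" it sees are sockets, cores, or HT
// logical processors. It just spawns one worker per reported CPU.
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

void do_work(std::size_t id) {
    // Placeholder workload; a real program would do useful work here.
    volatile double x = 0.0;
    for (int i = 0; i < 1'000'000; ++i) x += i * 0.5;
    std::cout << "worker " << id << " done\n";
}

int main() {
    // hardware_concurrency() reports logical processors, so an HT CPU
    // (or a multicore one) is picked up automatically.
    std::size_t n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;   // the call may return 0 if it can't tell

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < n; ++i)
        workers.emplace_back(do_work, i);
    for (auto& t : workers)
        t.join();
}
```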
nope.
That paper simulated an unrealistic processor. 24-instruction issue width? That (and some other things) might skew the results. I would take computer simulations of pretty much anything with a grain of salt.

Second, a module is not "a core". I can understand your objection, and have even asked a few questions myself about this architectural terminology. For instance, why doesn't Intel simply say that they have a 12 "core" product today?
The answer is in the performance scaling and the amount of shared resources. According to AMD, adding the second "core" of a module scales about 80% as well as adding a second complete core does today. The only real difference between "real" cores and BD "cores" is that BD cores share more resources than we have typically seen in existing architectures. Intel's SMT only gains 20-30% in the best situations.
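One rough way to put those figures side by side (my arithmetic, reading "scaling" as the gain from the second thread): call one full core 1.0. A second complete core adds another 1.0, giving 2.0; a second BD "core" in a module adds about 80% of that, giving roughly 1.8; SMT's 20-30% gain gives roughly 1.2-1.3. That gap is what the rest of this argument is about.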
CMT is nothing like SMT other than it attempts to share resources within a processor to execute more efficiently. AMD isn't calling their approach CMT (yet). I don't think they intend to at this point in time, but IMHO, it is pretty close to the definition.
This is a pretty good article: http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
I don't think that CMT is appreciably bigger than SMT considering the rather small portion of the die that a core currently takes up compared to cache. Additionally, it can afford to be somewhat larger since it scales the performance by a much bigger margin than does SMT.
Finally, scaling CMT to 3 or 4 cores should be of little difficulty for AMD in future iterations of BD. Not only would Intel gain little performance from moving to 4 way SMT from 2 way, but it would be very costly in development time and would still have the limitations and issues discussed in the article I linked to above.
I do agree, and I think CMT is a good idea, but it's definitely not going to be as simple as adding more execution units. Control logic and communications dominate die area, and they are more important to get right than execution. You could have 10,000 64-bit FPUs on a 32nm chip, but they would be pretty much useless because nothing could take advantage of them.
Cache coherency is a race condition caused by memory latency, and I am only taking the behavioural domain into account. A core has an architecture, which is an algorithm; the algorithm must be able to execute instructions.

Really? What about cache? What about cache coherency? Processors haven't been "independent" since the advent of SMP.
You're taking what I said out of context... I think. SMT only replicates the architectural state for each thread; CMT adds more ALUs running their own threads.

Not according to the research I have read. Perhaps you have other information?
It's really easy to write bad code in ASM; it's for experts.

Faster? Only if the ASM is poorly written. I agree with your other points though. Most applications have little need for the speed and efficiency of ASM (with the exception of embedded applications, and even most of those can be done in C with enough efficiency to get by competitively).
Would it help if a programmer who writes multithreaded code said he was right? I am one, and he is right. You're doubting a very basic concept, akin to debating whether trees are primarily made of metal instead of wood.
A thread is a thread. If your program creates four of them, the OS schedules those threads to run on any of the available pipelines (I'll use that term for execution slots, since what HT provides is used like a logical core but isn't actually a core). The program's threads might be scheduled on a real core or not, but either way the program doesn't have to be aware. It created the threads, and those threads can run wherever the OS's thread scheduler wants to send them: a real core, an HT pipe, or any future construct that presents the logical appearance of a "core".
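As a small illustration of that (my sketch, not from the post above; Windows-specific since the discussion is already in Task Manager terms): the program below creates threads with no affinity hints at all, and each thread just reports which logical processor the scheduler happened to put it on, using GetCurrentProcessorNumber(). Whether a given number is a full core or an HT logical processor is the scheduler's business, not the program's.

```cpp
// Sketch: threads are created with no core awareness; the OS scheduler
// alone decides whether each runs on a real core or an HT logical CPU.
// Windows-specific, since GetCurrentProcessorNumber() reports where the
// calling thread is currently executing.
#include <windows.h>
#include <cstdio>
#include <thread>
#include <vector>

void worker(int id) {
    for (int i = 0; i < 5; ++i) {
        // Report the logical processor we happen to be running on right
        // now; the scheduler is free to move us between iterations.
        std::printf("thread %d on logical CPU %lu\n",
                    id, GetCurrentProcessorNumber());
        Sleep(10);   // give the scheduler a chance to migrate us
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker, i);
    for (auto& t : threads)
        t.join();
}
```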
Particle's First Rule of Online Technical Discussion:
As the length of a thread about any computer-related subject approaches infinity, the likelihood of a poorly constructed AMD vs. Intel fight increases exponentially.
Rule 1A:
Likewise, the frequency of a car pseudoanalogy to explain a technical concept increases with thread length. This will make many people chuckle, as computer people are rarely knowledgeable about vehicular mechanics.
Rule 2:
When confronted with a post that is contrary to what a poster likes, believes, or most often wants to be correct, the poster will pick out only minor details that are largely irrelevant in an attempt to shut out the conflicting idea. The core of the post will be left alone since it isn't easy to contradict what the person is actually saying.
Rule 2A:
When a poster cannot properly refute a post they do not like (as described above), the poster will most likely invent fictitious counter-points and/or begin to attack the other's credibility in feeble ways that are dramatic but irrelevant. Do not underestimate this tactic, as in the online world this will sway many observers. Do not forget: Correctness is decided only by what is said last, the most loudly, or with greatest repetition.
Rule 3:
When it comes to computer news, 70% of Internet rumors are outright fabricated, 20% are inaccurate enough to simply be discarded, and about 10% are based in reality. Grains of salt--become familiar with them.
Remember: When debating online, everyone else is ALWAYS wrong if they do not agree with you!
Random Tip o' the Whatever
You just can't win. If your product offers feature A instead of B, people will moan how A is stupid and it didn't offer B. If your product offers B instead of A, they'll likewise complain and rant about how anyone's retarded cousin could figure out A is what the market wants.
There is a problem with the logic here: a multiprocessing-aware program will not automatically run great on multicore CPUs, and multicore systems are generally better than multiprocessor ones because of bandwidth and overall design.
A CPU with 1 core and HT is not in any way equal to one with 2 cores. There is no way SMT will have its own independent resources, because that would increase the silicon size a lot. The only way they are even close is if a program is so badly written that the degraded performance of a real core looks similar to what SMT can achieve.
Yes, there is an impact from HT, but is it worthwhile for Intel to put HT in their latest CPUs? I don't think so. In the end it's all about how effective the code is and how well it runs and uses the CPU's resources.
I also wanted to point out that you are all giving super responsibilities to the thread scheduler, and that's a bit much. Yes, it assigns priorities, but it does not do its job 100% of the time; there are several places where it can miss and mess up. Imagine putting a hungry runner on the HT pipe while a similar runner is on the real core: the one on the HT pipe will have to wait until enough resources are freed up before it can start.
You all recall the first K10s: those had a dynamic per-core clock feature, and the thread scheduler messed up big on that, assigning workloads to cores that were clocked low and causing performance loss. The point is that the thread scheduler is not a god; it makes mistakes, and a lot of them. They have tried to make it better in Vista.
The program doesn't have to be aware of logical versus real cores, but it's better if it is; that results in better program execution and less reliance on the thread scheduler.
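For completeness, here is a minimal Windows sketch (my illustration; the mask values are assumptions, since which logical-processor numbers are HT siblings of the same physical core varies by system) of what "being aware" can look like: pinning threads with SetThreadAffinityMask() instead of leaving placement entirely to the scheduler.

```cpp
// Sketch: explicitly pinning worker threads instead of relying on the
// scheduler. The masks below (logical CPUs 0 and 2) are only an example;
// the mapping of logical CPU numbers to physical cores / HT siblings is
// system-dependent and would have to be queried on a real system.
#include <windows.h>
#include <cstdio>
#include <thread>

void worker(int id, DWORD_PTR mask) {
    // Restrict this thread to the logical processors in 'mask'.
    if (SetThreadAffinityMask(GetCurrentThread(), mask) == 0)
        std::printf("thread %d: failed to set affinity\n", id);

    volatile double x = 0.0;
    for (int i = 0; i < 10'000'000; ++i) x += i;   // placeholder work
    std::printf("thread %d finished on CPU %lu\n",
                id, GetCurrentProcessorNumber());
}

int main() {
    // Example only: put one worker on logical CPU 0 and one on CPU 2,
    // hoping they land on different physical cores.
    std::thread t0(worker, 0, DWORD_PTR(1) << 0);
    std::thread t1(worker, 1, DWORD_PTR(1) << 2);
    t0.join();
    t1.join();
}
```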
I have two primary complaints regarding your reply.
The first is that you completely misunderstood what I was saying and assumed I was talking about performance instead of utilization. I was trying to show how a program that creates two threads doesn't have to know anything about the underlying hardware when it comes to cores. Those two threads can easily take advantage of either cores or pseudocores based purely on what the thread scheduler does with them. As such, you could have written an MT program in 1995 that would end up utilizing HT and, later on, CMT. Would it benefit? That's a different story entirely, but it would certainly execute using both the core and the HT pseudocore if the thread scheduler decided to send it there. THAT is what I was demonstrating, and it is certainly correct.
The second complaint is that NO, I am not overstating what the scheduler does. If I create an app that spawns two threads and provide no further direction with regard to affinity or whatnot, it is 100% up to the scheduler to decide where and when to execute each thread. That's what I said in my post and that is absolutely correct.
Just relax and don't get your expectations too high for August.
Just pray that AMD pulls a magic rabbit out of its hat and gets Intel back on its toes and into competitive pricing again. I'm tired of this market being so one-sided (Intel: performance; AMD: budget).
So you don't need to code for HT to get it to work properly... but if HT gives great gains, does that mean the program wasn't properly written??? I think I'm getting a bit less lost LOL
But wasn't the point of HT to use the same hardware resources as the real core, just to try and remove idle gaps in the processing pipeline?
I didn't talk about TRIPS; that is just another architecture. I pointed to the paradigms shown before the TRIPS architecture was discussed as a solution. You can find many other talks (related to future CPUs from AMD) where Chuck Moore repeated most of the points.
Many things changed after NetBurst, among them the FO4 inversion delay per cycle and the proven methods for saving energy in a high-performance microarchitecture. Remember POWER6 or Cell, which were low-FO4 (short cycle time) designs.
One thing I can imagine being in Bulldozer is slow and fast clock domains (as in Cell), so that a fast clock only has to be provided to part of the logic.
That's correct. But several BD-related patents indicate that there could be many buffers and queues to help loosen the coupling between different units. There are also patents talking about data crossing clock domains. A simple case would be to have some units running at twice the clock frequency; you could even interleave the accesses of slow (half-frequency) units on a half-cycle basis.
No, shutting cores down just saves more power, but the required conditions are met more rarely, and running everything powered on is the inverse of that measure. You can use clock gating, but the relatively increasing static leakage will begin to hurt, and standard apps don't show the behaviour of a thermal virus. So imagine this: the power management powers off an ALU and an AGU if integer throughput indicates low utilization. It also powers off half of the cache if it isn't needed, and powers down half of the decoders because a µOp cache helps keep the remaining units busy. Finally it could reduce the clock of the retirement unit. Result: 96% of app performance at 60% of the power. This is just an example of what could be done.
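To put that example in perspective: 96% of the performance at 60% of the power works out to roughly 0.96 / 0.60 ≈ 1.6 times the performance per watt.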
Ya, your points are valid, and I knew it when I wrote what I did; my point was how low the benefit would be in HT's case.
Secondly, yes, exactly: it is 100% up to the scheduler to decide where and when to execute each thread, even if it makes a dodo and assigns a heavy job to the HT pipe.
The fact remains that SMT does help, but mostly with inefficient (or small) code, and once proper utilization of resources is achieved through a better OS allocator, HT's effectiveness will decrease in most cases.
The surprise is that everyone will get a pair of CAT shoes with each processor. Yay.
You definitely DON'T have to code for HT to have it "work", but if you really want to use an HT CPU to the fullest and get the most performance out of it, then I'm pretty sure you HAVE to take HT into account to prevent stalls in the pipeline, cache thrashing, etc. (see the sketch after this post).
But I'm not sure how big an effort that is compared to properly coding for multiple cores, because you have to take most of HT's limitations into account on a multicore platform as well. So if you code with HT in mind, that code will probably run close to perfectly on a multicore CPU without HT too.
Which is why it's a shame that most programmers ignored HT when Intel introduced it. It would have been the perfect preparation for multicore and wouldn't have resulted in the software stall we all saw a few years ago and still see now. The transition from single core to multicore would have been much smoother if everybody had picked up HT, I think.
I'm not a programmer, so please take everything I wrote here with a big grain of salt ^^
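To illustrate the cache-thrashing point from the post above (my sketch, not the poster's code), here is one HT-relevant pitfall and its usual fix: two threads updating counters that share a cache line will fight over that line, and on an HT core the two hardware threads even share the same L1 cache. Padding each counter onto its own cache line (the 64-byte size is an assumption typical of x86) avoids that, and the same fix helps on real multicore too.

```cpp
// Hypothetical illustration: padding per-thread counters to separate
// cache lines so two threads (for example the two HT siblings of one
// core, which share the L1 cache) don't fight over the same line.
#include <atomic>
#include <cstdint>
#include <iostream>
#include <thread>

constexpr std::size_t kCacheLine = 64;   // typical x86 cache-line size

struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
    // alignas() makes each PaddedCounter start on its own cache line,
    // and the struct is padded out to kCacheLine bytes as a result.
};

int main() {
    PaddedCounter counters[2];   // one counter per worker thread

    auto worker = [&counters](int id) {
        for (int i = 0; i < 10'000'000; ++i)
            counters[id].value.fetch_add(1, std::memory_order_relaxed);
    };

    std::thread t0(worker, 0), t1(worker, 1);
    t0.join();
    t1.join();

    std::cout << counters[0].value << " " << counters[1].value << "\n";
}
```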
I got the idea that it needed proper coding because HT doesn't have its own dedicated math hardware (FP units, ALUs, etc.) and shares those resources with the real core. That's why I got it all wrong in the beginning... or maybe there is still some part of it that I'm missing.
From the software layer's point of view, HT is in all respects a real core; the software does not know that it is just some duplicated structures in part of the CPU made to look like two cores, when in reality there is a single set of execution units.
So, if you code for multicore, you "code for HT". Basically, everybody writing heavy programs should code for multicore; when you do, HT can in most cases improve core performance by utilising the execution units more efficiently.