Actually, you don't need to program specifically for HT; you need to program with parallelism in mind. In the multicore era, as long as your program is multithreaded it will take advantage of HT. A multiprocessor-aware program from 1995, designed to run on a Pentium Pro, will take advantage of HT (or current multicore CPUs) without changes.
HT in the P4 pushed desktop apps to go multithreaded and anticipated the multicore era.
Well, let's start with multiprocessing and multithreading then.
Multiprocessing - using 2 or more central processing units; originally this meant servers with multiple CPUs, but now you have multiprocessing at the socket level thanks to multicore CPUs. For this to be used efficiently you need programs that use multiple threads, or you need to run multiple single-threaded programs in parallel. Software that can take advantage of multiprocessing configurations has been around since the '50s.
Multithreading (as in SMT or HT) refers to a CPU's ability to process instructions from multiple threads simultaneously. In both single- and multi-processor configurations it was soon discovered that a single thread isn't able to fully utilize the resources of the hardware; there is a limited amount of instruction-level parallelism that can be extracted from any given thread. A multithreaded core therefore executes instructions from multiple threads at the same time. Those threads can be either from the same program or from different single-threaded programs (I'm talking here about SMT, or pure simultaneous multithreading, not derivatives like SoEMT). A multithreaded core appears to the software as a multiprocessing system (that's why a single real core shows up as multiple logical cores in Task Manager).
In fact, a Nehalem CPU is basically a multiprocessing, multithreaded system at the socket level.
If you connect those two points you'll understand why HyperThreading helps any software designed for multiprocessor use, and even desktop sessions where you are multitasking (running programs in parallel). That's why many Pentium 4 users felt their systems were snappier or more responsive when multitasking: the CPU didn't have to flush the pipeline and switch to the new thread, the threads were running in parallel.
*Obviously I'm greatly simplifying everything here, but I hope you get the point.
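To make that point concrete, here is a minimal C++ sketch (an editorial illustration, not code from any of the posts): a multiprocessor-aware program in the 1995 spirit simply asks the OS how many logical processors it sees and spawns one worker per processor. HT logical cores show up in that count just like real cores do, so the same code picks them up without changes. The do_work function and the thread-count handling are placeholders, not anyone's actual code.

```cpp
// Minimal sketch: a multiprocessor-aware program that never needs to
// know whether the "processors" it sees are sockets, cores, or HT
// logical processors. It just spawns one worker per reported CPU.
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

void do_work(std::size_t id) {
    // Placeholder workload; a real program would do useful work here.
    volatile double x = 0.0;
    for (int i = 0; i < 1'000'000; ++i) x += i * 0.5;
    std::cout << "worker " << id << " done\n";
}

int main() {
    // hardware_concurrency() reports logical processors, so an HT CPU
    // (or a multicore one) is picked up automatically.
    std::size_t n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;   // the call may return 0 if it can't tell

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < n; ++i)
        workers.emplace_back(do_work, i);
    for (auto& t : workers)
        t.join();
}
```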
nope.
That paper simulated an unrealistic processor. 24-instruction issue width? That (and some other things) might skew the results. I would take computer simulations of pretty much anything with a grain of salt.

Second, a module is not "a core". I can understand your objection, and have even asked a few questions myself about this architectural terminology. For instance, why doesn't Intel simply say that they have a 12 "core" product today?
The answer is in the performance scaling and the amount of shared resources. According to AMD, adding the second "core" of a module scales about 80% as well as adding a second complete core does today. The only real difference between "real" cores and BD "cores" is that BD cores share more resources than we have typically seen in existing architectures. Intel's SMT only gains 20-30% in the best situations.
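One rough way to put those figures side by side (my arithmetic, reading "scaling" as the gain from the second thread): call one full core 1.0. A second complete core adds another 1.0, giving 2.0; a second BD "core" in a module adds about 80% of that, giving roughly 1.8; SMT's 20-30% gain gives roughly 1.2-1.3. That gap is what the rest of this argument is about.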
CMT is nothing like SMT other than it attempts to share resources within a processor to execute more efficiently. AMD isn't calling their approach CMT (yet). I don't think they intend to at this point in time, but IMHO, it is pretty close to the definition.
This is a pretty good article: http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
I don't think that CMT is appreciably bigger than SMT considering the rather small portion of the die that a core currently takes up compared to cache. Additionally, it can afford to be somewhat larger since it scales the performance by a much bigger margin than does SMT.
Finally, scaling CMT to 3 or 4 cores should be of little difficulty for AMD in future iterations of BD. Not only would Intel gain little performance from moving to 4 way SMT from 2 way, but it would be very costly in development time and would still have the limitations and issues discussed in the article I linked to above.
I do agree, and I think CMT is a good idea, but it's definitely not going to be as simple as adding more execution units. Control logic and communications dominate die area, and they are more important to get right than execution. You could have 10,000 64-bit FPUs on a 32nm chip, but they would be pretty much useless because nothing could take advantage of them.
Cache coherency is a race condition caused by memory latency, and I am only taking the behavioural domain into account. A core has an architecture, which is an algorithm; the algorithm must be able to execute instructions.

Really? What about cache? What about cache coherency? Processors haven't been "independent" since the advent of SMP.
You're taking what I said out of context... I think. SMT only replicates the architectural state for each thread; CMT adds more ALUs running their own threads.

Not according to the research I have read. Perhaps you have other information?
It's really easy to write bad code in ASM; it's for experts.

Faster? Only if the ASM is poorly written. I agree with your other points though. Most applications have little need for the speed and efficiency of ASM (with the exception of embedded applications, and even most of those can be done in C with enough efficiency to get by competitively).
Would it help if a programmer who writes multithreaded code said he was right? I am one, and he is right. You're doubting a very basic concept, akin to debating whether trees are primarily made of metal instead of wood.
A thread is a thread. If your program creates four of them, the OS schedules those threads to run on any of the available pipelines (I'll use that term for execution slots, since what HT provides is used like a logical core but isn't actually a core). The program's threads might be scheduled on a real core or not, but either way the program doesn't have to be aware. It created the threads, and those threads can run wherever the OS's thread scheduler wants to send them: a real core, an HT pipe, or any future construct that presents the logical appearance of a "core".
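As a small illustration of that (my sketch, not from the post above; Windows-specific since the discussion is already in Task Manager terms): the program below creates threads with no affinity hints at all, and each thread just reports which logical processor the scheduler happened to put it on, using GetCurrentProcessorNumber(). Whether a given number is a full core or an HT logical processor is the scheduler's business, not the program's.

```cpp
// Sketch: threads are created with no core awareness; the OS scheduler
// alone decides whether each runs on a real core or an HT logical CPU.
// Windows-specific, since GetCurrentProcessorNumber() reports where the
// calling thread is currently executing.
#include <windows.h>
#include <cstdio>
#include <thread>
#include <vector>

void worker(int id) {
    for (int i = 0; i < 5; ++i) {
        // Report the logical processor we happen to be running on right
        // now; the scheduler is free to move us between iterations.
        std::printf("thread %d on logical CPU %lu\n",
                    id, GetCurrentProcessorNumber());
        Sleep(10);   // give the scheduler a chance to migrate us
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker, i);
    for (auto& t : threads)
        t.join();
}
```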
Particle's First Rule of Online Technical Discussion:
As the length of a thread about any computer-related subject approaches infinity, the likelihood of a poorly constructed AMD vs. Intel fight increases exponentially.
Rule 1A:
Likewise, the frequency of a car pseudoanalogy to explain a technical concept increases with thread length. This will make many people chuckle, as computer people are rarely knowledgeable about vehicular mechanics.
Rule 2:
When confronted with a post that is contrary to what a poster likes, believes, or most often wants to be correct, the poster will pick out only minor details that are largely irrelevant in an attempt to shut out the conflicting idea. The core of the post will be left alone since it isn't easy to contradict what the person is actually saying.
Rule 2A:
When a poster cannot properly refute a post they do not like (as described above), the poster will most likely invent fictitious counter-points and/or begin to attack the other's credibility in feeble ways that are dramatic but irrelevant. Do not underestimate this tactic, as in the online world this will sway many observers. Do not forget: Correctness is decided only by what is said last, the most loudly, or with greatest repetition.
Rule 3:
When it comes to computer news, 70% of Internet rumors are outright fabricated, 20% are inaccurate enough to simply be discarded, and about 10% are based in reality. Grains of salt--become familiar with them.
Remember: When debating online, everyone else is ALWAYS wrong if they do not agree with you!
Random Tip o' the Whatever
You just can't win. If your product offers feature A instead of B, people will moan how A is stupid and it didn't offer B. If your product offers B instead of A, they'll likewise complain and rant about how anyone's retarded cousin could figure out A is what the market wants.
There is a problem with the logic here: a multiprocessing-aware program will not automatically run great on multicore CPUs, and multicore systems are generally better than multiprocessor ones because of bandwidth and overall design.
A CPU with 1 core and HT is not in any way equal to one with 2 cores. There is no way SMT will have its own independent resources, because that would increase the silicon size a lot. The only way they are even close is if a program is so badly written that the degraded performance of a real core looks similar to what SMT can achieve.
Yes, there is an impact from HT, but is it worthwhile for Intel to put HT in their latest CPUs? I don't think so. In the end it's all about how effective the code is and how well it runs and uses the CPU's resources.
I also wanted to point out that you are all giving super responsibilities to the thread scheduler, and that's a bit much. Yes, it assigns priorities, but it does not do its job 100% of the time; there are several places where it can miss and mess up. Imagine putting a hungry runner on the HT pipe while a similar runner is on the real core: the one on the HT pipe will have to wait until enough resources are freed up before it can start.
You all recall the first K10s: those had a dynamic per-core clock feature, and the thread scheduler messed up big on that, assigning workloads to cores that were clocked low and causing performance loss. The point is that the thread scheduler is not a god; it makes mistakes, and a lot of them. They have tried to make it better in Vista.
The program doesn't have to be aware of logical versus real cores, but it's better if it is; that results in better program execution and less reliance on the thread scheduler.
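For completeness, here is a minimal Windows sketch (my illustration; the mask values are assumptions, since which logical-processor numbers are HT siblings of the same physical core varies by system) of what "being aware" can look like: pinning threads with SetThreadAffinityMask() instead of leaving placement entirely to the scheduler.

```cpp
// Sketch: explicitly pinning worker threads instead of relying on the
// scheduler. The masks below (logical CPUs 0 and 2) are only an example;
// the mapping of logical CPU numbers to physical cores / HT siblings is
// system-dependent and would have to be queried on a real system.
#include <windows.h>
#include <cstdio>
#include <thread>

void worker(int id, DWORD_PTR mask) {
    // Restrict this thread to the logical processors in 'mask'.
    if (SetThreadAffinityMask(GetCurrentThread(), mask) == 0)
        std::printf("thread %d: failed to set affinity\n", id);

    volatile double x = 0.0;
    for (int i = 0; i < 10'000'000; ++i) x += i;   // placeholder work
    std::printf("thread %d finished on CPU %lu\n",
                id, GetCurrentProcessorNumber());
}

int main() {
    // Example only: put one worker on logical CPU 0 and one on CPU 2,
    // hoping they land on different physical cores.
    std::thread t0(worker, 0, DWORD_PTR(1) << 0);
    std::thread t1(worker, 1, DWORD_PTR(1) << 2);
    t0.join();
    t1.join();
}
```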
I have two primary complaints regarding your reply.
The first is that you completely misunderstood what I was saying and assumed I was talking about performance instead of utilization. I was trying to show how a program that creates two threads doesn't have to know anything about the underlying hardware when it comes to cores. Those two threads can easily take advantage of either cores or pseudocores based purely on what the thread scheduler does with them. As such, you could have written an MT program in 1995 that would end up utilizing HT and, later on, CMT. Would it benefit? That's a different story entirely, but it would certainly execute using both the core and the HT pseudocore if the thread scheduler decided to send it there. THAT is what I was demonstrating, and it is certainly correct.
The second complaint is that NO, I am not overstating what the scheduler does. If I create an app that spawns two threads and provide no further direction with regard to affinity or whatnot, it is 100% up to the scheduler to decide where and when to execute each thread. That's what I said in my post and that is absolutely correct.
Just relax and don't get your expectations too high for August.
Just pray that AMD pulls a magic rabbit out of its hat and gets Intel back on its toes and into competitive pricing again. I'm tired of this market being so one-sided (Intel: performance; AMD: budget).
So you don't need to code for HT to get it to work properly... but if HT gives great gains, does that mean the program wasn't properly written??? I think I'm getting a bit less lost LOL
But wasn't the point of HT to use the same hardware resources as the real core, just to try and remove idle gaps in the processing pipeline?
I didn't talk about TRIPS; that is just another architecture. I pointed to the paradigms shown before the TRIPS architecture was discussed as a solution. You can find many other talks (related to future CPUs from AMD) where Chuck Moore repeated most of the points.
Many things changed after NetBurst, among them the FO4 inversion delay per cycle and the proven methods for saving energy in a high-performance microarchitecture. Remember POWER6 or Cell, which were low-FO4 (short cycle time) designs.
One thing I can imagine being in Bulldozer is slow and fast clock domains (as in Cell), so that a fast clock only has to be provided to part of the logic.
That's correct. But several BD-related patents indicate that there could be many buffers and queues to help loosen the coupling between different units. There are also patents talking about data crossing clock domains. A simple case would be to have some units running at twice the clock frequency; you could even interleave the accesses of slow (half-frequency) units on a half-cycle basis.
No, shutting cores down just saves more power, but the required conditions are met more rarely, and running everything powered on is the inverse of that measure. You can use clock gating, but the relatively increasing static leakage will begin to hurt, and standard apps don't show the behaviour of a thermal virus. So imagine this: the power management powers off an ALU and an AGU if integer throughput indicates low utilization. It also powers off half of the cache if it isn't needed, and powers down half of the decoders because a µOp cache helps keep the remaining units busy. Finally it could reduce the clock of the retirement unit. Result: 96% of app performance at 60% of the power. This is just an example of what could be done.
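To put that example in perspective: 96% of the performance at 60% of the power works out to roughly 0.96 / 0.60 ≈ 1.6 times the performance per watt.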
Ya, your points are valid, and I knew it when I wrote what I did; my point was how low the benefit would be in HT's case.
Secondly, yes, exactly: it is 100% up to the scheduler to decide where and when to execute each thread, even if it makes a dodo and assigns a heavy job to the HT pipe.
The fact remains that SMT does help, but mostly with inefficient (or small) code, and once proper utilization of resources is achieved through a better OS allocator, HT's effectiveness will decrease in most cases.
The surprise is that everyone will get a pair of CAT shoes with each processor. Yay.
You definitely DON'T have to code for HT to have it "work", but if you really want to use an HT CPU to the fullest and get the most performance out of it, then I'm pretty sure you HAVE to take HT into account to prevent stalls in the pipeline, cache thrashing, etc. (see the sketch after this post).
But I'm not sure how big an effort that is compared to properly coding for multiple cores, because you have to take most of HT's limitations into account on a multicore platform as well. So if you code with HT in mind, that code will probably run close to perfectly on a multicore CPU without HT too.
Which is why it's a shame that most programmers ignored HT when Intel introduced it. It would have been the perfect preparation for multicore and wouldn't have resulted in the software stall we all saw a few years ago and still see now. The transition from single core to multicore would have been much smoother if everybody had picked up HT, I think.
I'm not a programmer, so please take everything I wrote here with a big grain of salt ^^
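To illustrate the cache-thrashing point from the post above (my sketch, not the poster's code), here is one HT-relevant pitfall and its usual fix: two threads updating counters that share a cache line will fight over that line, and on an HT core the two hardware threads even share the same L1 cache. Padding each counter onto its own cache line (the 64-byte size is an assumption typical of x86) avoids that, and the same fix helps on real multicore too.

```cpp
// Hypothetical illustration: padding per-thread counters to separate
// cache lines so two threads (for example the two HT siblings of one
// core, which share the L1 cache) don't fight over the same line.
#include <atomic>
#include <cstdint>
#include <iostream>
#include <thread>

constexpr std::size_t kCacheLine = 64;   // typical x86 cache-line size

struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
    // alignas() makes each PaddedCounter start on its own cache line,
    // and the struct is padded out to kCacheLine bytes as a result.
};

int main() {
    PaddedCounter counters[2];   // one counter per worker thread

    auto worker = [&counters](int id) {
        for (int i = 0; i < 10'000'000; ++i)
            counters[id].value.fetch_add(1, std::memory_order_relaxed);
    };

    std::thread t0(worker, 0), t1(worker, 1);
    t0.join();
    t1.join();

    std::cout << counters[0].value << " " << counters[1].value << "\n";
}
```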
I got the idea that it needed proper coding because HT doesn't have its own dedicated math hardware (FP units, ALUs, etc.) and shares those resources with the real core. That's why I got it all wrong in the beginning... or maybe there is still some part of it that I'm missing.
From the software layer's point of view, HT is in all respects a real core; the software does not know that it is just some duplicated structures in part of the CPU made to look like two cores, when in reality there is a single set of execution units.
So, if you code for multicore, you "code for HT". Basically, everybody writing heavy programs should code for multicore; when you do, HT can in most cases improve core performance by utilising the execution units more efficiently.