Haven't we addressed that same subject last page???
Actually, you don't need to program for HT; you need to program with parallelism in mind. In the multicore era, as long as your program is multithreaded it will take advantage of HT. A multiprocessor-aware program from 1995 designed to run on a Pentium Pro will take advantage of HT (or current multicore CPUs) without changes.
HT in the P4 pushed desktop apps to go multithreaded and anticipated the multicore era.
then help me figure it out with proof of how it actually works...
Well, let's start with multiprocessing and multithreading then.
Multiprocessing - using 2 or more central processing units; originally this meant servers with multiple CPUs, but now you have multiprocessing at the socket level thanks to multicore CPUs. For this to be used efficiently you need programs which use multiple threads, or you need to run multiple single-threaded programs in parallel. Software that can take advantage of multiprocessing configs has been around since the '50s.
Multithreading (as in SMT or HT) refers to a CPU's ability to process instructions from multiple threads simultaneously. In both single- and multiprocessing configurations it was soon discovered that a single thread isn't able to fully utilize the resources of the hardware; there is a limited amount of instruction-level parallelism that can be extracted from any given thread. A multithreaded core thus executes instructions from multiple threads at the same time. Those threads can be either from the same program or from different single-threaded programs (I'm talking here about SMT, or pure multithreading, and not derivatives like SoEMT). A multithreaded core appears to the software as a multiprocessing system (that's why a single real core shows up as multiple logical cores, like in Task Manager).
In fact, a Nehalem CPU is basically a multiprocessing multithreaded system at socket level.
If you connect the highlighted parts of the 2 concepts you'll understand why HyperThreading helps any software designed for multiprocessor use, and even desktop sessions where you are multitasking (running programs in parallel). That's why many Pentium 4 users felt their systems were snappier or more responsive when multitasking: you didn't have to flush the pipeline and switch to the new thread, they were running in parallel.
*Obviously I'm greatly simplifying everything here, but I hope you get the point.
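A minimal sketch of the point above (Python, purely illustrative; the function names are mine, not from any Intel doc): the program just spawns threads and never asks whether the processors the OS reports are physical cores or HT logical cores. Placement is entirely the scheduler's business.

```python
import os
import threading

def count_logical_cpus():
    # The OS reports logical processors; with HT, each physical core shows up twice.
    return os.cpu_count() or 1

def work(results, i):
    # Plain worker; it has no idea whether it lands on a real core or an HT sibling.
    results[i] = sum(range(100_000))

def run_parallel(n_threads=4):
    # Create threads with no affinity hints at all; the OS scheduler decides
    # where each one runs (real core, HT logical core, whatever exists).
    results = [0] * n_threads
    threads = [threading.Thread(target=work, args=(results, i))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

This is exactly why a multithreaded program written before HT existed can still end up using HT logical cores: it never needed to know about them.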
I was expecting some sort of paper from Intel clearly saying that you don't need to code for HT in order for it to work...
nope.
That paper simulated an unrealistic processor. A 24-instruction issue width? That (and some other things) might skew the results. I would take computer simulations of pretty much anything with a grain of salt.

Quote:
Second,
A module is not "a core". I can understand your objection, and have even asked a few questions myself about this architectural terminology. For instance, why doesn't Intel simply say that they have a 12 "core" product today?
The answer is in the performance scaling and the amount of shared resources. According to AMD, each "core" will scale 80% as well as 2 complete cores do today. The only real difference between "real" cores and BD "cores" is that BD cores share more resources than we have typically seen in existing architectures. Intel's SMT only gains 20-30% in the best situations.
CMT is nothing like SMT other than it attempts to share resources within a processor to execute more efficiently. AMD isn't calling their approach CMT (yet). I don't think they intend to at this point in time, but IMHO, it is pretty close to the definition.
This is a pretty good article: http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
I don't think that CMT is appreciably bigger than SMT considering the rather small portion of the die that a core currently takes up compared to cache. Additionally, it can afford to be somewhat larger since it scales the performance by a much bigger margin than does SMT.
Finally, scaling CMT to 3 or 4 cores should be of little difficulty for AMD in future iterations of BD. Not only would Intel gain little performance from moving to 4 way SMT from 2 way, but it would be very costly in development time and would still have the limitations and issues discussed in the article I linked to above.
I do agree, and I think CMT is a good idea, but it's definitely not going to be as simple as adding more execution units. Control logic and communications dominate die area, and they are more important to get right than execution. You can have 10,000 64-bit FPUs on a 32nm chip, but they are pretty much useless and nothing will take advantage of them.
Cache coherency is a race condition caused by memory latency, and I am only taking the behavioral domain into account. A core has an architecture, which is an algorithm; the algorithm must be able to execute instructions.

Quote:
Really? What about cache? What about cache coherency? Processors haven't been "independent" since the advent of SMP.
You're taking what I said out of context... I think. SMT only replicates the architectural state for each thread. CMT adds more ALUs running their own threads.

Quote:
Not according to the research I have read. Perhaps you have other information?
It's really easy to write bad code in ASM. It's for experts.

Quote:
Faster? Only if the ASM is poorly written. I agree with your other points though. Most applications have little need for the speed and efficiency of ASM (with the exception of embedded applications, and even most of these can be done in C with enough efficiency to get by competitively).
Would it help if a programmer who writes multithreaded code said he was right? I am one. He is right. You're doubting a very basic concept akin to debating if trees are primarily made out of metal instead of wood.
A thread is a thread. If your program creates four of them, the OS schedules those threads to run on any of the available pipelines. I'll use that term to refer to execution slots since what HT provides are used like logical cores but are not actually cores. That program's threads might be scheduled on a real core or not, but either way the program doesn't have to be aware. It created different threads and those threads can take advantage of anywhere the OS's thread scheduler wants to send them. It might be a real core, an HT pipe, or any future construct that allows for a logical appearance of being a "core".
There is a problem with the logic here: a multiprocessing-aware program will not automatically run great on multicore CPUs, and multicore systems are generally better than multiprocessing ones because of bandwidth and overall design.
A CPU with 1 core and HT is not in any way equal to one with 2 cores. There is no way SMT will have its own independent resources, because that would increase silicon size a lot. The only way they are even close is if a program is so badly written that the degraded performance of a real core seems similar to what SMT can achieve.
Yes, there is an impact from HT, but is it worthwhile for Intel to put HT in their latest CPUs? I don't think so. In the end it's all about how effective the code is and how well it runs and uses the CPU resources.
I also wanted to point out that you are all giving super responsibilities to the thread scheduler, and that's a bit much. Yes, it assigns priorities, but it does not do its job right 100% of the time. There are several places it can miss and mess up. Imagine putting a hungry runner on the HT pipes while a similar runner is on the real core: the one on the HT pipe will have to wait until enough resources are allocated before it can start.
You all recall the first K10s: those had a dynamic clock speed feature, and the thread scheduler actually messed up big on that, assigning workloads to cores that were clocked low, resulting in performance loss. The point is that the thread scheduler is not god; it makes mistakes, and a lot of them at that. They have tried to make it better in Vista.
The program doesn't have to be aware of logical or real cores, but it's better if it is; that will result in better program execution and less reliance on the thread scheduler.
I have two primary complaints regarding your reply.
The first is that you completely misunderstood what I was saying, assuming I was talking about performance instead of utilization. I was trying to show the person how a program that creates two threads doesn't have to know anything about the underlying hardware when it comes to cores. Those two threads can easily take advantage of either cores or pseudocores just based on what the thread scheduler does with them. As such, you could have written an MT program in 1995 that would end up utilizing HT and later on CMT. Would it benefit? That's a different story entirely, but it would certainly execute using both the core and the HT pseudocore if the thread scheduler decided it wants it to. THAT is what I was demonstrating, and it is certainly correct.
The second complaint is that NO, I am not overstating what the scheduler does. If I create an app that spawns two threads and provide no further direction with regard to affinity or whatnot, it is 100% up to the scheduler to decide where and when to execute each thread. That's what I said in my post and that is absolutely correct.
Just relax and don't get your expectations high for August.
Just pray that AMD pulls a magic rabbit out of their hat and gets Intel back on their toes and into competitive pricing again. I'm tired of this market being incredibly one-sided (Intel: performance. AMD: budget).
So you don't need to code to get HT to work properly... but if HT gives great gains, it means the program wasn't properly written??? I think I'm getting a bit less lost, LOL.
But wasn't the point of HT to use the same hardware resources as the real core, just to try and remove lag in the processing pipeline????
I didn't talk about TRIPS. That is just another architecture. I pointed to the paradigms shown before the TRIPS architecture was discussed as a solution. You can find many other talks (related to future CPUs from AMD) where Chuck Moore repeated most of the points.
Many things changed after Netburst, such as the FO4 inversion delay or the proven methods to save energy in a high-performance microarchitecture. Remember POWER6 or Cell, which were low-FO4 (short cycle time) designs.
One thing I can imagine to be in Bulldozer are slow and fast clock domains (as in Cell), so that a fast clock only has to be provided to a part of the logic.
That's correct. But several BD-related patents indicate that there could be many buffers and queues to help loosen the coupling between different units. There are also patents talking about data crossing clock domains. And a simple case would be to have units running at twice the clock frequency. You could even interleave the accesses of slow (half) clock frequency units on a half-cycle basis.
No, shutting cores down just saves more power, but the required conditions are met less often. And running everything powered on is the inverse of that measure. You can use clock gating, but relatively increasing static leakage will begin to hurt. Standard apps don't show the behaviour of a thermal virus. So imagine this: the power management powers off an ALU and an AGU if int throughput indicates low utilization. It also powers off half of the cache if it isn't needed, and it powers down half of the decoders because a µOp cache helps keep the remaining units busy. Finally, it could reduce the clock of the retirement unit. Result: 96% of app performance at 60% of the power. This is just an example of what could be done.
Yeah, your points are valid and I knew it when I wrote what I did; my point was how low the benefit would be in HT's case.
Secondly, yes, exactly: it is 100% up to the scheduler to decide where and when to execute each thread, even if it makes a blunder and assigns a heavy job to the HT pipe.
The fact remains that SMT does help, but mostly with ineffective code (or small code), and once proper utilization of resources is achieved (a better OS allocator), HT's effectiveness will decrease in most cases.
The surprise is that every1 will get a pair of CAT shoes with each processor. yay
You definitely DON'T HAVE to code for HT to have it "work", but if you really want to use an HT CPU to the fullest and get the most performance out of it, then I'm pretty sure you HAVE to take HT into account to prevent stalls in the pipeline, cache thrashing, etc...
But I'm not sure how big of an effort this is compared to properly coding for multiple cores... because you have to take most of HT's limitations into account on a multicore platform as well... so if you code with HT in mind, then that code will probably run close to perfect on a multicore CPU without HT as well...
Which is why it's a shame that most programmers ignored HT when Intel introduced it... it would have been the perfect preparation for multicore and wouldn't have resulted in the software stall we all saw a few years ago and still see now... the transition from single to multicore would have been much smoother if everybody had picked up HT, I think...
I'm not a programmer, so please take everything I wrote here with a big grain of salt ^^
I got the idea that it needed proper coding because HT didn't have dedicated math hardware (FP/ALU etc.)... it shared those resources with the real core... so that's why I got it all wrong in the beginning... or maybe there is still some part of it that I'm missing...
HT is, in all respects, a real core to the software layer; the software does not know that it is just some doubled-up parts in one section of the CPU made to look like two cores, when in reality it is a single set of execution units.
So, if you code for multicore, you "code for HT". Basically, everybody should code for multicore when writing heavy programs; when you do so, in most cases HT can improve core performance by utilising its execution units more efficiently.
The buffers and queues are there to keep the communication overhead of a clustered uarch down.
Could you give a link to that patent? I really would like to know how they are going to handle sequencing at high clock speeds. Even if you double the clock speed for a pipeline stage, there will still be a lot of complex issues like clock skew, power, and area. In the past AMD has made a few borked synchronizers. I don't know if they want to go in that direction with BD, but that was 30 years ago. :)
They should have some experience now with a differently clocked NB or HT PHY. However using a specific clock and 2x or 0.5x that clock as a second clock, should be ok to handle. Cell did it this way.
And some more recent patents:
http://www.freepatentsonline.com/y2008/0288805.html
http://www.freepatentsonline.com/y2009/0261869.html
http://www.freepatentsonline.com/y2010/0049887.html
http://www.freepatentsonline.com/7636803.html
and a paper (NB related, by some of the inventors):
http://www.computer.org/portal/web/c.../ASYNC.2007.21
We addressed nothing. I said it can't be coded for. Your response is that they should code for it.
I am thinking "what the ????" at your response.
I know that in Linux and FreeBSD HTT shows up as real CPUs; what I don't know is whether it's the same case in Windows or not.
E.g. on a FreeBSD server I have access to right now, it is reporting 8 processors on a quad-core HTT CPU.
Doesn't the Windows scheduler have some mechanism that assigns threads onto the first 2/4/6 physical/real cores first,
and only when all those real cores have been used does it start to allocate more threads to the HTT pipeline?
And don't coders have access to this mechanism if their threads initiate this resource call?
First, the more stalls/bubbles an app has, the more beneficial HTT will be. Otherwise, if an app is optimized as indicated above, HTT will also increase performance.

Quote:
6.2 Improving Application Performance on Hyper-Threading-Enabled Systems
In general, multithreaded Windows applications perform better when running unmodified on an HT processor than they do on a similarly equipped single-threaded processor. To optimize the application performance benefit on HT-enabled systems, the application should ensure that the threads executing on the two logical processors have minimal dependencies on the same shared resources on the physical processor. With an understanding of how the application threads and processes utilize the shared resources on an HT processor, setting processor affinity to minimize competition for these system resources can help application performance.
The following example scenarios describe good and bad ways to set thread affinities:
Good HT thread affinity example. Where an application has threads that produce data and threads that consume data, setting affinities so that consumer/producer thread pairs run on the logical processors of the same physical processor should improve performance. This configuration allows the threads to share cached data and to overlap operation. That is, the producer thread can produce future items while the consumer thread is consuming older items.
So, there are the call parameters.

Quote:
On HT-enabled systems, each logical processor is treated as an individual processor by the operating system and is represented by a bit in the system affinity mask. This is true for both HT-aware and non-HT-aware releases of the Windows operating system.
The system processor affinity mask can be read using the GetProcessAffinityMask function. The mask has a bit set for each processor in the system. The mask can be used by applications to set processor affinity for its threads and processes using the SetThreadAffinityMask or SetThreadIdealProcessor functions.
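For comparison, here is a hedged sketch of the same idea in Python on Linux; `os.sched_getaffinity` and `os.sched_setaffinity` are the rough analogues of the `GetProcessAffinityMask` / `SetThreadAffinityMask` calls named in the quote above (the helper names are mine).

```python
import os

def read_affinity():
    # Linux analogue of GetProcessAffinityMask: the set of logical CPUs
    # the calling thread is currently allowed to run on. With HT enabled,
    # each physical core contributes two entries to this set.
    return os.sched_getaffinity(0)

def pin_to_cpu(cpu):
    # Rough analogue of SetThreadAffinityMask: restrict the calling thread
    # to a single logical CPU. The mask-as-set API replaces Windows' bitmask.
    os.sched_setaffinity(0, {cpu})
```

A program could use this the way the Microsoft doc suggests: read the mask, then pin producer/consumer thread pairs onto the two logical processors of one physical core.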
Quote:
5.3 Using the YIELD (PAUSE) Instruction to Avoid Spinlock Contention
Where two logical processors on the same physical HT processor are competing for access to the same piece of data, the shared resources on the device can have the effect of "starving" one of the logical processors by, in effect, denying it access to the data. This is particularly significant when the piece of data is a spinlock, because the logical processor that is starved of access might own the spinlock. Intel recommends that logical processors be paused while executing spinlocks to alleviate this problem.
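The spinlock advice above can be sketched at a high level. Python can't emit the x86 PAUSE instruction, but `os.sched_yield` plays a comparable "be polite while spinning" role in this toy lock (all names here are mine, and this is an illustration of the idea, not Intel's recommended implementation).

```python
import os
import threading

class YieldingSpinLock:
    # Toy spinlock: spins on a flag, backing off each failed iteration so a
    # sibling logical processor isn't starved (the role PAUSE plays on x86).
    def __init__(self):
        self._held = False
        self._guard = threading.Lock()  # makes test-and-set atomic in this sketch

    def acquire(self):
        while True:
            with self._guard:
                if not self._held:
                    self._held = True
                    return
            os.sched_yield()  # analogue of PAUSE: step aside while spinning

    def release(self):
        with self._guard:
            self._held = False

def hammer(lock, counter, n):
    # Increment a shared counter n times under the spinlock.
    for _ in range(n):
        lock.acquire()
        counter[0] += 1
        lock.release()
```

Without the yield, a spinning logical processor can hog shared execution resources and slow down the very thread that holds the lock, which is exactly the starvation scenario the quote describes.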
Quote:
5.2 Aggressive HALT of Processors in the Idle Loop
When a processor in a system running the Windows operating system has no work to do, it enters the idle loop. If the first logical processor on an HT processor is executing instructions in the idle loop, that is, if it is not doing any real work, it is competing for shared resources, which degrades the performance capability of the second logical processor on the same physical processor. The result of this is to degrade the rate at which the second logical processor could do real work.
To minimize the impact of this, the idle loop in Windows XP and the Windows Server 2003 family has been modified to more aggressively HALT processors that are executing in the idle loop. After a logical processor has been halted, it no longer executes instructions and no longer competes for shared resources.
[ m$ ]

Quote:
The performance increase that is delivered when transitioning from one active logical processor to two active logical processors, on the same physical processor, is typically in the range of 10% to 30%. So on average the total system performance would be likely to increase from 200 to 220 (that is, it goes up by 10%).
This lower performance increase is due to the fact that two threads are competing for the use of the shared resources on one of the physical HT processors. So scheduling a thread onto an HT processor that already has an active logical processor has the following effects:
o Slowing down the performance of that active logical processor
o Limiting the performance of the new scheduled thread on the second logical processor
Obviously it was addressed; you just failed to read/understand it.
Linux has been SMT (HT) aware since kernel 2.4.18 and Windows since XP. The OS knows which cores are real cores and which cores are logical cores.
And you can code for it; at least in Windows there are certain functions available to the programmer to retrieve the mapping of the cores if you want to assign thread affinity manually.
http://www.xtremesystems.org/forums/...&postcount=262
Read the doc that is linked in that post. I am also fairly certain that there is a similar option for Linux.
Yep, I know about affinity, so at the very least that's available. I'll read up on the doc and share my thoughts after.
I am officially on board the Bulldozer Bandwagon...:YIPPIE:
'Give me 16 cores or give me death!':cheer2:
I hope the socket platform is disclosed.
I just bought AM3 with the intention of upgrading to BD, and now there are net rumours about an AM3 rev2.
Socket AM3r2 has been known for some time.
Sampsa put up a slide here on the forum last year.
IIRC socket AM2+ was called AM2r2 before it was released, so AM3r2 could
perhaps be called AM3+ at launch.
Hopefully a BIOS upgrade is the only thing that is needed, and the manufacturers
aren't so lazy that they try to avoid releasing them.
I'm pretty sure I read somewhere that AM3 will be compatible with Bulldozer. Unless something has changed since this slide was made.
http://img72.imageshack.us/img72/410...toproadmap.jpg
It's a real pity that Opterons can't be overclocked due to lack of motherboards. The cost-effectiveness of the platform is of great value, especially to advanced home users. Socket longevity, more cores for lower prices etc. I'd be willing to pay more for a multi-processor Opteron board with overclocking abilities. I'm sure most duallie fans will do the same. Maybe we should setup a poll to measure the consensus?
Can anyone confirm the 9 core rumour? :)
Rumor about 9 cores is not true.
Opteron is targeted at commercial server applications. It would be extremely expensive to support the consumer market so that will not happen. I have gone through the economics on several forums, but let me dispel the two biggest myths quickly:
1. There is not a "huge market" for server parts in consumer environments. There are definitely people that will want to do this, but it is a very small part of the market.
2. It is not inexpensive to "just add support". Essentially you are doubling a lot of the back end costs.
I would never stop a consumer from doing this because it is my job to sell more processors. But I would warn that if you go down that path, you won't see the level of support that you will see on Phenom and other consumer brands.
Who was the guy responsible for Socket 939 Opteron 1xx? They were extremely famous and popular here around 2006 because you could get Athlon 64 FX-worthy bins in an Opteron 165 for 250 USD or so. Besides that, they used Toledo JH-E6 parts with 1 MB of L2 cache per core, while comparable A64 X2s were Manchester BH-E4s with 512 KB of L2 cache. The enthusiast market ate those "server parts" up.
About the 9-core issue: the "128-bit core that interconnects the 8 64-bit ones" sounds like the 128-bit IMC and a crossbar or something.
I can guarantee you that while those were popular with enthusiasts, that was not a net benefit for AMD. And we did not sell nearly as many as you probably think. I have a Fox Vanilla 140 on my bike and so do a lot of the people that I ride with, but that does not mean it is the most popular fork out there. Just real popular with my friends.
Opteron 1xx parts were in shortage back in their time, so I suppose that the enthusiast market did have an impact on them. Obviously it wasn't a net benefit, because you were selling the highest quality bin at very cheap prices that cannibalized sales of the A64 counterparts. But for those who got their hands on them, it was wonderful.
Each BD core has 4 ALUs and at least 3 AGUs. I found some nice numbers in Open64 sources again. More (plus SB BOINC stats) here http://citavia.blog.de/2010/07/06/bu...-core-8927293/
It seems, Hiroshige Goto needs to redraw his diagrams (and me too).
Thanks for the new update,looks very interesting :) . Those integer cores should be quite powerful now that we know that Bobcat is pretty speedy too.
K8->K8L: 2 load or 1 load and 1 store per cycle
Bobcat: 1 load and 1 store per cycle
Bulldozer: 2 load and 1 store per cycle
Core: 2 load or 2 store per cycle
K10 can only do 64 bit stores. So for K10: 2 loads (128 bit) or 1 load and 1 store (64 bit) or 2 stores (64 bit) per cycle
Bulldozer: 2 loads and 1 store (likely 128 bit each) per cycle for each thread.
Thus BD could have about twice the L/S bandwidth per cycle compared to K10 (on average two 128 bit loads and one 128 bit store - actually 2x64 - every two cycles) when taking into account a 2R1W pattern.
One Sandy Bridge core can also do 2 loads and 1 store (128-bit each) every cycle, or twice the width (256-bit) every two cycles, or one 128-bit load and a 256-bit store, thanks to its 48 B/cycle combined L/S bandwidth.
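To put the per-cycle figures above on a common footing, here is a tiny bytes-per-cycle calculation (the widths are the ones assumed in the discussion above, and the function name is mine):

```python
def bytes_per_cycle(loads, load_bits, stores, store_bits):
    # Peak load/store bandwidth in bytes per cycle from port counts and widths.
    return (loads * load_bits + stores * store_bits) // 8

# K10 best case: 2 x 128-bit loads, no stores that cycle.
k10_loads_only = bytes_per_cycle(2, 128, 0, 0)

# Bulldozer (as speculated above): 2 x 128-bit loads + 1 x 128-bit store.
bd_per_thread = bytes_per_cycle(2, 128, 1, 128)

# Sandy Bridge: 2 loads + 1 store at 128-bit each = the stated 48 B/cycle.
snb_combined = bytes_per_cycle(2, 128, 1, 128)
```

On these assumptions the BD figure lands at 48 B/cycle against K10's 32 B/cycle load-only peak, which is where the "roughly twice the L/S bandwidth on a 2R1W pattern" estimate comes from (K10 needs extra cycles for the 64-bit stores).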
JF, you talked about +50% performance for Opteron with 33% more cores.
We all know that performance is P = IPC x frequency.
So is it +50% performance, or +50% IPC? And is it +50% for the 12-core highest-frequency Opteron, or for 6 cores in SPECint, I suppose?
If it's performance, then you must already know the final frequency, and ES should already be running. That's good news. I can't wait for the 24 ^^ :D
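A quick sanity check of the arithmetic behind the question: if total throughput goes up 50% while the core count goes up by a third (12 to 16), the implied per-core (IPC x frequency) gain under ideal scaling is about 12.5%. A sketch, assuming perfect core scaling (which real workloads won't hit):

```python
def per_core_gain(total_perf_gain, core_count_gain):
    # performance = IPC * frequency * cores (idealized, perfect scaling);
    # back out the implied per-core (IPC * frequency) change.
    return (1 + total_perf_gain) / (1 + core_count_gain) - 1

# +50% total performance with 33.3% more cores (12 -> 16 cores):
gain = per_core_gain(0.50, 4 / 12)
```

So even taking the +50% at face value, most of it is the extra cores; the per-core improvement it implies is modest unless scaling losses eat into the core contribution, in which case per-core gains would have to be bigger.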
When I saw the "50% perf increase" news spread all over the world, I found a lot of people criticizing AMD: "Not impressive, Intel may still be on top of the world!", "WTF has AMD been doing all these years??"
Hey man, when the product isn't ready, please don't release any negative news about the product, or it will affect the stock! :D:shrug:
superrugal: right. And I've never seen a new CPU product with 50% more performance. If we look at it clock for clock, it's not a big increase (look at one core of the Core architecture vs one core of the Nehalem architecture; for example, there's not a big difference between one Q9550 core and one i7 Nehalem core with HTT disabled).
If you think that 50% "for the whole chip" is not much, then you need a reality check. Many applications don't scale perfectly to many cores, so extracting the "single core" perf. increase out of this generic average figure is pretty much pointless (not knowing clock speeds makes it even more pointless). Not to mention this is a conservative estimate, and the MC platform is drop-in compatible with the BD CPUs.
My experience with numbers like these is that they come from theoretical benches like SPECint and SPECfp. So I guess the performance estimate from JF is pretty close to 50% more theoretical performance rather than practical.
BUT I also believe that these benches are at a clock speed that might be on the conservative side.
So if all goes well, we'll see higher numbers than these.
Of course it is SPECint and SPECfp. I definitely trust these more than stuff like Cinebench and 3DMark CPU test. :shakes:
More and more I am interested with the multithreaded performance. This single core Celeron M is getting on my nerves when doing some number crunching.
Mr. Fruehe has made a slight clarification of the 50% number in another thread.
I sincerely hope that by major server workloads he doesn't mean synthetic
benchmarks, but real-world workloads. It will be interesting to see where we end up
in the end, and whether AMD has been hyping too much.
There's also the perspective that it's the newest 32nm chip vs the best 45nm chip AMD is offering.
If we look at the growth of 45nm, we had a PII 920 and a 1055T, both 2.8GHz, both 125W TDP, but one has 50% more cores and Turbo. (Then we can even go farther and compare it to the 95W version; in reality, it's sick how much they can improve something on the same process.)
It might be a little doubtful that they can pack more BD cores into the server-side chips, so 16 might be the max. But with more experience they will improve that process enough that higher clocks should be expected at the same TDP across the life of 32nm BD.
I'm following your thoughts; not really arguing, just putting up some more
information/clarification.
Your line "...isn't comparable with desktop workloads" cannot be stressed enough.
People are using info from the server side way too literally when they are talking
about desktop chips in these threads about AMD's Bulldozer architecture.
So Hot Chips finally arrives ;)
Still a few days away:
Conference Day Two: August 24, 2010 (Tuesday)
Session 7: New Processor Architectures (Session Chair: Bevan Baas, UC Davis) 5:00 - 6:30 (pm)
* The Next-generation System z Micro-Processor
Authors: Brian Curran
Affiliations: IBM
* AMD "Bulldozer" Core - a new approach to multithreaded compute performance for maximum efficiency and throughput
Authors: Mike Butler
Affiliations: AMD
* AMD's "Bobcat" x86 Core - Small, Efficient and Strong
Authors: Brad Burgess
Affiliations: AMD
Is this being covered by anyone? I mean should I expect to hear anything before Wednesday morning (or very late Tuesday night)?
Just one question:
are the slides green or red?
Black is good, but I was just curious whether they stuck with their old-fashioned green theme, or whether they were going to surprise us with some ATI red that we hope sticks around after the name is dropped.
Well, I also wouldn't expect a change so quickly. These slides have probably been around longer than the announcement of the ATI name being dropped.
2011 should be an xtremely interesting year for PCs; so much change coming, and who knows what might stick to the walls.
Any info about the time zone concerning the 24th?
The 24th will start in Tokyo in ~16 hours, but in Los Angeles it is still 31hours away ;-)
It's still eternity away :D
It's already the 24th here in Sweden so you must have your dates mixed somehow. Can't be more than 24h tops until the 24th anywhere...
I thought there's 5 hours left until 24th on almost all places in the world?
I can't go to sleep now because it's already 24th here in U.K. :eek:
It's funny how some people are knowledge hungry :up:
If it comes right on time then Hexus will be the first once again eheh.
ok
So... Hot Chips. And JF-AMD is nowhere to be found. Why doesn't this surprise me?
No, because he has a job to do and he does most of his forum surfing at night or really early in the morning.
I won't be at hot chips, I am not an engineer. And we don't go on until late on Tuesday anyway.
lol, you know I would quote you, BO... but it would be cliché at this point ;p
Maybe some crazy plan for an ode to the Bulldozer launch? :-) Turning a car into a bulldozer with AMD logos :-D?
We're waiting...
Any live coverage of Hot Chips 22?
I hear we will get the slides later, and some chats late tonight or tomorrow.