AMD to Disclose Details About Bulldozer Micro-Architecture in August

**Chumbucket843** · 06-27-2010, 01:19 PM

Originally Posted by madcho

I would be seriously disappointed if AMD chose to use an API to drive the accelerator.

The best way would be to "simply" extend x86-64 with a new instruction set.

APIs provide a level of abstraction that is very useful for targeting different hardware. the only reason to use x86 would be to save the cost of designing a new ISA. x86-64 is essentially x86 with 2x the registers and 64-bit support. at its core it is still plagued with the performance issues of x86. it's very wasteful, and consequently instruction decoding is a bottleneck in virtually every x86 cpu.

what you have suggested has sort of been done with SSE. i have wondered what a cpu that just did SSE would be capable of in terms of performance and compatibility; it's probably not so great. when you have that much compute density you really need to save memory bandwidth with streaming. SSE doesn't handle that as well as it could.

Yes, it's not easy work, but it would be a much faster way to get a performance improvement: update the compilers and recompile the code.

And kill that f*cking x87, free some die space.

you don't update a compiler for a new architecture or in some cases even extensions, you start from scratch. that's why compilers take forever to mature.

And JF : directcompute is proprietary.

which has advantages over an open standard, such as being much faster to support new features. no board of people filibustering APIs = things get done.

also directcompute is arguably more vendor neutral than opencl or opengl. khronos group allows proprietary extensions in OGL & OCL, whereas DC doesn't. nvidia sort of abuses the extension system, which is no surprise.

**Movieman** · 06-27-2010, 03:14 PM

Originally Posted by gOJDO

Yeap.

Obviously I'm just pi$$ing off some die-hard AMD fanboys.

This thread is funny. It's a 10 pages waste in the database, so I'm adding fuel for the next 10

@informal
Intel is better than AMD

No, you're not. This isn't the first time but it's the last.
You are approx. 30 seconds from being past history on this forum.
Sayonara!

**JF-AMD** · 06-27-2010, 07:05 PM

Since directcompute will run on intel or AMD and run on the ~90% of the market that runs windows, it doesn't rank as proprietary in my mind.

**madcho** · 06-27-2010, 10:20 PM

Originally Posted by JF-AMD

Since directcompute will run on intel or AMD and run on the ~90% of the market that runs windows, it doesn't rank as proprietary in my mind.

It's compatible with 95% of the market, but it's not open so it's proprietary.

Yes, it's good to write code that is compatible with the CPUs and GPUs on the market, but that's no reason to say "we want open standards" when in reality it's not true.

And now steam is available on linux, with some real games. And what if we want to switch to linux? That's not really possible, because ATI's drivers are really bad on that OS.

AMD needs to improve its communication, for sure.

**Chrysalis** · 06-27-2010, 11:20 PM

Originally Posted by hollo

that's the programmer's fault, not intel's

how can an app programmer fix that?

HTT shows up as extra CPUs in the OS. You could maybe lay the blame on the OS, in that it should show a difference between a virtual CPU and a real CPU, but things like mysql get told it's a real CPU.

**Sn0wm@n** · 06-27-2010, 11:30 PM

Originally Posted by Chrysalis

how can an app programmer fix that?

HTT shows up as extra CPUs in the OS. You could maybe lay the blame on the OS, in that it should show a difference between a virtual CPU and a real CPU, but things like mysql get told it's a real CPU.

it's the programmer who has to code for intel's HT implementation iirc

it's the same as intel or amd implementing AVX... no program will benefit from it until they code for it .....

**Mechanical Man** · 06-28-2010, 01:54 AM

You can't be serious. HT isn't anything like AVX or any other extension. For HT you should not need to do anything beyond programming your program to use multiple cores. If HT is broken in your workload, then switch it off.

And no, it's not the programmer's fault. It is Intel's fault.

**JF-AMD** · 06-28-2010, 03:51 AM

Actually for HT, you do want to program for it. Just telling it that you have more cores is a fiasco because you'll end up scheduling too many threads to one core while other cores are sitting idle.

Remember that with HT, 2 cores do not execute at the same time, one has to wait for the other. So if you had a quad core with HT and you wanted to schedule 4 threads, you would put them all on different cores for the best speed. If you just said take the first 4 cores, you would get two threads sharing the first core, 2 sharing the second, and two idle cores.

In a server workload, HT gives you a 10-20% increase in throughput, so do you want 2 threads sharing for that 10-20% increase, or do you want all of your threads on individual cores?
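
The placement problem described above can be sketched with a toy model. This is only an illustration, not how any real scheduler works: it assumes a quad core with two hardware threads per core, and it assumes logical CPUs 2k and 2k+1 share physical core k (real systems enumerate logical CPUs in different orders). The names `naive_placement` and `spread_placement` are made up for this sketch.

```python
from collections import Counter

# Toy model: 4 physical cores with 2 hardware threads each.
# Assumption: logical CPUs 2k and 2k+1 share physical core k.
PHYSICAL_CORES = 4
THREADS_PER_CORE = 2


def core_of(logical_cpu):
    """Return the physical core that a logical CPU belongs to."""
    return logical_cpu // THREADS_PER_CORE


def naive_placement(n_threads):
    """Take the first n logical CPUs, ignoring the topology."""
    return list(range(n_threads))


def spread_placement(n_threads):
    """Use one logical CPU per physical core before doubling up."""
    order = [core * THREADS_PER_CORE + sibling
             for sibling in range(THREADS_PER_CORE)
             for core in range(PHYSICAL_CORES)]
    return order[:n_threads]


def load_per_core(placement):
    """Count how many software threads land on each physical core."""
    counts = Counter(core_of(cpu) for cpu in placement)
    return [counts.get(core, 0) for core in range(PHYSICAL_CORES)]


print(load_per_core(naive_placement(4)))   # [2, 2, 0, 0]
print(load_per_core(spread_placement(4)))  # [1, 1, 1, 1]
```

With 4 threads, the naive choice doubles up on two cores and leaves two idle, while the spread choice gives every thread its own core.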

**zir_blazer** · 06-28-2010, 04:07 AM

There are very good examples of that, but that is because the Windows XP Task Scheduler lacks intelligence. When Intel had the Pentium D Smithfield and a Pentium Extreme Edition version of it that had Hyper Threading (Core count and Frequency being the same), the Pentium D was faster in all applications with 2 Threads because Windows XP on the EE usually assigned them to the first two Cores, which were the Processor's first Physical Core and its own Logical Core, leaving the entire second Core Idle.
I don't think that you should need to program with Hyper Threading in mind. It happens because you have to make sure that your application is aware of Hyper Threading so it does not use Logical Cores before the Physical Cores are fully used; however, that is a workaround to compensate for the Task Scheduler's stupidity in not knowing which Core is Physical and which is Logical, and not assigning the Threads first to the Physical Cores and then to the Logical ones. If the Task Scheduler were more intelligent, you wouldn't actually need Hyper Threading specific considerations because the OS would take care of it.

**Helmore** · 06-28-2010, 04:14 AM

That just shows you that Windows XP is an old (and obsolete) OS. More modern OSes like Windows 7 and Linux handle that much better.

**Dresdenboy** · 06-28-2010, 04:30 AM

Originally Posted by JF-AMD

Remember that with HT, 2 cores do not execute at the same time, one has to wait for the other. So if you had a quad core with HT and you wanted to schedule 4 threads, you would put them all on different cores for the best speed. If you just said take the first 4 cores, you would get two threads sharing the first core, 2 sharing the second, and two idle cores.

Just some nitpicking here:
What you describe is "fine-grain temporal multithreading" (Wiki), with one thread per pipeline stage and clock cycle. This is true only for some parts of an SMT pipeline. In particular, some of the execution units could execute thread 0 while the others are used for thread 1 in the same clock cycle. But resource contention could happen easily.
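
A very rough toy model of that distinction, assuming a core with a fixed number of execution units and two threads that each have some number of ready operations per cycle. Everything here is hypothetical and hugely simplified: operations that do not issue in a cycle are simply not counted, and no real pipeline works this way.

```python
def temporal_mt(ready_a, ready_b, units):
    """Fine-grain temporal multithreading: threads alternate cycles."""
    done = 0
    for cycle, (a, b) in enumerate(zip(ready_a, ready_b)):
        ready = a if cycle % 2 == 0 else b  # only one thread issues per cycle
        done += min(ready, units)
    return done


def smt(ready_a, ready_b, units):
    """SMT: both threads may issue in the same cycle, sharing the units."""
    done = 0
    for a, b in zip(ready_a, ready_b):
        done += min(a + b, units)  # contention when a + b exceeds units
    return done


# Made-up per-cycle counts of ready operations for two threads.
thread_0 = [2, 1, 3, 2]
thread_1 = [1, 2, 1, 3]

print(temporal_mt(thread_0, thread_1, units=4))  # 10
print(smt(thread_0, thread_1, units=4))          # 14
```

In the last cycle the two threads have 5 ready operations between them but only 4 units, which is the resource contention case.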

**Hornet331** · 06-28-2010, 04:45 AM

Originally Posted by zir_blazer

There are very good examples of that, but that is because the Windows XP Task Scheduler lacks intelligence. When Intel had the Pentium D Smithfield and a Pentium Extreme Edition version of it that had Hyper Threading (Core count and Frequency being the same), the Pentium D was faster in all applications with 2 Threads because Windows XP on the EE usually assigned them to the first two Cores, which were the Processor's first Physical Core and its own Logical Core, leaving the entire second Core Idle.
I don't think that you should need to program with Hyper Threading in mind. It happens because you have to make sure that your application is aware of Hyper Threading so it does not use Logical Cores before the Physical Cores are fully used; however, that is a workaround to compensate for the Task Scheduler's stupidity in not knowing which Core is Physical and which is Logical, and not assigning the Threads first to the Physical Cores and then to the Logical ones. If the Task Scheduler were more intelligent, you wouldn't actually need Hyper Threading specific considerations because the OS would take care of it.

I call BS on that since windows xp is aware of HT cores.
http://download.microsoft.com/downlo...ad_Windows.doc

Read this doc, it describes how thread assignment works in xp/2003. Since Windows XP the scheduler always tries to utilize idle physical cores before logical cores. But if the software you're using is hardcoded to use certain cores, without asking the os which cores are smt cores and which are not, it's not the fault of the os, since it provides the possibility to do that.
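
As an illustration of "asking the os which cores are smt cores", here is a minimal sketch for Linux rather than Windows XP. It assumes the kernel exposes `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`; the helper names are made up for this example. On Windows the corresponding query would, as far as I know, be `GetLogicalProcessorInformation`, which is not shown here.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0,4" or "0-1,8-9" into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_siblings(sibling_lists):
    """Turn {cpu: sibling-list text} into a sorted list of sibling groups."""
    groups = {tuple(parse_cpu_list(text)) for text in sibling_lists.values()}
    return sorted(groups)


def read_linux_topology(base="/sys/devices/system/cpu"):
    """Read thread_siblings_list for every CPU; empty dict if unavailable."""
    result = {}
    pattern = os.path.join(base, "cpu[0-9]*", "topology", "thread_siblings_list")
    for path in glob.glob(pattern):
        cpu = int(path.split(os.sep)[-3][3:])  # ".../cpu12/topology/..." -> 12
        with open(path) as handle:
            result[cpu] = handle.read()
    return result


def one_cpu_per_core(groups):
    """Pick the first logical CPU of each physical core."""
    return [group[0] for group in groups]


# Made-up sample data in the sysfs format: 2 cores, 2 hardware threads each.
example = {0: "0,4\n", 1: "1,5\n", 4: "0,4\n", 5: "1,5\n"}
groups = group_siblings(example)
print(groups)                    # [(0, 4), (1, 5)]
print(one_cpu_per_core(groups))  # [0, 1]
```

An application that insists on pinning its own threads could pin to the CPUs returned by `one_cpu_per_core` first, instead of hardcoding CPU numbers.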

**zir_blazer** · 06-28-2010, 05:32 AM

Originally Posted by Hornet331

I call BS on that since windows xp is aware of HT cores.
http://download.microsoft.com/downlo...ad_Windows.doc

Read this doc, it describes how thread assignment works in xp/2003. Since Windows XP the scheduler always tries to utilize idle physical cores before logical cores. But if the software you're using is hardcoded to use certain cores, without asking the os which cores are smt cores and which are not, it's not the fault of the os, since it provides the possibility to do that.

I know that WXP is supposed to be HT aware, yet there were still quite a few instances where it performed significantly worse with it enabled. I recall that the PEE (Pun intended) was bashed quite a bit for that 5 years ago.
I have been looking for those Benchmarks, but I wasn't able to find them at the popular reviewers (Anandtech, X-Bit Labs, Tech Report), where it looks to perform basically the same if the application uses just two Threads. There was at least one Article that I saw that reported the issues and why, but I don't recall where I saw it.

**mstp2009** · 06-28-2010, 05:42 AM

Originally Posted by JF-AMD

Actually for HT, you do want to program for it. Just telling it that you have more cores is a fiasco because you'll end up scheduling too many threads to one core while other cores are sitting idle.

Remember that with HT, 2 cores do not execute at the same time, one has to wait for the other. So if you had a quad core with HT and you wanted to schedule 4 threads, you would put them all on different cores for the best speed. If you just said take the first 4 cores, you would get two threads sharing the first core, 2 sharing the second, and two idle cores.

In a server workload, HT gives you a 10-20% increase in throughput, so do you want 2 threads sharing for that 10-20% increase, or do you want all of your threads on individual cores?

What a load of CRAP. There is so much wrong with these statements I just don't know where I should begin.

First, The thread scheduler of the OS takes care of this, you NEVER have to code your application to say what "core" you want it to run on.

Good thread schedulers fill up "real" cores before they "double up" to an HT core. That's just the way of things, and has been for a very long time. Since Win Server 2003 and Linux 2.6.x at the very least.

Second, A "virtual" core never "waits" on the real core, or vice versa, to complete its computations. TWO THREADS can be pushed down the same "real + virtual" core at the same time.

XBitLabs has the best Diagram I have seen for this:

Is the solution proposed in BD better? Likely. But does Intel's solution improve overall IPC and resource utilization? Absolutely.

I run servers both with AMD Istanbuls and Intel Nehalems in a cloud environment and push the limits of how many threads can be concurrently run on very very high-end hardware. It's how I make money - how many VM Servers can we push onto a real server without degrading performance.

I love the AMDs because they are CHEAP, low power, and do the job WELL, but I'll be frank with you right here and now: Even fully loaded on all "real" and "virtual" cores, the Nehalem cloud server runs circles around the Istanbul ones that we have deployed in terms of number of customers that can be crammed onto the system without "overloading" the CPU resources. So while the Intel system costs more, on other metrics, it costs me less: LESS power per customer consumed, MORE customers per 2U of rack space (less datacenter costs). So in the end, it is about a wash from my perspective.

JF-AMD - Just because you have a particular agenda to push, please stop spreading FUD about the competition when it is clear you don't have the technical background to do so. Not trying to be rude, but you yourself have said you are a marketing guy, not an engineer.

**ajaidev** · 06-28-2010, 06:11 AM

Originally Posted by mstp2009

What a load of CRAP. There is so much wrong with these statements I just don't know where I should begin.

First, The thread scheduler of the OS takes care of this, you NEVER have to code your application to say what "core" you want it to run on.

Good thread schedulers fill up "real" cores before they "double up" to an HT core. That's just the way of things, and has been for a very long time. Since Win Server 2003 and Linux 2.6.x at the very least.

Second, A "virtual" core never "waits" on the real core, or vice versa, to complete its computations. TWO THREADS can be pushed down the same "real + virtual" core at the same time.

XBitLabs has the best Diagram I have seen for this:

Is the solution proposed in BD better? Likely. But does Intel's solution improve overall IPC and resource utilization? Absolutely.
.................................................. ........

Ahhhh you should also do a little checking....

The thread scheduler issues priority levels, yes, but those 32 levels can overflow quite easily, and then what? The second core is used; that is one of the reasons HT has a negative impact on some programs.

The virtual core has no resources of its own, so it shares the real core's resources. Now, when a certain amount of resources is in use, the virtual thread cannot be initialized until resources are free.

**duploxxx** · 06-28-2010, 07:02 AM

Originally Posted by mstp2009

What a load of CRAP. There is so much wrong with these statements I just don't know where I should begin.

First, The thread scheduler of the OS takes care of this, you NEVER have to code your application to say what "core" you want it to run on.

Good thread schedulers fill up "real" cores before they "double up" to an HT core. That's just the way of things, and has been for a very long time. Since Win Server 2003 and Linux 2.6.x at the very least.

Second, A "virtual" core never "waits" on the real core, or vice versa, to complete its computations. TWO THREADS can be pushed down the same "real + virtual" core at the same time.

XBitLabs has the best Diagram I have seen for this:

Is the solution proposed in BD better? Likely. But does Intel's solution improve overall IPC and resource utilization? Absolutely.

I run servers both with AMD Istanbuls and Intel Nehalems in a cloud environment and push the limits of how many threads can be concurrently run on very very high-end hardware. It's how I make money - how many VM Servers can we push onto a real server without degrading performance.

I love the AMDs because they are CHEAP, low power, and do the job WELL, but I'll be frank with you right here and now: Even fully loaded on all "real" and "virtual" cores, the Nehalem cloud server runs circles around the Istanbul ones that we have deployed in terms of number of customers that can be crammed onto the system without "overloading" the CPU resources. So while the Intel system costs more, on other metrics, it costs me less: LESS power per customer consumed, MORE customers per 2U of rack space (less datacenter costs). So in the end, it is about a wash from my perspective.

JF-AMD - Just because you have a particular agenda to push, please stop spreading FUD about the competition when it is clear you don't have the technical background to do so. Not trying to be rude, but you yourself have said you are a marketing guy, not an engineer.

Whether to use SMT with virtualization depends on the application type. When you have several applications with constant CPU load that depend on each other, forget about HT: scheduling fails on those HT cores and performance decreases. When you have a bunch of small independent VMs, sure, go ahead with HT; it will increase the consolidation.

As for Istanbul vs Nehalem, why drag that in? This has been known for a year by now: Istanbul just didn't have the CPU speed to tackle the high-end Nehalem. Try again with a few MCs and see what virtualization beasts they are, at a much lower price and with support for more memory.

**JF-AMD** · 06-28-2010, 07:21 AM

Originally Posted by mstp2009

Even fully loaded on all "real" and "virtual" cores, the Nehalem cloud server runs circles around the Istanbul ones that we have deployed in terms of number of customers that can be crammed onto the system without "overloading" the CPU resources.

If I had posted up that Magny Cours kicks Harpertown to the curb there would be 100 people posting up that it is not a fair comparison.

I recognize that you are making comparisons off of your own servers, but let's be clear here, you can always find some data point at some place in time to back up your assumptions. But today's discussion is not about Nehalem and Istanbul, it is about Magny Cours and Westmere. And in that arena, we are doing fine in performance, power consumption, and most importantly, price.

**mstp2009** · 06-28-2010, 07:36 AM

Originally Posted by ajaidev

Ahhhh you should also do a little checking....

The thread scheduler issues priority levels, yes, but those 32 levels can overflow quite easily, and then what? The second core is used; that is one of the reasons HT has a negative impact on some programs.

The virtual core has no resources of its own, so it shares the real core's resources. Now, when a certain amount of resources is in use, the virtual thread cannot be initialized until resources are free.

That is of course a given. Anyone overriding the core assignment should REALLY know what they are doing, and that is a special case. The discussion was on DEFAULT behaviour on systems where core assignment is based on thread scheduler.

Originally Posted by duploxxx

Whether to use SMT with virtualization depends on the application type. When you have several applications with constant CPU load that depend on each other, forget about HT: scheduling fails on those HT cores and performance decreases. When you have a bunch of small independent VMs, sure, go ahead with HT; it will increase the consolidation.

As for Istanbul vs Nehalem, why drag that in? This has been known for a year by now: Istanbul just didn't have the CPU speed to tackle the high-end Nehalem. Try again with a few MCs and see what virtualization beasts they are, at a much lower price and with support for more memory.

Sorry, but MC and Westmere were too expensive as of 2 mos ago when we did our latest deployments. We will re-evaluate them at the next expansion (say 3 mo or so).

**savantu** · 06-28-2010, 08:13 AM

Originally Posted by ajaidev

Ahhhh you should also do a little checking....

The thread scheduler issues priority levels, yes, but those 32 levels can overflow quite easily, and then what? The second core is used; that is one of the reasons HT has a negative impact on some programs. The virtual core has no resources of its own, so it shares the real core's resources. Now, when a certain amount of resources is in use, the virtual thread cannot be initialized until resources are free.

Do you have some data to show where Nehalem HT actually has a negative impact ? And I don't mean single or pseudo-threaded game engines.

First and foremost, HT allows an increase in throughput. You can have a core with HT disabled which does 100 work units per thread in a given time frame. You enable 2-thread HT and now you do 70 work units per thread in the same amount of time.

Does HT have a negative impact? From a thread point of view yes, you're 30% slower per thread. But from a workload point of view? No, you've done 40% more work (2x70=140 work units).
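
The arithmetic above can be written out as a small helper. The function name `smt_tradeoff` is made up for this sketch, and the 100 and 70 work units are the illustrative figures from the post, not measurements.

```python
def smt_tradeoff(single_thread_units, per_thread_units_with_smt, threads_per_core=2):
    """Return total work with SMT plus the relative per-thread and total change."""
    total = per_thread_units_with_smt * threads_per_core
    per_thread_change = (per_thread_units_with_smt - single_thread_units) / single_thread_units
    throughput_change = (total - single_thread_units) / single_thread_units
    return total, per_thread_change, throughput_change


total, per_thread, throughput = smt_tradeoff(100, 70)
print(total)                 # 140
print(f"{per_thread:+.0%}")  # -30%
print(f"{throughput:+.0%}")  # +40%
```

The break-even point in this model is 50 work units per thread: below that, two threads sharing a core get less total work done than one thread running alone.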

Especially in Nehalem ( in P4 HT did not have that many units to start with, its main task was to hide memory latency ), HT is a definite plus.

AMD's approach in BD is totally different. A BD module is basically a souped up core with double the INT units or, conversely, it's a module with 2 INT cores and a shared FP unit.
Their idea is that it's not worth tinkering with the core itself, but simply cramming more cores (or clusters if you want) on the same die. Improvements in process tech allow you to put 6-10 cores in a reasonable die area, the next process is 12-16, then 30 and so on. Why bother with SMT when you'll end up with dozens of "real" cores, as many as the number of threads today? When your resources are more limited, it's not worth doing SMT. Simply do CMT, copy and paste as many cores as possible on a die and you're done.

**Chumbucket843** · 06-28-2010, 08:41 AM

linpack is so arithmetic intensive that all HT does is cut the amount of cache in half. it's just overhead. the amount of operands accessed from main memory is usually under 1%.

http://www.anandtech.com/show/3470

on the other hand HT can increase performance by up to 70%. with the 3rd channel it's easily 2x faster than other 45nm processors.
http://blogs.sun.com/4HPCISVs/entry/..._on_nehalem_an

**JF-AMD** · 06-28-2010, 09:36 AM

Originally Posted by savantu

Do you have some data to show where Nehalem HT actually has a negative impact ? And I don't mean single or pseudo-threaded game engines.

First and foremost, HT allows an increase in throughput. You can have a core with HT disabled which does 100 work units per thread in a given time frame. You enable 2-thread HT and now you do 70 work units per thread in the same amount of time.

Does HT have a negative impact? From a thread point of view yes, you're 30% slower per thread. But from a workload point of view? No, you've done 40% more work (2x70=140 work units).

Especially in Nehalem ( in P4 HT did not have that many units to start with, its main task was to hide memory latency ), HT is a definite plus.

AMD's approach in BD is totally different. A BD module is basically a souped up core with double the INT units or, conversely, it's a module with 2 INT cores and a shared FP unit.
Their idea is that it's not worth tinkering with the core itself, but simply cramming more cores (or clusters if you want) on the same die. Improvements in process tech allow you to put 6-10 cores in a reasonable die area, the next process is 12-16, then 30 and so on. Why bother with SMT when you'll end up with dozens of "real" cores, as many as the number of threads today? When your resources are more limited, it's not worth doing SMT. Simply do CMT, copy and paste as many cores as possible on a die and you're done.

Here are some examples:

http://blogs.amd.com/work/2010/01/21...out-the-cores/

Also, look at LINPACK (HPC) for another example of turning off HT and getting higher throughput.

Games are different than servers. Games generally have a lot more gaps in processing so they take better advantage of HT than servers.

**~~OhNoes!~~** · 06-28-2010, 09:53 AM

Originally Posted by JF-AMD

Here are some examples:

http://blogs.amd.com/work/2010/01/21...out-the-cores/

Also, look at LINPACK (HPC) for another example of turning off HT and getting higher throughput.

Games are different than servers. Games generally have a lot more gaps in processing so they take better advantage of HT than servers.

Frankly, this is a non-issue. All servers can be, and are, configured and customized to maximize efficiency in whatever software environment they are deployed in. If HT hurts performance in a particular software environment, it's simply turned off. This is not even a question of "it does more good than harm, so let's keep it on". It is configured to maximize output right off the bat.

One other thing, Intel is capable of putting more cores on a silicon too (with HT, of course) so what's AMD going to do when Intel matches your cores?

**Manicdan** · 06-28-2010, 10:08 AM

how big is a SMT core or module vs a CMT core or module on 32nm? ignoring L3

**~~OhNoes!~~** · 06-28-2010, 10:13 AM

Originally Posted by Manicdan

how big is a SMT core or module vs a CMT core or module on 32nm? ignoring L3

With regard to what uarchs?

Edit: Bulldozer/Westmere of course. I don't have the numbers, but it all has to do with design complexity and tdp. Intel already has 48 cores on a die with a 125w tdp. Also, the Larrabee design should give us a big clue about what Intel is capable of.

**Manicdan** · 06-28-2010, 10:15 AM

the ones that make the most sense to compare
