Haven't we addressed that same subject last page???
Actually, you don't need to program for HT; you need to program with parallelism in mind. In the multicore era, as long as your program is multithreaded it will take advantage of HT. A multiprocessor-aware program from 1995 designed to run on a Pentium Pro will take advantage of HT (or current multicore CPUs) without changes.
HT in the P4 pushed desktop apps to go multithreaded and anticipated the multicore era.
then help me figure it out with proof of how it actually works...
Well, let's start with multiprocessing and multithreading then.
Multiprocessing - using 2 or more central processing units; originally this meant servers with multiple CPUs, but now you have multiprocessing at the socket level thanks to multicore CPUs. For this to be used efficiently you need programs which use multiple threads, or you need to run multiple single-threaded programs in parallel. Software that can take advantage of multiprocessing configs has been around since the '50s.
Multithreading (as in SMT or HT) refers to a CPU's ability to process instructions from multiple threads simultaneously. In both single- and multiprocessing configurations it was soon discovered that a single thread isn't able to fully utilize the resources of the hardware; there is a limited amount of instruction-level parallelism that can be extracted from any given thread. A multithreaded core thus executes instructions from multiple threads at the same time. Those threads can be either from the same program or from different single-threaded programs (I'm talking here about SMT, or pure multithreading, and not derivatives like SoEMT). A multithreaded core appears to the software as a multiprocessing system (that's why a single real core shows up as multiple logical cores, like in Task Manager).
In fact, a Nehalem CPU is basically a multiprocessing multithreaded system at socket level.
If you connect the highlighted parts of the 2 concepts you'll understand why HyperThreading helps any software designed for multiprocessor use, and even desktop sessions where you are multitasking (running programs in parallel). That's why many Pentium 4 users felt their systems were snappier or more responsive when multitasking: you didn't have to flush the pipeline and switch to the new thread, they were running in parallel.
*Obviously I'm greatly simplifying everything here, but I hope you get the point.
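A minimal sketch of the point above (Python, purely illustrative; the function names are mine, not from any Intel doc): the program just spawns threads and never asks whether the processors the OS reports are physical cores or HT logical cores. Placement is entirely the scheduler's business.

```python
import os
import threading

def count_logical_cpus():
    # The OS reports logical processors; with HT, each physical core shows up twice.
    return os.cpu_count() or 1

def work(results, i):
    # Plain worker; it has no idea whether it lands on a real core or an HT sibling.
    results[i] = sum(range(100_000))

def run_parallel(n_threads=4):
    # Create threads with no affinity hints at all; the OS scheduler decides
    # where each one runs (real core, HT logical core, whatever exists).
    results = [0] * n_threads
    threads = [threading.Thread(target=work, args=(results, i))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

This is exactly why a multithreaded program written before HT existed can still end up using HT logical cores: it never needed to know about them.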
I was expecting some sort of paper from Intel clearly saying that you don't need to code for HT in order for it to work...
nope.
That paper simulated an unrealistic processor. A 24-instruction issue width? That (and some other things) might skew the results. I would take computer simulations of pretty much anything with a grain of salt.

Quote:
Second,
A module is not "a core". I can understand your objection, and have even asked a few questions myself about this architectural terminology. For instance, why doesn't Intel simply say that they have a 12 "core" product today?
The answer is in the performance scaling and the amount of shared resources. According to AMD, each "core" will scale 80% as well as 2 complete cores do today. The only real difference between "real" cores and BD "cores" is that BD cores share more resources than we have typically seen in existing architectures. Intel's SMT only gains 20-30% in the best situations.
CMT is nothing like SMT other than it attempts to share resources within a processor to execute more efficiently. AMD isn't calling their approach CMT (yet). I don't think they intend to at this point in time, but IMHO, it is pretty close to the definition.
This is a pretty good article: http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
I don't think that CMT is appreciably bigger than SMT considering the rather small portion of the die that a core currently takes up compared to cache. Additionally, it can afford to be somewhat larger since it scales the performance by a much bigger margin than does SMT.
Finally, scaling CMT to 3 or 4 cores should be of little difficulty for AMD in future iterations of BD. Not only would Intel gain little performance from moving to 4 way SMT from 2 way, but it would be very costly in development time and would still have the limitations and issues discussed in the article I linked to above.
I do agree, and I think CMT is a good idea, but it's definitely not going to be as simple as adding more execution units. Control logic and communications dominate die area, and they are more important to get right than execution. You can have 10,000 64-bit FPUs on a 32nm chip, but they are pretty much useless and nothing will take advantage of them.
Cache coherency is a race condition caused by memory latency, and I am only taking the behavioral domain into account. A core has an architecture, which is an algorithm; the algorithm must be able to execute instructions.

Quote:
Really? What about cache? What about cache coherency? Processors haven't been "independent" since the advent of SMP.
You're taking what I said out of context... I think. SMT only replicates the architectural state for each thread. CMT adds more ALUs running their own threads.

Quote:
Not according to the research I have read. Perhaps you have other information?
It's really easy to write bad code in ASM. It's for experts.

Quote:
Faster? Only if the ASM is poorly written. I agree with your other points though. Most applications have little need for the speed and efficiency of ASM (with the exception of embedded applications, and even most of these can be done in C with enough efficiency to get by competitively).
Would it help if a programmer who writes multithreaded code said he was right? I am one. He is right. You're doubting a very basic concept akin to debating if trees are primarily made out of metal instead of wood.
A thread is a thread. If your program creates four of them, the OS schedules those threads to run on any of the available pipelines. I'll use that term to refer to execution slots since what HT provides are used like logical cores but are not actually cores. That program's threads might be scheduled on a real core or not, but either way the program doesn't have to be aware. It created different threads and those threads can take advantage of anywhere the OS's thread scheduler wants to send them. It might be a real core, an HT pipe, or any future construct that allows for a logical appearance of being a "core".
There is a problem with the logic here: a multiprocessing-aware program will not automatically run great on multicore CPUs, and multicore systems are generally better than multiprocessing ones because of bandwidth and overall design.
A CPU with 1 core and HT is not in any way equal to one with 2 cores. There is no way SMT will have its own independent resources, because that would increase silicon size a lot. The only way they are even close is if a program is so badly written that the degraded performance of a real core seems similar to what SMT can achieve.
Yes, there is an impact from HT, but is it worthwhile for Intel to put HT in their latest CPUs? I don't think so. In the end it's all about how effective the code is and how well it runs and uses the CPU resources.
I also wanted to point out that you are all giving super responsibilities to the thread scheduler, and that's a bit much. Yes, it assigns priorities, but it does not do its job right 100% of the time. There are several places it can miss and mess up. Imagine putting a hungry runner on the HT pipes while a similar runner is on the real core: the one on the HT pipe will have to wait until enough resources are allocated before it can start.
You all recall the first K10s: those had a dynamic clock speed feature, and the thread scheduler actually messed up big on that, assigning workloads to cores that were clocked low, resulting in performance loss. The point is that the thread scheduler is not god; it makes mistakes, and a lot of them at that. They have tried to make it better in Vista.
The program doesn't have to be aware of logical or real cores, but it's better if it is; that will result in better program execution and less reliance on the thread scheduler.
I have two primary complaints regarding your reply.
The first is that you completely misunderstood what I was saying, assuming I was talking about performance instead of utilization. I was trying to show the person how a program that creates two threads doesn't have to know anything about the underlying hardware when it comes to cores. Those two threads can easily take advantage of either cores or pseudocores just based on what the thread scheduler does with them. As such, you could have written an MT program in 1995 that would end up utilizing HT and later on CMT. Would it benefit? That's a different story entirely, but it would certainly execute using both the core and the HT pseudocore if the thread scheduler decided it wants it to. THAT is what I was demonstrating, and it is certainly correct.
The second complaint is that NO, I am not overstating what the scheduler does. If I create an app that spawns two threads and provide no further direction with regard to affinity or whatnot, it is 100% up to the scheduler to decide where and when to execute each thread. That's what I said in my post and that is absolutely correct.
Just relax and don't get your expectations high for August.
Just pray that AMD pulls a magic rabbit out of their hat and gets Intel back on their toes and into competitive pricing again. I'm tired of this market being incredibly one-sided (Intel: performance. AMD: budget).
So you don't need to code to get HT to work properly... but if HT gives great gains, it means the program wasn't properly written??? I think I'm getting a bit less lost, LOL.
But wasn't the point of HT to use the same hardware resources as the real core, just to try and remove lag in the processing pipeline????
I didn't talk about TRIPS. That is just another architecture. I pointed to the paradigms shown before the TRIPS architecture was discussed as a solution. You can find many other talks (related to future CPUs from AMD) where Chuck Moore repeated most of the points.
Many things changed after Netburst, such as the FO4 inversion delay or the proven methods to save energy in a high-performance microarchitecture. Remember POWER6 or Cell, which were low-FO4 (short cycle time) designs.
One thing I can imagine to be in Bulldozer are slow and fast clock domains (as in Cell), so that a fast clock only has to be provided to a part of the logic.
That's correct. But several BD-related patents indicate that there could be many buffers and queues to help loosen the coupling between different units. There are also patents talking about data crossing clock domains. And a simple case would be to have units running at twice the clock frequency. You could even interleave the accesses of slow (half) clock frequency units on a half-cycle basis.
No, shutting cores down just saves more power, but the required conditions are met less often. And running everything powered on is the inverse of that measure. You can use clock gating, but relatively increasing static leakage will begin to hurt. Standard apps don't show the behaviour of a thermal virus. So imagine this: the power management powers off an ALU and an AGU if int throughput indicates low utilization. It also powers off half of the cache if it isn't needed, and it powers down half of the decoders because a µOp cache helps keep the remaining units busy. Finally, it could reduce the clock of the retirement unit. Result: 96% of app performance at 60% of the power. This is just an example of what could be done.
Yeah, your points are valid and I knew it when I wrote what I did; my point was how low the benefit would be in HT's case.
Secondly, yes, exactly: it is 100% up to the scheduler to decide where and when to execute each thread, even if it makes a blunder and assigns a heavy job to the HT pipe.
The fact remains that SMT does help, but mostly with ineffective code (or small code), and once proper utilization of resources is achieved (a better OS allocator), HT's effectiveness will decrease in most cases.
The surprise is that every1 will get a pair of CAT shoes with each processor. yay
You definitely DON'T HAVE to code for HT to have it "work", but if you really want to use an HT CPU to the fullest and get the most performance out of it, then I'm pretty sure you HAVE to take HT into account to prevent stalls in the pipeline, cache thrashing, etc...
But I'm not sure how big of an effort this is compared to properly coding for multiple cores... because you have to take most of HT's limitations into account on a multicore platform as well... so if you code with HT in mind, then that code will probably run close to perfect on a multicore CPU without HT as well...
Which is why it's a shame that most programmers ignored HT when Intel introduced it... it would have been the perfect preparation for multicore and wouldn't have resulted in the software stall we all saw a few years ago and still see now... the transition from single to multicore would have been much smoother if everybody had picked up HT, I think...
I'm not a programmer, so please take everything I wrote here with a big grain of salt ^^
I got the idea that it needed proper coding because HT didn't have dedicated math hardware (FP/ALU etc.)... it shared those resources with the real core... so that's why I got it all wrong in the beginning... or maybe there is still some part of it that I'm missing...
HT is, in all respects, a real core to the software layer; the software does not know that it is just some doubled-up parts in one section of the CPU made to look like two cores, when in reality it is a single set of execution units.
So, if you code for multicore, you "code for HT". Basically, everybody should code for multicore when writing heavy programs; when you do so, in most cases HT can improve core performance by utilising its execution units more efficiently.
The buffers and queues are there to keep the communication overhead of a clustered uarch down.
Could you give a link to that patent? I really would like to know how they are going to handle sequencing at high clock speeds. Even if you double the clock speed for a pipeline stage, there will still be a lot of complex issues like clock skew, power, and area. In the past AMD has made a few borked synchronizers. I don't know if they want to go in that direction with BD, but that was 30 years ago. :)
They should have some experience now with a differently clocked NB or HT PHY. However using a specific clock and 2x or 0.5x that clock as a second clock, should be ok to handle. Cell did it this way.
And some more recent patents:
http://www.freepatentsonline.com/y2008/0288805.html
http://www.freepatentsonline.com/y2009/0261869.html
http://www.freepatentsonline.com/y2010/0049887.html
http://www.freepatentsonline.com/7636803.html
and a paper (NB related, by some of the inventors):
http://www.computer.org/portal/web/c.../ASYNC.2007.21
We addressed nothing. I said it can't be coded for. Your response is that they should code for it.
I am thinking "what the ????" at your response.
I know that in Linux and FreeBSD HTT shows up as real CPUs; what I don't know is whether it's the same case in Windows or not.
E.g. on a FreeBSD server I have access to right now, it is reporting 8 processors on a quad-core HTT CPU.
Doesn't the Windows scheduler have some mechanism that assigns threads onto the first 2/4/6 physical/real cores first,
and only when all those real cores have been used does it start to allocate more threads to the HTT pipeline?
And don't coders have access to this mechanism if their threads initiate this resource call?
First, the more stalls/bubbles an app has, the more beneficial HTT will be. Otherwise, if an app is optimized as indicated above, HTT will also increase performance.

Quote:
6.2 Improving Application Performance on Hyper-Threading-Enabled Systems
In general, multithreaded Windows applications perform better when running unmodified on an HT processor than they do on a similarly equipped single-threaded processor. To optimize the application performance benefit on HT-enabled systems, the application should ensure that the threads executing on the two logical processors have minimal dependencies on the same shared resources on the physical processor. With an understanding of how the application threads and processes utilize the shared resources on an HT processor, setting processor affinity to minimize competition for these system resources can help application performance.
The following example scenarios describe good and bad ways to set thread affinities:
Good HT thread affinity example. Where an application has threads that produce data and threads that consume data, setting affinities so that consumer/producer thread pairs run on the logical processors of the same physical processor should improve performance. This configuration allows the threads to share cached data and to overlap operation. That is, the producer thread can produce future items while the consumer thread is consuming older items.
So, there are the call parameters.

Quote:
On HT-enabled systems, each logical processor is treated as an individual processor by the operating system and is represented by a bit in the system affinity mask. This is true for both HT-aware and non-HT-aware releases of the Windows operating system.
The system processor affinity mask can be read using the GetProcessAffinityMask function. The mask has a bit set for each processor in the system. The mask can be used by applications to set processor affinity for its threads and processes using the SetThreadAffinityMask or SetThreadIdealProcessor functions.
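For comparison, here is a hedged sketch of the same idea in Python on Linux; `os.sched_getaffinity` and `os.sched_setaffinity` are the rough analogues of the `GetProcessAffinityMask` / `SetThreadAffinityMask` calls named in the quote above (the helper names are mine).

```python
import os

def read_affinity():
    # Linux analogue of GetProcessAffinityMask: the set of logical CPUs
    # the calling thread is currently allowed to run on. With HT enabled,
    # each physical core contributes two entries to this set.
    return os.sched_getaffinity(0)

def pin_to_cpu(cpu):
    # Rough analogue of SetThreadAffinityMask: restrict the calling thread
    # to a single logical CPU. The mask-as-set API replaces Windows' bitmask.
    os.sched_setaffinity(0, {cpu})
```

A program could use this the way the Microsoft doc suggests: read the mask, then pin producer/consumer thread pairs onto the two logical processors of one physical core.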
Quote:
5.3 Using the YIELD (PAUSE) Instruction to Avoid Spinlock Contention
Where two logical processors on the same physical HT processor are competing for access to the same piece of data, the shared resources on the device can have the effect of "starving" one of the logical processors by, in effect, denying it access to the data. This is particularly significant when the piece of data is a spinlock, because the logical processor that is starved of access might own the spinlock. Intel recommends that logical processors be paused while executing spinlocks to alleviate this problem.
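The spinlock advice above can be sketched at a high level. Python can't emit the x86 PAUSE instruction, but `os.sched_yield` plays a comparable "be polite while spinning" role in this toy lock (all names here are mine, and this is an illustration of the idea, not Intel's recommended implementation).

```python
import os
import threading

class YieldingSpinLock:
    # Toy spinlock: spins on a flag, backing off each failed iteration so a
    # sibling logical processor isn't starved (the role PAUSE plays on x86).
    def __init__(self):
        self._held = False
        self._guard = threading.Lock()  # makes test-and-set atomic in this sketch

    def acquire(self):
        while True:
            with self._guard:
                if not self._held:
                    self._held = True
                    return
            os.sched_yield()  # analogue of PAUSE: step aside while spinning

    def release(self):
        with self._guard:
            self._held = False

def hammer(lock, counter, n):
    # Increment a shared counter n times under the spinlock.
    for _ in range(n):
        lock.acquire()
        counter[0] += 1
        lock.release()
```

Without the yield, a spinning logical processor can hog shared execution resources and slow down the very thread that holds the lock, which is exactly the starvation scenario the quote describes.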
Quote:
5.2 Aggressive HALT of Processors in the Idle Loop
When a processor in a system running the Windows operating system has no work to do, it enters the idle loop. If the first logical processor on an HT processor is executing instructions in the idle loop, that is, if it is not doing any real work, it is competing for shared resources, which degrades the performance capability of the second logical processor on the same physical processor. The result of this is to degrade the rate at which the second logical processor could do real work.
To minimize the impact of this, the idle loop in Windows XP and the Windows Server 2003 family has been modified to more aggressively HALT processors that are executing in the idle loop. After a logical processor has been halted, it no longer executes instructions and no longer competes for shared resources.
[ m$ ]

Quote:
The performance increase that is delivered when transitioning from one active logical processor to two active logical processors, on the same physical processor, is typically in the range of 10% to 30%. So on average the total system performance would be likely to increase from 200 to 220 (that is, it goes up by 10%).
This lower performance increase is due to the fact that two threads are competing for the use of the shared resources on one of the physical HT processors. So scheduling a thread onto an HT processor that already has an active logical processor has the following effects:
o Slowing down the performance of that active logical processor
o Limiting the performance of the new scheduled thread on the second logical processor
Obviously it was addressed; you just failed to read/understand it.
Linux has been SMT (HT) aware since kernel 2.4.18 and Windows since XP. The OS knows which cores are real cores and which cores are logical cores.
And you can code for it; at least in Windows there are certain functions available to the programmer to retrieve the mapping of the cores if you want to assign thread affinity manually.
http://www.xtremesystems.org/forums/...&postcount=262
Read the doc that is linked in that post. I am also fairly certain that there is a similar option for Linux.
Yep, I know about affinity, so at the very least that's available. I'll read up on the doc and share my thoughts after.
I am officially on board the Bulldozer Bandwagon...:YIPPIE:
'Give me 16 cores or give me death!':cheer2:
I hope the socket platform is disclosed.
I just bought AM3 with the intention of upgrading to BD, and now there are net rumours about an AM3 rev2.
Socket AM3r2 has been known for some time.
Sampsa put up a slide here on the forum last year.
IIRC socket AM2+ was called AM2r2 before it was released, so AM3r2 could
perhaps be called AM3+ at launch.
Hopefully a BIOS upgrade is the only thing that is needed, and the manufacturers
aren't so lazy that they try to avoid releasing them.
I'm pretty sure I read somewhere that AM3 will be compatible with Bulldozer. Unless something has changed since this slide was made.
http://img72.imageshack.us/img72/410...toproadmap.jpg
It's a real pity that Opterons can't be overclocked due to lack of motherboards. The cost-effectiveness of the platform is of great value, especially to advanced home users. Socket longevity, more cores for lower prices etc. I'd be willing to pay more for a multi-processor Opteron board with overclocking abilities. I'm sure most duallie fans will do the same. Maybe we should setup a poll to measure the consensus?
Can anyone confirm the 9 core rumour? :)
Rumor about 9 cores is not true.
Opteron is targeted at commercial server applications. It would be extremely expensive to support the consumer market so that will not happen. I have gone through the economics on several forums, but let me dispel the two biggest myths quickly:
1. There is not a "huge market" for server parts in consumer environments. There are definitely people that will want to do this, but it is a very small part of the market.
2. It is not inexpensive to "just add support". Essentially you are doubling a lot of the back end costs.
I would never stop a consumer from doing this because it is my job to sell more processors. But I would warn that if you go down that path, you won't see the level of support that you will see on Phenom and other consumer brands.
Who was the guy responsible for Socket 939 Opteron 1xx? They were extremely famous and popular here around 2006 because you could get Athlon 64 FX-worthy bins in an Opteron 165 for 250 USD or so. Besides that, they used Toledo JH-E6 parts with 1 MB of L2 cache per core, while comparable A64 X2s were Manchester BH-E4s with 512 KB of L2 cache. The enthusiast market ate those "server parts" up.
About the 9-core issue: the "128-bit core that interconnects the 8 64-bit ones" sounds like the 128-bit IMC and a crossbar or something.
I can guarantee you that while those were popular with enthusiasts, that was not a net benefit for AMD. And we did not sell nearly as many as you probably think. I have a Fox Vanilla 140 on my bike and so do a lot of the people that I ride with, but that does not mean it is the most popular fork out there. Just real popular with my friends.
Opteron 1xx parts were in shortage back in their time, so I suppose that the enthusiast market did have an impact on them. Obviously it wasn't a net benefit, because you were selling the highest quality bin at very cheap prices that cannibalized sales of the A64 counterparts. But for those who got their hands on them, it was wonderful.
Each BD core has 4 ALUs and at least 3 AGUs. I found some nice numbers in Open64 sources again. More (plus SB BOINC stats) here http://citavia.blog.de/2010/07/06/bu...-core-8927293/
It seems, Hiroshige Goto needs to redraw his diagrams (and me too).
Thanks for the new update,looks very interesting :) . Those integer cores should be quite powerful now that we know that Bobcat is pretty speedy too.
K8->K8L: 2 load or 1 load and 1 store per cycle
Bobcat: 1 load and 1 store per cycle
Bulldozer: 2 load and 1 store per cycle
Core: 2 load or 2 store per cycle
K10 can only do 64 bit stores. So for K10: 2 loads (128 bit) or 1 load and 1 store (64 bit) or 2 stores (64 bit) per cycle
Bulldozer: 2 loads and 1 store (likely 128 bit each) per cycle for each thread.
Thus BD could have about twice the L/S bandwidth per cycle compared to K10 (on average two 128 bit loads and one 128 bit store - actually 2x64 - every two cycles) when taking into account a 2R1W pattern.
One Sandy Bridge core can also do 2 loads and 1 store (128-bit each) every cycle, or twice the width (256-bit) every two cycles, or one 128-bit load and a 256-bit store, thanks to its 48 B/cycle combined L/S bandwidth.
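To put the per-cycle figures above on a common footing, here is a tiny bytes-per-cycle calculation (the widths are the ones assumed in the discussion above, and the function name is mine):

```python
def bytes_per_cycle(loads, load_bits, stores, store_bits):
    # Peak load/store bandwidth in bytes per cycle from port counts and widths.
    return (loads * load_bits + stores * store_bits) // 8

# K10 best case: 2 x 128-bit loads, no stores that cycle.
k10_loads_only = bytes_per_cycle(2, 128, 0, 0)

# Bulldozer (as speculated above): 2 x 128-bit loads + 1 x 128-bit store.
bd_per_thread = bytes_per_cycle(2, 128, 1, 128)

# Sandy Bridge: 2 loads + 1 store at 128-bit each = the stated 48 B/cycle.
snb_combined = bytes_per_cycle(2, 128, 1, 128)
```

On these assumptions the BD figure lands at 48 B/cycle against K10's 32 B/cycle load-only peak, which is where the "roughly twice the L/S bandwidth on a 2R1W pattern" estimate comes from (K10 needs extra cycles for the 64-bit stores).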
JF, you talked about +50% performance for Opteron with 33% more cores.
We all know that performance is P = IPC x frequency.
So is it +50% performance, or +50% IPC? And is it +50% for the 12-core highest-frequency Opteron, or for 6 cores in SPECint, I suppose?
If it's performance, then you must already know the final frequency, and ES should already be running. That's good news. I can't wait for the 24 ^^ :D
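A quick sanity check of the arithmetic behind the question: if total throughput goes up 50% while the core count goes up by a third (12 to 16), the implied per-core (IPC x frequency) gain under ideal scaling is about 12.5%. A sketch, assuming perfect core scaling (which real workloads won't hit):

```python
def per_core_gain(total_perf_gain, core_count_gain):
    # performance = IPC * frequency * cores (idealized, perfect scaling);
    # back out the implied per-core (IPC * frequency) change.
    return (1 + total_perf_gain) / (1 + core_count_gain) - 1

# +50% total performance with 33.3% more cores (12 -> 16 cores):
gain = per_core_gain(0.50, 4 / 12)
```

So even taking the +50% at face value, most of it is the extra cores; the per-core improvement it implies is modest unless scaling losses eat into the core contribution, in which case per-core gains would have to be bigger.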
When I saw the "50% perf increase" news spread all over the world, I found a lot of people criticizing AMD: "Not impressive, Intel may still be on top of the world!", "WTF has AMD been doing all these years??"
Hey man, when the product isn't ready, please don't release any negative news about the product, or it will affect the stock! :D:shrug:
superrugal: right. And I've never seen a new CPU product with 50% more performance. If we look at it clock for clock, it's not a big increase (look at one core of the Core architecture vs one core of the Nehalem architecture; for example, there's not a big difference between one Q9550 core and one i7 Nehalem core with HTT disabled).
If you think that 50% "for the whole chip" is not much, then you need a reality check. Many applications don't scale perfectly to many cores, so extracting the "single core" perf. increase out of this generic average figure is pretty much pointless (not knowing clock speeds makes it even more pointless). Not to mention this is a conservative estimate, and the MC platform is drop-in compatible with the BD CPUs.
My experience with numbers like these is that they come from theoretical benches like SPECint and SPECfp. So I guess the performance estimate from JF is pretty close to 50% more theoretical performance rather than practical.
BUT I also believe that these benches are at a clock speed that might be on the conservative side.
So if all goes well, we'll see higher numbers than these.
Of course it is SPECint and SPECfp. I definitely trust these more than stuff like Cinebench and 3DMark CPU test. :shakes:
More and more I am interested with the multithreaded performance. This single core Celeron M is getting on my nerves when doing some number crunching.
Mr. Fruehe has made a slight clarification of the 50% number in another thread.
I sincerely hope that by major server workloads he doesn't mean synthetic
benchmarks, but real-world workloads. It will be interesting to see where we end up
in the end, and whether AMD has been hyping too much.
There's also the perspective that it's the newest 32nm chip vs the best 45nm chip AMD is offering.
If we look at the growth of 45nm, we had a PII 920 and a 1055T, both 2.8GHz, both 125W TDP, but one has 50% more cores and Turbo. (Then we can even go farther and compare it to the 95W version; in reality, it's sick how much they can improve something on the same process.)
It might be a little doubtful that they can pack more BD cores into the server-side chips, so 16 might be the max. But with more experience they will improve that process enough that higher clocks should be expected at the same TDP across the life of 32nm BD.
I'm following your thoughts; not really arguing, just putting up some more
information/clarification.
Your line "...isn't comparable with desktop workloads" cannot be stressed enough.
People are using info from the server side way too literally when they are talking
about desktop chips in these threads about AMD's Bulldozer architecture.
So Hot Chips finally arrives ;)
Still a few days away:
Conference Day Two: August 24, 2010 (Tuesday)
Session 7: New Processor Architectures (Session Chair: Bevan Baas, UC Davis) 5:00 - 6:30 (pm)
* The Next-generation System z Micro-Processor
Authors: Brian Curran
Affiliations: IBM
* AMD "Bulldozer" Core - a new approach to multithreaded compute performance for maximum efficiency and throughput
Authors: Mike Butler
Affiliations: AMD
* AMD's "Bobcat" x86 Core - Small, Efficient and Strong
Authors: Brad Burgess
Affiliations: AMD
Is this being covered by anyone? I mean should I expect to hear anything before Wednesday morning (or very late Tuesday night)?
Just one question:
are the slides green or red?
Black is good, but I was just curious whether they stuck with their old-fashioned green theme, or whether they were going to surprise us with some ATI red that we hope sticks around after the name is dropped.
Well, I also wouldn't expect a change so quickly. These slides have probably been around longer than the announcement of the ATI name being dropped.
2011 should be an xtremely interesting year for PCs; so much change coming, and who knows what might stick to the walls.
Any info about the time zone concerning the 24th?
The 24th will start in Tokyo in ~16 hours, but in Los Angeles it is still 31hours away ;-)
It's still eternity away :D
It's already the 24th here in Sweden so you must have your dates mixed somehow. Can't be more than 24h tops until the 24th anywhere...
I thought there's 5 hours left until 24th on almost all places in the world?
I can't go to sleep now because it's already 24th here in U.K. :eek:
It's funny how some people are knowledge hungry :up:
If it comes right on time then Hexus will be the first once again eheh.
ok
So... Hot Chips. And JF-AMD is nowhere to be found. Why doesn't this surprise me?
No, because he has a job to do and he does most of his forum surfing at night or really early in the morning.
I won't be at hot chips, I am not an engineer. And we don't go on until late on Tuesday anyway.
lol, you know I would quote you, BO... but it would be cliché at this point ;p
Maybe some crazy plan for an ode to the Bulldozer launch? :-) Turning a car into a bulldozer with AMD logos :-D?
We're waiting...
Any live coverage of Hot Chips 22?
I hear we will get the slides later, and some chats late tonight or tomorrow.