AMD FX "Bulldozer" Review - (4) !exclusive! Excuse for 1-Threaded Perf.

**To(V)bo Co(V)bo** · 10-14-2011, 07:46 PM

I think AMD should have given out more samples to us here at XS, and this lack of support could have been fixed way before launch. We are doing almost all the R&D for them right now anyway. This was just way too sloppy and rushed a release. I would rather have had a proper release with support than a half-a$$ one with all this negative publicity hurting the product.

**dess** · 10-14-2011, 08:17 PM

Originally Posted by The Stilt

Indeed there should be an update coming for Windows which optimizes Turbo Core functionality on Zambezi.
Currently Windows (7 at least) is throwing the load from core to core, which sometimes neutralizes the effect of the Turbo Core feature.
This is because the load is not being run on the currently boosted core(s).

So you mean one that huddles threads together as much as possible to enable max. turbo, right? But we want the opposite, as the findings show it's more beneficial to populate only one integer cluster per CU... So, really we only need a rather small patch that enables the SMT-aware scheduling Win7 already knows, as some say.

But guys... please...

When talking about Zambezi please use the correct terms to avoid any further confusion.

A Zambezi node consists of: Four compute units and eight cores.
Each compute unit contains two cores.

In some of the slides a compute unit was called a module; however, that's not the official term.

Well, in the patent papers they call the former a core and the latter an integer cluster. Not so surprisingly, I may add.
Also, some people just refuse to call the latter a core anyway, because that's more marketing than engineering.
I would use compute unit or CU for the former and integer cluster for the latter.
(Although, I find the "compute unit" a little laboured and awkward.)

**The Stilt** · 10-14-2011, 08:29 PM

Originally Posted by dess

So you mean one that huddles threads together as much as possible to enable max. turbo, right? But we want the opposite, as the findings show it's more beneficial to populate only one integer cluster per CU... So, really we only need a rather small patch that enables the SMT-aware scheduling Win7 already knows, as some say.

I'll give you an example of what currently happens:

An FX-8150 (Turbo Core enabled) is running at stock settings, and SuperPI (single-threaded software) is being executed:

Because there is load on only one thread (in theory), the Turbo Core feature boosts a couple of cores up to 4.2GHz while the rest operate at 3.6 - 3.9GHz. What currently happens is that Windows is unable to keep the load (SuperPI) on the boosted core (4.2GHz) and instead throws the load between the cores.
And executing such a program (or any program, actually) is naturally faster when it is executed on a core operating at 4.2GHz rather than on one operating at 3.6-3.9GHz.
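
A rough way to picture the cost of that migration is to compare the clock a pinned thread sees with the time-averaged clock of a thread that hops between cores. This is only a sketch: the per-core clocks below are made-up illustrative values inside the 3.6 to 4.2GHz range mentioned above, and it assumes run time scales inversely with clock, ignoring memory-bound behaviour and the overhead of the migration itself.

```python
# Hypothetical snapshot of per-core clocks (GHz) on an 8-core part with
# Turbo Core active: two boosted cores, the rest at lower clocks.
# These values are illustrative assumptions, not measurements.
core_clocks_ghz = [4.2, 4.2, 3.9, 3.9, 3.6, 3.6, 3.6, 3.6]


def effective_clock(clocks, time_share):
    """Time-weighted average clock seen by a single thread."""
    assert abs(sum(time_share) - 1.0) < 1e-9, "time shares must sum to 1"
    return sum(c * t for c, t in zip(clocks, time_share))


# Case 1: the thread stays on one boosted core the whole time.
pinned = effective_clock(core_clocks_ghz, [1, 0, 0, 0, 0, 0, 0, 0])

# Case 2: the scheduler spreads the thread evenly over all eight cores.
bounced = effective_clock(core_clocks_ghz, [1 / 8] * 8)

print(f"pinned to a boosted core: {pinned:.3f} GHz")
print(f"spread evenly over cores: {bounced:.3f} GHz")
print(f"clock lost to migration:  {100 * (1 - bounced / pinned):.1f}%")
```

With different assumed clocks the gap changes, but a thread that is not kept on a boosted core can never average more than the boosted clock.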

**dess** · 10-14-2011, 08:52 PM

I know that, and this idiotic behaviour needs to be addressed as well, of course. I'm just not sure it's going to happen for Win7, too. In the short term the smaller patch would also be fine, as it would boost the performance of lightly threaded apps, like games, even without max. turbo.

Originally Posted by The Stilt

And executing such a program (or any program, actually) is naturally faster when it is executed on a core operating at 4.2GHz rather than on one operating at 3.6-3.9GHz.

Not really any program. It's considerably faster to execute a program with 4 threads on 4 CUs, one cluster per CU, even at stock frequency (though usually all-core turbo can kick in), than on 2 CUs, two clusters per CU, at max. turbo. Just see the findings across the topic!

**BeepBeep2** · 10-14-2011, 09:04 PM

Originally Posted by dess

That's fine (except your wording is inaccurate). Somehow we should trick it to use this method for BD, as well...

It depends on whether the penalty of forcing those closely related threads to communicate through L3 (instead of L2) is more or less than the gain from not sharing resources. It seems most applications only benefit from it:

Attachment 121261

So, there could be a small patch that simply enables SMT-style scheduling in Win7, which it already supports (if true)...

Quoted from the article:

And so the default behaviour will be separation (contrary to what JF said all along)? It would be just stupid if not... Of course, power consumption is higher because more modules are active, but here we can also see that with turbo enabled the energy efficiency is really the same...

Well, unless there is a fix coming (HW or SW or both) that largely reduces the penalty of sharing resources. The current numbers are much worse (anywhere between 95% and 160%, with one case of 180%) than what they've advertised (180% across the board), so one can think there is some flaw somewhere here as well. (And there is indeed the case of L1D thrashing, which they claim is responsible for only a 3% decrease.)

What diagram? Do you mean this? Which part of it?

Do you mean, if we disable every other "core" in the BIOS? Then no, you will get this:
Core (Module) 0 - one cluster
Core (Module) 1 - one cluster
Core (Module) 2 - one cluster
Core (Module) 3 - one cluster

ps. perhaps the title of the thread should be changed to "Thread separation vs. turbo", or something like that, to be more meaningful.

Sorry, I should have typed out:
Core 0 - one cluster
Core 1 - disabled
Core 2 - one cluster
Core 3 - disabled
Core 4 - one cluster
Core 5 - disabled
Core 6 - one cluster
Core 7 - disabled

...but I thought everyone would be smart enough to get the picture.

The Stilt is also correct about module vs CU, compute/computational unit is the correct term.

**dess** · 10-14-2011, 09:21 PM

Originally Posted by BeepBeep2

Sorry, I should have typed out:
Core 0 - one cluster
Core 1 - disabled
Core 2 - one cluster
Core 3 - disabled
Core 4 - one cluster
Core 5 - disabled
Core 6 - one cluster
Core 7 - disabled

...but I thought everyone would be smart enough to get the picture.

I thought you were calling a CU a core (just like others before you), not just because you listed four of them twice, but because you used terms like "shared" and "one cluster" next to them. Why call something by two names, anyway?

Now, care to elaborate for the stupid like me what it all means and what diagram:

AMD's diagram
Core 0 - shared
Core 1 - one cluster
Core 2 - shared
Core 3 - one cluster (uses all resources for 1 thread)

The Stilt is also correct about module vs CU, compute/computational unit is the correct term.

Effectively everybody uses the term Module here, but fine, let it be "Compute Unit"... (No, not Computational Unit, while it means the same, this term is not used in this form.)

**BeepBeep2** · 10-15-2011, 11:00 AM

Originally Posted by dess

I thought you were calling a CU a core (just like others before you), not just because you listed four of them twice, but because you used terms like "shared" and "one cluster" next to them. Why call something by two names, anyway?

Now, care to elaborate for the stupid like me what it all means and what diagram:

Effectively everybody uses the term Module here, but fine, let it be "Compute Unit"... (No, not Computational Unit, while it means the same, this term is not used in this form.)

That image shows Thread 1 (a/b) sharing a CU while Thread 2 has its own CU, does it not?
By disabling every other core in BIOS, Thread 1 runs on its own CU, Thread 2 is disabled.
That is what I was describing. There is no Thread 2 a/b in the diagram, despite the whole article talking about how AMD wants threads to share modules and benefit from higher p-states.

I would also like to add that though I never called you stupid, I find your post quite rude and very demeaning.

**dess** · 10-15-2011, 05:09 PM

Originally Posted by BeepBeep2

That image shows Thread 1 (a/b) sharing a CU while Thread 2 has its own CU, does it not?

Yes.

By disabling every other core in BIOS, Thread 1 runs on its own CU, Thread 2 is disabled.

No. That way all of Thread 1a, Thread 1b and Thread 2 would run on a separate CU.
(You can't "disable" a SW thread this way, anyway. If there are fewer cores enabled than threads to run, the threads will simply share the available cores more.)

There is no Thread 2 a/b in the diagram despite the whole article talking about how AMD wants threads to share modules and benefit from higher p-states.

Why should there be a Thread 2b? SW threads are a given. The diagram represents a situation where there are three threads to run. Two of them are closely related (Thread 1a and 1b), i.e. working on the same dataset. The 3rd is a separate one, Thread 2. And then there are two cases regarding core/CU utilization, one being sub-optimal and the other optimal (shown above). The latter is optimal because the related threads share a CU (and so can share data in the L2 cache), while the separate one can have a whole CU to itself, and the unneeded CUs can go to sleep, enabling a higher turbo mode for the first two CUs.

Now, according to the findings, running all three of these threads simply on separate CUs (so every other core [cluster] disabled) would in most cases still be better than allowing them to share CUs(*) in order to limit the number of CUs used and so get higher turbo. Because by not sharing you usually win more than with turbo... (This shouldn't be the case if everything worked as planned, or at least as marketed, but still, it is.)

* At least if done in the wrong way (Thread 2 and Thread 1a/1b in one CU). But it's possible it's true even if it's done in the "optimal" way. It needs further tests to tell.

I would also like to add that though I never called you stupid, I find your post quite rude and very demeaning.

Well, if I were you, I would have apologized for being confusing instead of writing what you wrote there.

**Manicdan** · 10-15-2011, 06:52 PM

i think this is going to get a very mixed opinion if you're thinking about client vs server applications

2 threads sharing a CU is going to be better for servers, since that second thread is getting a pretty generous bonus for less additional power consumption.

for the client side we have to keep in mind what happens when the cpu isn't running at 100% load. those same few threads sharing CUs would only be worth it if turbo was able to compensate. when the speedup was 1.8x, a 10% bonus to clocks would work out fine. but since it seems we only get about 1.5x, we need turbo to increase speed by 25% for it to work out, and that is not going to happen no matter how much the process matures. so for client apps that use about 4 threads or less, the optimal answer seems to be to split everything up (unless you're trying to somehow save power, and it is quite rare for desktops to think about ~10% efficiency).
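
The break-even arithmetic can be sketched in a few lines. The model here is an assumption, not something measured in this thread: two threads sharing a CU deliver `cmt_scaling` times the throughput of one thread alone, so each thread runs at `cmt_scaling / 2` of its unshared speed, and speed is taken to scale linearly with clock. The function name `breakeven_boost` is made up for illustration.

```python
def breakeven_boost(cmt_scaling):
    """Clock increase needed so two threads sharing a CU match the same
    two threads on separate CUs, assuming speed scales linearly with clock."""
    per_thread_speed = cmt_scaling / 2.0   # relative to a thread with its own CU
    return 1.0 / per_thread_speed - 1.0


# Stock and max turbo clocks quoted earlier in the thread for the FX-8150.
available_boost = 4.2 / 3.6 - 1.0

for scaling in (1.8, 1.5):
    needed = breakeven_boost(scaling)
    verdict = "covered" if available_boost >= needed else "not covered"
    print(f"{scaling:.1f}x scaling: need {100 * needed:.1f}% more clock, "
          f"turbo offers {100 * available_boost:.1f}% -> {verdict}")
```

Under this model the required boost comes out at roughly 11% for 1.8x and roughly 33% for 1.5x, a bit above the rounded 10% and 25% figures, because undoing a 25% loss takes a 33% gain. The conclusion is the same either way: the step from 3.6GHz to 4.2GHz is about 17%, enough for the first case and not for the second.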

i am curious about the whole 1a and 1b thread thing, where threads want to share resources. i assume this means they use the same L2 to their advantage, but i'm curious whether software can determine that it wants to do that, and whether windows can then give it the right thread assignments to accomplish it. and even if it did, how much of a bonus would there be? if the end result is just a little less L2 being used, then it could never be better than independent CUs when looking at it purely from performance. but if the data sharing increases code efficiency, then there could be a lot more perf. but i don't really know much about this kind of stuff

**chew\*** · 10-15-2011, 07:03 PM

Originally Posted by The Stilt

Indeed there should be an update coming for Windows which optimizes Turbo Core functionality on Zambezi.
Currently Windows (7 at least) is throwing the load from core to core, which sometimes neutralizes the effect of the Turbo Core feature.
This is because the load is not being run on the currently boosted core(s).

But guys... please...

When talking about Zambezi please use the correct terms to avoid any further confusion.

A Zambezi node consists of: Four compute units and eight cores.
Each compute unit contains two cores.

In some of the slides a compute unit was called a module; however, that's not the official term.

This is AMD's patent for BD.

It indicates that in a node there are 4 cores and 8 compute units, or clusters. It also details how the clusters share resources.

Disabling "cores" (actually clusters 150B) 1, 3, 5 and 7 in the BIOS effectively disables resource sharing.

[Attached image: BD.jpg]
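
The effect of switching off the odd-numbered "cores" can be sketched as a simple mapping. It assumes the numbering implied above, where cores 2k and 2k+1 are the two clusters of the same compute unit; the helper names are made up for illustration.

```python
CORES_PER_CU = 2  # two integer clusters ("cores") per compute unit


def active_clusters_per_cu(enabled_cores, total_cores=8):
    """Count enabled cores in each compute unit, with core n in CU n // 2."""
    counts = [0] * (total_cores // CORES_PER_CU)
    for core in enabled_cores:
        counts[core // CORES_PER_CU] += 1
    return counts


def affinity_mask(enabled_cores):
    """Bitmask with bit n set for every enabled core n."""
    mask = 0
    for core in enabled_cores:
        mask |= 1 << core
    return mask


all_cores = list(range(8))
even_cores = [0, 2, 4, 6]  # cores 1, 3, 5 and 7 disabled

print(active_clusters_per_cu(all_cores))    # [2, 2, 2, 2]: every CU is shared
print(active_clusters_per_cu(even_cores))   # [1, 1, 1, 1]: no CU is shared
print(hex(affinity_mask(even_cores)))       # 0x55
```

The same bitmask idea applies if the restriction is done per process with an affinity mask instead of in the BIOS, under the same numbering assumption.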

**zhadoom** · 10-15-2011, 07:46 PM

Disabling "cores" (actually clusters 150B) 1, 3, 5 and 7 in the BIOS effectively disables resource sharing.

Can I suppose that the 256-bit FPU remains fully active? Then the FP performance would not be directly affected.

**dess** · 10-16-2011, 04:02 AM

Originally Posted by Manicdan

i am curious about the whole 1a and 1b thread thing, where threads want to share resources. i assume this means they use the same L2 to their advantage, but i'm curious whether software can determine that it wants to do that, and whether windows can then give it the right thread assignments to accomplish it. and even if it did, how much of a bonus would there be? if the end result is just a little less L2 being used, then it could never be better than independent CUs when looking at it purely from performance. but if the data sharing increases code efficiency, then there could be a lot more perf. but i don't really know much about this kind of stuff

I wonder how Win8 judges which approach to choose "real-time". Perhaps it simply couples child-threads with their mother thread? That's not always good.

Anyway, if they're not going to implement it in Win7, as well, I hope at least they will enable the SMT-aware approach for BD (if Win7 really supports it already), that's still much better than the default one.

Originally Posted by zhadoom

Can I suppose that the 256-bit FPU remains fully active?

Most probably, but worth a test.

Then the FP performance would not be directly affected.

Actually, it raises FP performance, as well, for less-threaded apps.

**demonkevy666** · 10-16-2011, 09:29 AM

Originally Posted by dess

I wonder how Win8 judges which approach to choose "real-time". Perhaps it simply couples child-threads with their mother thread? That's not always good.

Anyway, if they're not going to implement it in Win7, as well, I hope at least they will enable the SMT-aware approach for BD (if Win7 really supports it already), that's still much better than the default one.

Most probably, but worth a test.

Actually, it raises FP performance, as well, for less-threaded apps.

"Real time" is mean to CPUs: you can't Alt-Tab out of a game if you set it in AOD. I just tried it on my Fable 3 and GTA IV, and I was stuck in the game until I quit.
It can stop keyboard and mouse movement.

**dess** · 10-16-2011, 01:12 PM

Originally Posted by demonkevy666

"Real time" is mean to CPUs: you can't Alt-Tab out of a game if you set it in AOD. I just tried it on my Fable 3 and GTA IV, and I was stuck in the game until I quit.
It can stop keyboard and mouse movement.

I didn't mean the task priority option in Windows; I used the term in the sense of something happening on the fly. Sorry if I wasn't clear enough.

**chew\*** · 10-16-2011, 01:21 PM

Well, the plot thickens. 23 chips later I came to a conclusion: while this sounds good on paper, it's exactly why AMD won't release a 4-core disabled this way.

Feel free to test this yourself, but after testing I found that you will take a significant clock hit, so much that it defeats the gained efficiency...

**Oese** · 10-16-2011, 01:40 PM

what exactly do you mean by clock hit? i don't think that you will get lower max clocks? how can that be?

or maybe you mean that more clock is needed, so that it eats the efficiency?

**Mechanical Man** · 10-16-2011, 01:44 PM

Originally Posted by chew*

Well, the plot thickens. 23 chips later I came to a conclusion: while this sounds good on paper, it's exactly why AMD won't release a 4-core disabled this way.

Feel free to test this yourself, but after testing I found that you will take a significant clock hit, so much that it defeats the gained efficiency...

So are you basically saying that one int "core" in a module makes almost no difference to power usage?

**BeepBeep2** · 10-16-2011, 01:44 PM

Originally Posted by Oese

what exactly do you mean by clock hit? i don't think that you will get lower max clocks? how can that be?

or maybe you mean that more clock is needed, so that it eats the efficiency?

No, he means that you get less clocks with 4 than 8.

**zhadoom** · 10-16-2011, 01:49 PM

Originally Posted by chew*

Well, the plot thickens. 23 chips later I came to a conclusion: while this sounds good on paper, it's exactly why AMD won't release a 4-core disabled this way.

Feel free to test this yourself, but after testing I found that you will take a significant clock hit, so much that it defeats the gained efficiency...

If I understand what you are trying to say, this will reduce the cases where max turbo (4.2GHz) kicks in. The major effect will be in 2-thread apps, because they will use two cores in different compute units (modules, or whatever they are called).

**Mechanical Man** · 10-16-2011, 01:51 PM

Originally Posted by BeepBeep2

No, he means that you get less clocks with 4 than 8.

I thought he was comparing to the 2CU FX-4100: basically, with 2CU you get so much higher clocks that it greatly outweighs the gain of 4CU with the second int core disabled.

**Oese** · 10-16-2011, 01:58 PM

If only turbo were reduced, i would not mind at all, as long as i could disable it.

from what i hear, turbo is very unpredictable and bugged; sometimes cores clock below stock even when fully loaded and with c&q off (which doesn't seem to work, at least on some boards..)

max oc should anyhow be the same or higher with cores disabled? if not, that would be very strange..

**dess** · 10-16-2011, 03:03 PM

Originally Posted by chew*

Well, the plot thickens. 23 chips later I came to a conclusion: while this sounds good on paper, it's exactly why AMD won't release a 4-core disabled this way.

We were mostly talking about the case of lightly threaded apps on an 81xx, and it definitely works there.

Feel free to test this yourself, but after testing I found that you will take a significant clock hit, so much that it defeats the gained efficiency...

According to findings here and some other places, there is roughly a 10% performance advantage with 4M/4C at 3.6 GHz, compared to 2M/4C at 4.2 GHz.
In other words, even if the 4M/4C part would work only at 3.6 GHz, it would usually still be faster than a current 4170.
Of course, it more or less depends on the given application. What were your results?
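
As a sanity check, that 10% figure can be turned back into an implied scaling factor for two threads sharing a module. This is a sketch under a simple assumption: performance is proportional to thread count times clock times a per-thread efficiency, which is 1.0 for a thread with a module to itself. The function name `implied_cmt_scaling` is made up for illustration.

```python
def implied_cmt_scaling(clock_separate, clock_shared, advantage):
    """Two-thread-per-module scaling implied by a measured advantage of
    the one-thread-per-module setup, assuming linear scaling with clock.

    advantage = 0.10 means the separate setup is 10% faster overall.
    """
    # separate: n * clock_separate * 1.0
    # shared:   n * clock_shared * eff
    # separate / shared = 1 + advantage  ->  solve for eff
    per_thread_eff = clock_separate / (clock_shared * (1.0 + advantage))
    return 2.0 * per_thread_eff


scaling = implied_cmt_scaling(clock_separate=3.6, clock_shared=4.2, advantage=0.10)
print(f"implied two-thread scaling per module: {scaling:.2f}x")
```

Under these assumptions the result lands close to the roughly 1.5x speed-up discussed earlier in the thread, rather than the marketed 1.8x.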

---

[snip]

**Manicdan** · 10-16-2011, 07:13 PM

Originally Posted by dess

You've got it wrong, dude.

100/90=1.111, but it's not 11%, but 90%! So, you should get the reciprocals, and that will show you the percentages.

Chess 11800/8813=1.3389 -> 74.7%
Wprime 13.814/9.531=1.4494 -> 69%
Winrar 4467/3027=1.4757 -> 67.7%
3d06 5803/4134=1.4037 -> 71.4%
3dvantage 19215/12102=1.5878 -> 63%
3d11 6340/4289=1.4782 -> 67.6%
CB R10 20552/15033=1.3671 -> 73.1%
CB R11.5 6/3.8=1.5789 -> 63.3%
Blender 9.76/7.16=1.3631 -> 73.3%
X264 37.23/25.18=1.4786 -> 67.6%

(These are the 4CU/8C vs. 4CU/4C numbers!)

i really don't follow your math.
if 4CU/8T gets 100 and 4CU/4T gets 90, that's a speedup of 11% through CMT. saying that things are 90% slower by turning off alternating cores just makes a confusing statement, even if true.

i think his numbers were right, CMT is giving us 34-59% speedup
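
The range can be re-derived from the pairs in the quoted list. The sketch below takes each pair exactly as listed (4CU/8C value first, 4CU/4C value second) and does not check whether a given benchmark is higher-is-better or a run time; the variable names are made up for illustration.

```python
# (4CU/8C value, 4CU/4C value) pairs as listed in the quoted post.
results = {
    "Chess": (11800, 8813),
    "Wprime": (13.814, 9.531),
    "Winrar": (4467, 3027),
    "3d06": (5803, 4134),
    "3dvantage": (19215, 12102),
    "3d11": (6340, 4289),
    "CB R10": (20552, 15033),
    "CB R11.5": (6, 3.8),
    "Blender": (9.76, 7.16),
    "X264": (37.23, 25.18),
}

# Ratio of the first value to the second, as in the list.
speedups = {name: a / b for name, (a, b) in results.items()}

for name, ratio in speedups.items():
    print(f"{name:10s} {ratio:.4f}  (+{100 * (ratio - 1):.1f}%)")

lowest = min(speedups.values())
highest = max(speedups.values())
print(f"range: +{100 * (lowest - 1):.0f}% to +{100 * (highest - 1):.0f}%")
```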

as for chew's point, from a stock chip perspective of 4CU/4T vs 2CU/4T, i think it is a little wrong.
there was a perf/power consumption test, and with turbo they all came out near identical. i'll try and find the chart.

EDIT: this chart

**dess** · 10-16-2011, 07:40 PM

Manicdan: You're right, my mistake. I was playing with numbers that were about relative performance before, and when I saw this list it escaped my attention that these are speed-up numbers, so rightfully the opposite. I guess I need some sleep.

(Edited out that part of my message.)
