AMD FX "Bulldozer" Review - (4) !exclusive! Excuse for 1-Threaded Perf.

**bondhahnmrt85** · 10-13-2011, 02:37 AM

Maybe the scheduler needs a patch to change core priority.

Now : Core 0 --> Core 1 --> Core 2 --> Core 3 --> Core 4 --> Core 5 --> Core 6 --> Core 7
Right priority : Core 0 --> Core 2 --> Core 4 --> Core 6 --> Core 1 --> Core 3 --> Core 5 --> Core 7

CMIIW
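A minimal sketch of the proposed ordering, assuming cores 2k and 2k+1 sit in the same module. The function name `module_first_order` is made up for illustration and is not part of any real scheduler:

```python
def module_first_order(num_cores: int, cores_per_module: int = 2) -> list[int]:
    """Return core indices so one core per module is used before any sibling core.

    Assumption: cores are numbered so that siblings in a module are adjacent
    (0/1, 2/3, ...), as the ordering in the post implies.
    """
    order = []
    for sibling in range(cores_per_module):
        # First pass picks cores 0, 2, 4, ...; second pass picks 1, 3, 5, ...
        order.extend(range(sibling, num_cores, cores_per_module))
    return order


print(module_first_order(8))  # [0, 2, 4, 6, 1, 3, 5, 7]
```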

**Tao~** · 10-13-2011, 04:11 AM

@Thread starter: Good comparison. Could you also show Thuban numbers at 4C and the same clock? Also, is turbo disabled on all of them? (It would be more accurate that way.)

PS: Where is the thank-you button?

**Daveburt714** · 10-13-2011, 08:52 AM

Originally Posted by bondhahnmrt85

Maybe the scheduler needs a patch to change core priority.

Now : Core 0 --> Core 1 --> Core 2 --> Core 3 --> Core 4 --> Core 5 --> Core 6 --> Core 7
Right priority : Core 0 --> Core 2 --> Core 4 --> Core 6 --> Core 1 --> Core 3 --> Core 5 --> Core 7

CMIIW

I don't have a lot of knowledge about the inner workings of a CPU, but to a layman this sounds brilliant and should be easy enough to implement.
Thoughts?

**pumero** · 10-13-2011, 01:44 PM

Originally Posted by bondhahnmrt85

Maybe the scheduler needs a patch to change core priority.

Now : Core 0 --> Core 1 --> Core 2 --> Core 3 --> Core 4 --> Core 5 --> Core 6 --> Core 7
Right priority : Core 0 --> Core 2 --> Core 4 --> Core 6 --> Core 1 --> Core 3 --> Core 5 --> Core 7

CMIIW

Windows 7 is already handling things like this for Intel processors with HT, using real cores first and logical cores later.

However, according to AMD there are situations where you don't even want this behavior.
Take a look at the first two pictures at THG:
http://www.tomshardware.co.uk/fx-815...-32295-23.html

Because of the shared L1 cache, it does make sense that in some cases using a whole module is faster than splitting the work across two partially used modules. This means the scheduler has to be more intelligent, though: it is not enough to assign each new task to a new core as it does now. Instead, it must be able to guess which tasks should be grouped onto one module and which should be split across two (or more) modules.

I'm no coder but I can imagine easier projects than making the scheduler aware of such a complex problem.
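To make the grouping idea concrete, here is a rough sketch of such a placement policy. It assumes the hard part is already solved, namely that each task comes with a label saying which tasks are related. The function name, the task names and the labels are all hypothetical:

```python
from collections import defaultdict


def place_threads(threads, num_modules=4):
    """Group related threads onto one module, unrelated ones onto separate modules.

    threads: list of (name, group) pairs; threads with the same group label
    are assumed to work on the same data. Returns {module_index: [names]}.
    """
    groups = defaultdict(list)
    for name, group in threads:
        groups[group].append(name)

    placement = {}
    module = 0
    for members in groups.values():
        # Pack related threads two per module; a lone thread gets a module to itself.
        for i in range(0, len(members), 2):
            if module >= num_modules:
                raise ValueError("more work than modules; a real scheduler would time-slice")
            placement[module] = members[i:i + 2]
            module += 1
    return placement


print(place_threads([("render-a", "render"), ("render-b", "render"), ("audio", "audio")]))
# {0: ['render-a', 'render-b'], 1: ['audio']}
```

A real scheduler has no such labels and would have to infer relatedness at run time, which is exactly the difficult part.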

**BeepBeep2** · 10-13-2011, 01:50 PM

Originally Posted by pumero

Windows 7 is already handling things like this for Intel processors with HT, using real cores first and logical cores later.

However, according to AMD there are situations where you don't even want this behavior.
Take a look at the first two pictures at THG:
http://www.tomshardware.co.uk/fx-815...-32295-23.html

Because of the shared L1 cache, it does make sense that in some cases using a whole module is faster than splitting the work across two partially used modules. This means the scheduler has to be more intelligent, though: it is not enough to assign each new task to a new core as it does now. Instead, it must be able to guess which tasks should be grouped onto one module and which should be split across two (or more) modules.

I'm no coder but I can imagine easier projects than making the scheduler aware of such a complex problem.

There is no real or logical core in BD.
There are clusters, simple as that.
When you disable a cluster in BIOS, you do the same thing as AMD's diagram.

AMD's diagram
Core 0 - shared
Core 1 - one cluster
Core 2 - shared
Core 3 - one cluster (uses all resources for 1 thread)

What we're doing
Core 0 - one cluster
Core 1 - disabled
Core 2 - one cluster
Core 3 - disabled

**pumero** · 10-13-2011, 01:58 PM

I'm aware of that and I never said that it works like that on BD.
At the moment Windows sees the processor as having 8 real cores and assigns the tasks accordingly but doesn't care (know) about the whole module thing.
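Until the scheduler knows about modules, one manual workaround is to pin a program to one core per module with an affinity mask. A small sketch of how such a mask could be computed, assuming cores 0/1, 2/3, etc. share a module (the helper name is hypothetical):

```python
def one_core_per_module_mask(num_cores: int = 8, cores_per_module: int = 2) -> int:
    """Build a bitmask with one bit set per module (the first core of each)."""
    mask = 0
    for core in range(0, num_cores, cores_per_module):
        mask |= 1 << core  # bit N set means "core N is allowed"
    return mask


print(hex(one_core_per_module_mask()))  # 0x55 -> cores 0, 2, 4, 6
```

On Windows, a hex mask like this is the form that `start /affinity` takes. Whether pinning actually helps would depend on the workload.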

**dess** · 10-14-2011, 04:57 PM

Originally Posted by pumero

Windows 7 is already handling things like this for Intel processors with HT, using real cores first and logical cores later.

That's fine (except your wording is inaccurate). Somehow we should trick it into using this method for BD as well...

However, according to AMD there are situations where you don't even want this behavior.

It depends on whether the penalty of forcing closely related threads to communicate through L3 (instead of L2) is greater or smaller than the gain from not sharing resources. It seems most applications only benefit from it:

img0033832.gif

So there could be a small patch that simply enables SMT-style scheduling in Win7, which it already supports (if true)...

Quoted from the article:

According to AMD, Windows 8 will more intelligently align threads so that, when they can benefit from sharing a module, they will. The implication is that when two threads can be consolidated onto one module (despite the fact that they’re forced to share resources), putting an entire module to sleep and potentially enabling a higher p-state (a faster Turbo Core setting) outweighs any performance penalty tied to sharing.

And so the default behaviour will be separation (contrary to what JF said all along)? It would be just stupid if not... Of course, power consumption is higher because more modules are active, but here we can also see that with turbo enabled the energy efficiency is really the same...

Well, unless there is a fix coming (HW or SW or both) that greatly reduces the penalty of sharing resources. The current numbers are much worse (anywhere from 95% to 160%, with one case of 180%) than what they've advertised (180% across the board), so one can suspect there is a flaw somewhere here as well. (And there is indeed the case of L1D thrashing, which they claim is responsible for only a 3% decrease.)
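The trade-off can be put into rough numbers. The sketch below reads the percentages as the throughput of two threads sharing one module relative to one thread alone, and assumes that performance scales linearly with clock and that two separate modules give exactly 2x. Those are simplifying assumptions, not measurements:

```python
def required_turbo_ratio(scaling: float) -> float:
    """Clock ratio (shared-module clock / separate-module clock) needed to break even.

    Shared:   scaling * f_turbo   (two threads on one module, higher clock)
    Separate: 2.0 * f_base        (two threads on two modules, lower clock)
    Break-even when scaling * f_turbo == 2.0 * f_base.
    """
    return 2.0 / scaling


for scaling in (0.95, 1.60, 1.80):
    ratio = required_turbo_ratio(scaling)
    print(f"scaling {scaling:.2f}x -> turbo must be {ratio:.2f}x the base clock")
```

Under these assumptions, even the advertised 180% needs about an 11% higher clock just to match separation, and 160% needs 25%.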

Originally Posted by BeepBeep2

When you disable a cluster in BIOS, you do the same thing as AMD's diagram.

What diagram? Do you mean this? Which part of it?

What we're doing
Core 0 - one cluster
Core 1 - disabled
Core 2 - one cluster
Core 3 - disabled

Do you mean, if we disable every other "core" in the BIOS? Then no, you will get this:
Core (Module) 0 - one cluster
Core (Module) 1 - one cluster
Core (Module) 2 - one cluster
Core (Module) 3 - one cluster

ps. perhaps the title of the thread should be changed to "Thread separation vs. turbo", or something like that, to be more meaningful.

**BeepBeep2** · 10-14-2011, 09:04 PM

Originally Posted by dess

That's fine (except your wording is inaccurate). Somehow we should trick it into using this method for BD as well...

It depends on whether the penalty of forcing closely related threads to communicate through L3 (instead of L2) is greater or smaller than the gain from not sharing resources. It seems most applications only benefit from it:

img0033832.gif

So there could be a small patch that simply enables SMT-style scheduling in Win7, which it already supports (if true)...

Quoted from the article:

And so the default behaviour will be separation (contrary to what JF said all along)? It would be just stupid if not... Of course, power consumption is higher because more modules are active, but here we can also see that with turbo enabled the energy efficiency is really the same...

Well, unless there is a fix coming (HW or SW or both) that greatly reduces the penalty of sharing resources. The current numbers are much worse (anywhere from 95% to 160%, with one case of 180%) than what they've advertised (180% across the board), so one can suspect there is a flaw somewhere here as well. (And there is indeed the case of L1D thrashing, which they claim is responsible for only a 3% decrease.)

What diagram? Do you mean this? Which part of it?

Do you mean, if we disable every other "core" in the BIOS? Then no, you will get this:
Core (Module) 0 - one cluster
Core (Module) 1 - one cluster
Core (Module) 2 - one cluster
Core (Module) 3 - one cluster

ps. perhaps the title of the thread should be changed to "Thread separation vs. turbo", or something like that, to be more meaningful.

Sorry, I should have typed out:
Core 0 - one cluster
Core 1 - disabled
Core 2 - one cluster
Core 3 - disabled
Core 4 - one cluster
Core 5 - disabled
Core 6 - one cluster
Core 7 - disabled

...but I thought everyone would be smart enough to get the picture.

The Stilt is also correct about module vs CU; compute/computational unit is the correct term.

**dess** · 10-14-2011, 09:21 PM

Originally Posted by BeepBeep2

Sorry, I should have typed out:
Core 0 - one cluster
Core 1 - disabled
Core 2 - one cluster
Core 3 - disabled
Core 4 - one cluster
Core 5 - disabled
Core 6 - one cluster
Core 7 - disabled

...but I thought everyone would be smart enough to get the picture.

I thought you were calling a CU a core (just like others before you), not just because you listed four of them twice, but because you used terms like "shared" and "one cluster" next to them. Why call something by two names, anyway?

Now, care to elaborate, for the stupid like me, on what it all means and which diagram you mean:

AMD's diagram
Core 0 - shared
Core 1 - one cluster
Core 2 - shared
Core 3 - one cluster (uses all resources for 1 thread)

The Stilt is also correct about module vs CU; compute/computational unit is the correct term.

Effectively everybody uses the term Module here, but fine, let it be "Compute Unit"... (No, not "Computational Unit"; while it means the same, the term is not used in that form.)

**BeepBeep2** · 10-15-2011, 11:00 AM

Originally Posted by dess

I thought you were calling a CU a core (just like others before you), not just because you listed four of them twice, but because you used terms like "shared" and "one cluster" next to them. Why call something by two names, anyway?

Now, care to elaborate, for the stupid like me, on what it all means and which diagram you mean:

Effectively everybody uses the term Module here, but fine, let it be "Compute Unit"... (No, not "Computational Unit"; while it means the same, the term is not used in that form.)

That image shows Thread 1 (a/b) sharing a CU while Thread 2 has its own CU, does it not?
By disabling every other core in BIOS, Thread 1 runs on its own CU, Thread 2 is disabled.
That is what I was describing. There is no Thread 2 a/b in the diagram despite the whole article talking about how AMD wants threads to share modules and benefit from higher p-states.

I would also like to add that though I never called you stupid, I find your post quite rude and very demeaning.

**dess** · 10-15-2011, 05:09 PM

Originally Posted by BeepBeep2

That image shows Thread 1 (a/b) sharing a CU while Thread 2 has its own CU, does it not?

Yes.

By disabling every other core in BIOS, Thread 1 runs on its own CU, Thread 2 is disabled.

No. That way Thread 1a, Thread 1b and Thread 2 would each run on a separate CU.
(You can't "disable" a SW thread this way, anyway. If fewer cores are enabled than there are threads to run, the threads will simply share the available cores more.)

There is no Thread 2 a/b in the diagram despite the whole article talking about how AMD wants threads to share modules and benefit from higher p-states.

Why should there be a Thread 2b? SW threads are given. The diagram represents a situation where there are three threads to run. Two of them are closely related (Thread 1a and 1b), i.e. working on the same dataset. The third is a separate one, Thread 2. Then there are two cases regarding core/CU utilization, one sub-optimal and the other optimal (shown above). The latter is optimal because the related threads share a CU (and so can share data in the L2 cache), while the separate one gets a whole CU to itself, and the unneeded CUs can go to sleep, enabling a higher turbo mode for the first two CUs.

Now, according to the findings, simply running all three threads on separate CUs (so every other core [cluster] disabled) would in most cases still be better than letting them share CUs(*) in order to limit the number of CUs in use and so get a higher turbo. With unsharing you usually win more than with turbo... (This shouldn't be the case if everything worked as planned, or at least as marketed, but still, it is.)

* At least if done in the wrong way (Thread 2 and Thread 1a/1b in one CU). But it's possible it's true even if it's done in the "optimal" way. It needs further tests to tell.
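The three placements discussed here can be laid out side by side. This is only a bookkeeping sketch on a 4-CU part; it counts active and shared CUs and checks whether the related threads end up in the same CU, and it does not estimate performance. The names `RELATED` and `describe` are made up for illustration:

```python
# Threads working on the same dataset, per the diagram's Thread 1a / 1b.
RELATED = [frozenset({"1a", "1b"})]


def describe(placement):
    """placement: list of CUs, each a list of the thread names running on it."""
    active = sum(1 for cu in placement if cu)            # CUs that cannot sleep
    shared = sum(1 for cu in placement if len(cu) == 2)  # CUs running two threads
    # True only if every related pair sits in one CU (and can share data in L2).
    related_together = all(
        any(pair <= set(cu) for cu in placement) for pair in RELATED
    )
    return {"active_cus": active, "shared_cus": shared, "related_share_l2": related_together}


placements = {
    "optimal (per the diagram)": [["1a", "1b"], ["2"], [], []],
    "sub-optimal": [["1a", "2"], ["1b"], [], []],
    "fully separated": [["1a"], ["1b"], ["2"], []],
}
for name, placement in placements.items():
    print(name, describe(placement))
```

The first two cases keep two CUs awake, while full separation keeps three awake but shares nothing, which is the case the test results seem to favour.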

I would also like to add that though I never called you stupid, I find your post quite rude and very demeaning.

Well, if I were you, I would have apologized for being confusing instead of writing that.
