@Thread starter: Good comparison. Could you also show Thuban numbers at 4C and the same clock? Also, is turbo disabled on all of them (it would be more accurate that way)?
PS: where is the thank you button? :p:
Quite possible, and I hope so. There shouldn't be this much of an impact from the sharing, and such a gain from the lack of it... It's CMT, not SMT, after all.
Although it could also be that, by deactivating every second integer cluster, power consumption went down and so the all-cores turbo could kick in, contributing significantly to the results. So I wonder whether the turbo mode was disabled or not.
EDIT: here is a similar test (Google-translated): http://translate.google.com/translat...-2%3Fstart%3D5
No, with CMT you can't give one thread all of the resources.
Do you mean disabling the second thread on one core? Yes, with SMT the first thread will gain a lot.
Quote:
If you take a Core 2 and disable 1 core, rest assured, the performance of that one thread will be better than running the same thread with both cores active.
You're right that the first core doesn't gain access to the second integer core, but it does gain access to the full FP section. The floating point part is split in half when both cores need it, and each core uses half of its resources. When only a single core is active, it has access to all of it.
Could you test with heavy-load affinity on c0,c2,c4,c6 and light loads with affinity on c1,c3,c5,c7?
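In case anyone wants to script that kind of run, here is a minimal sketch using Python's psutil to pin a heavy and a light workload to those core sets; the executable names are just placeholders, not the actual benchmarks from this thread.
Code:
import subprocess
import psutil

HEAVY_CORES = [0, 2, 4, 6]   # first "core" of each module
LIGHT_CORES = [1, 3, 5, 7]   # second "core" of each module

def launch_pinned(cmd, cores):
    """Start a process and restrict it to the given logical CPUs."""
    proc = subprocess.Popen(cmd)
    psutil.Process(proc.pid).cpu_affinity(cores)
    return proc

# Placeholder executables; substitute whatever heavy/light loads you test with.
heavy = launch_pinned(["heavy_benchmark.exe"], HEAVY_CORES)
light = launch_pinned(["light_background_task.exe"], LIGHT_CORES)
heavy.wait()
light.wait()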
Why not? What stops the shared decoder from feeding instructions from the same thread to both integer clusters? It would be the magical "reverse hyperthreading" everyone talks about.
Core 2 isn't SMT enabled. You have 2 cores, 2 threads with a shared cache. If you disable one core, the other will have access to the whole cache and offer better single-threaded performance. This is what we see with the 4m/4c approach on BD. Inside the module you have 1 thread which isn't fighting for shared resources => perf. is better. Overall throughput is lower, it's basically an exercise.
Quote:
Do you mean disabling the second thread on one core? Yes, with SMT the first thread will gain a lot.
2m/4c < 4m/4c < 4m/8c, although the highest per-thread performance is with 4m/4c.
So I just glanced over your data and haven't fully digested it yet, but all the benches you ran are multithreaded. You haven't shown anything that explains single-threaded performance; rather, what you have done is force a situation where you have 4 available contexts: in one case 4 threads are scheduled across 2 modules (a sharing situation), and in the other case 4 threads are spread over 4 modules (a non-shared situation). What you are showing is the performance hit taken when resources in the front end are shared (i.e. cache, TLBs, BTBs, etc.). The results are exactly what I would have expected.
What is more interesting about your data is that you can now check AMD's claim of 180%, i.e. the 1.8 scaling factor of a module vs. two distinct cores.
Here is an experiment to do... repeat this, but with just one application (Fritz Chess), because this app allows you to specify the number of threads spawned.
Do it for 1, 2, 3, and 4 threads. Then turn everything on (all modules, clusters, cores, whatever you want to call them), do the same runs for 1, 2, 3, 4, 5, 6, 7 and 8 threads, and plot the scaling vs. thread count for the three different configurations.
You will find that the Fritz run for 1 thread will not be different regardless of how you configure the modules, clusters, or cores (again, call them whatever you want).
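A rough harness for that kind of scaling run could look like the sketch below; it assumes a command-line benchmark that takes a thread count and prints a single score, which is not how Fritz Chess is actually driven, so treat the invocation as hypothetical.
Code:
import subprocess

def run_benchmark(threads):
    """Run the (hypothetical) benchmark at a given thread count, return its score."""
    out = subprocess.run(
        ["fritz_bench.exe", "--threads", str(threads)],   # placeholder command line
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())   # assumes the tool prints one number

# Repeat this loop for each module/core configuration and compare the scaling.
baseline = run_benchmark(1)
for n in range(1, 9):
    score = run_benchmark(n)
    print(f"{n} threads: score={score:.0f}  scaling={score / baseline:.2f}x")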
jack
There is some contradictory information regarding this, even from AMD itself. One time they say both threads can utilize both FMACs; another time they say it's only possible when executing 256-bit AVX.
Because of separate register files and L1Ds? Given the additional logic needed to implement such a feature, it would be better to just drop CMT for SMT.
Expropriation of the caches alone wouldn't give as much improvement as we see here...
Quote:
Core 2 isn't SMT enabled. You have 2 cores, 2 threads with a shared cache. If you disable one core, the other will have access to the whole cache and offer better single-threaded performance. This is what we see with the 4m/4c approach on BD.
IMHO not even the expropriation of the front-end and back-end would. Expropriating the whole FPU (if that's possible), perhaps. But that only accounts for certain cases.
I don't think it's so much a failure; it apparently works for Cray and server uses, and I think the problem could and will be worked out with revisions. There's a cache invalidation issue, so 18 cycles can be wasted over and over when both clusters/cores/whatever are loaded. There's a rumor of a patch in testing from MS for Win7 that could fix the problem; if true, that would be great, as it should boost performance 10-20% for multithreaded loads (or multiple single-threaded loads).
The idea is good; the execution has its quirks/flaws. This isn't something that only happens to AMD; Intel has made these kinds of errors in the past. I have seen P4s/Xeons with an error that, when patched via microcode, lost 40% performance, and I have seen Intel chipsets that would bug out under heavy memory loads and had BIOS and driver "fixes" that crippled them (I've been doing this a long time). Intel recently put out a SB-E that has bugged (read: broken) hardware virtualization; expect the next batch/run (stepping) to be fixed. I expect AMD to do the same. If they get either the OS patch or a hardware fix (or better, both), that could really help with these issues :)
And again, I don't care if the Linux community thinks the proposed patch is a bad idea or whatever; I read the articles, and the possible problem Linus talks about is with legacy programs, in other words programs people should update or replace!
Well, I'm waiting on 2nd-gen 'dozer unless somebody gives me one as a gift :)
Could someone run the Sandra cache/memory benches and some others with 4 modules / 4 cores? Then we'll see if it's cache thrashing or not.
I would just like to point out that this thread has nearly twice as many views as the official BD reviews thread in the news section.
Links to this thread were posted in many other forums; people got excited thinking disabling cores would give you the gain that going from 2m/4c to 4m/4c gives you, I think...
THG tested the Windows 8 fix for this:
http://www.tomshardware.com/reviews/...x,3043-23.html
I'm thinking we are seeing turbo help out, not IPC bonuses.
WoW has 1 major thread that will run about 90% of a core, then a thread that takes up about 50%, and then a third that takes up about 20%. So basically it can run on as few as 2 cores and not see a performance hit when going down from 3 (or only a very small one).
Makes sense. It would be interesting to test it on Windows 8, using the default setting vs. 1 thread per module, also with turbo on and off under the different conditions, and to investigate more with single-threaded software...
DGLee: Was the turbo disabled here? I think it's important to know, because if it wasn't, the all-cores turbo could have kicked in thanks to the lower power draw and affected the results, benefiting the "sole" threads and making it look as if the sharing of some resources had more of an impact than it really does.
Windows 7 is already handling things like this for Intel processors with HT, using real cores first and logical cores later.
However, according to AMD there are situations where you don't even want this behavior.
Take a look at the first two pictures at THG:
http://www.tomshardware.co.uk/fx-815...-32295-23.html
Because of the shared L1 cache it indeed makes sense that in some cases it can be faster to use the whole module instead of splitting things up and utilizing two modules partially. This means that the scheduler has to be more intelligent, though: it's not enough to just assign each new task to a new core as it does now; instead it must be able to guess which tasks should be grouped onto one module and which should be split over two (or more) modules.
I'm no coder but I can imagine easier projects than making the scheduler aware of such a complex problem.
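Purely as an illustration of the grouping idea (a toy heuristic, nothing like how a real OS scheduler is implemented), the decision could look something like this: threads known to share data get packed onto one module, while independent threads are spread across modules.
Code:
MODULES = 4
CORES_PER_MODULE = 2

def assign(threads):
    """threads: list of (name, shares_data_with) pairs -> name -> logical CPU."""
    placement, next_module = {}, 0
    for name, partner in threads:
        if partner in placement:
            # Partner already placed: use the sibling core in the same module
            # so the two threads share the module's caches.
            module, _ = placement[partner]
            placement[name] = (module, 1)
        else:
            # Independent thread: give it the first core of a fresh module.
            placement[name] = (next_module % MODULES, 0)
            next_module += 1
    return {n: m * CORES_PER_MODULE + c for n, (m, c) in placement.items()}

# Example: {'producer': 0, 'consumer': 1, 'render': 2}
print(assign([("producer", None), ("consumer", "producer"), ("render", None)]))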
Jack, hardware.fr already tried something along those lines with 4m/4t and 2m/4t, both with Turbo on. In the first case the maximum turbo for all 4 "threads" is 3.9 GHz, since all modules are running. In the second case it's 4.2 GHz across 2 modules (4 threads). The ~7% difference in turbo clock is not nearly enough to make up for the sharing losses, as can be seen here:
http://www.hardware.fr/articles/842-...windows-8.html
Attachment 121226
4m/4t is 26% faster(!) than 2m/4t at a fixed 3.6 GHz, and 15% faster when both are running the maximum turbo they are allowed. Now comes the power draw story.
If you look at the power draw you will see the faster config is 20% more power hungry, and I suspect this is the reason why AMD didn't configure the core priorities that way. I think when PD arrives, power draw will go down enough to schedule the threads the faster way and still get good power numbers. Still, with the present BD core, for 20% more power you gain 26% more performance this way, not a bad tradeoff. If GloFo would get their act together and make it possible for AMD to produce a 3.6 GHz 5-module PD core with this thread affinity capability, it could very well be significantly more powerful than Thuban, even in ST at a fixed clock, and noticeably more powerful than BD in both ST and MT, with Turbo both on and off.
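For what it's worth, the numbers quoted above work out roughly like this (a back-of-the-envelope check, using only the figures from these posts):
Code:
turbo_gain = 4.2 / 3.9 - 1           # 2m/4t turbo vs 4m/4t turbo
perf_gain, power_gain = 1.26, 1.20   # 4m/4t vs packed config at fixed 3.6 GHz
print(f"turbo clock advantage: {turbo_gain:+.1%}")                  # ~ +7.7%
print(f"perf per watt change:  {perf_gain / power_gain - 1:+.1%}")  # ~ +5.0%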
By the way,great thread DGLee :)