@Thread starter: Good comparison. Could you also show Thuban numbers at 4C and the same clock? Also, is turbo disabled on all of them (it would be more accurate that way)?
PS: where is the thank you button? :p:
Quite possible, and I hope so. There shouldn't be this much of an impact from the sharing, and such a gain from the lack of it... It's CMT, not SMT, after all.
Although it could also be that, by deactivating every second integer cluster, power consumption went down and so the all-cores turbo could kick in, contributing significantly to the results. So I wonder whether the turbo mode was disabled or not.
EDIT: here is a similar test (Google-translated): http://translate.google.com/translat...-2%3Fstart%3D5
No, with CMT you can't give one thread all of the resources.
Do you mean disabling the second thread on one core? Yes, with SMT the first thread will gain a lot.
Quote:
If you take a Core 2 and disable 1 core, rest assured, the performance of that one thread will be better than running the same thread with both cores active.
You're right that the first core doesn't gain access to the second integer core, but it does gain access to the full FP section. The floating point part is split in half when both cores need it, and each core uses half of its resources. When only a single core is active, it has access to all of it.
Could you test with heavy-load affinity on c0,c2,c4,c6 and light loads with affinity on c1,c3,c5,c7?
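In case anyone wants to script that kind of run, here is a minimal sketch using Python's psutil to pin a heavy and a light workload to those core sets; the executable names are just placeholders, not the actual benchmarks from this thread.
Code:
import subprocess
import psutil

HEAVY_CORES = [0, 2, 4, 6]   # first "core" of each module
LIGHT_CORES = [1, 3, 5, 7]   # second "core" of each module

def launch_pinned(cmd, cores):
    """Start a process and restrict it to the given logical CPUs."""
    proc = subprocess.Popen(cmd)
    psutil.Process(proc.pid).cpu_affinity(cores)
    return proc

# Placeholder executables; substitute whatever heavy/light loads you test with.
heavy = launch_pinned(["heavy_benchmark.exe"], HEAVY_CORES)
light = launch_pinned(["light_background_task.exe"], LIGHT_CORES)
heavy.wait()
light.wait()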
Why not? What stops the shared decoder from feeding instructions from the same thread to both integer clusters? It would be the magical "reverse hyperthreading" everyone talks about.
Core 2 isn't SMT enabled. You have 2 cores, 2 threads with a shared cache. If you disable one core, the other will have access to the whole cache and offer better single-threaded performance. This is what we see with the 4m/4c approach on BD. Inside the module you have 1 thread which isn't fighting for shared resources => perf. is better. Overall throughput is lower, it's basically an exercise.
Quote:
Do you mean disabling the second thread on one core? Yes, with SMT the first thread will gain a lot.
2m/4c < 4m/4c < 4m/8c, although the highest per-thread performance is with 4m/4c.
So I just glanced over your data and haven't fully digested it yet, but all the benches you ran are multithreaded. You haven't shown anything that explains single-threaded performance; rather, what you have done is force a situation where you have 4 available contexts: in one case 4 threads are scheduled across 2 modules (a sharing situation), and in the other case 4 threads are spread over 4 modules (a non-shared situation). What you are showing is the performance hit taken when resources in the front end are shared (i.e. cache, TLBs, BTBs, etc.). The results are exactly what I would have expected.
What is more interesting about your data is that you can now check AMD's claim of 180%, i.e. the 1.8 scaling factor of a module vs. two distinct cores.
Here is an experiment to do... repeat this, but with just one application (Fritz Chess), because this app allows you to specify the number of threads spawned.
Do it for 1, 2, 3, and 4 threads. Then turn everything on (all modules, clusters, cores, whatever you want to call them), do the same runs for 1, 2, 3, 4, 5, 6, 7 and 8 threads, and plot the scaling vs. thread count for the three different configurations.
You will find that the Fritz run for 1 thread will not be different regardless of how you configure the modules, clusters, or cores (again, call them whatever you want).
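A rough harness for that kind of scaling run could look like the sketch below; it assumes a command-line benchmark that takes a thread count and prints a single score, which is not how Fritz Chess is actually driven, so treat the invocation as hypothetical.
Code:
import subprocess

def run_benchmark(threads):
    """Run the (hypothetical) benchmark at a given thread count, return its score."""
    out = subprocess.run(
        ["fritz_bench.exe", "--threads", str(threads)],   # placeholder command line
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())   # assumes the tool prints one number

# Repeat this loop for each module/core configuration and compare the scaling.
baseline = run_benchmark(1)
for n in range(1, 9):
    score = run_benchmark(n)
    print(f"{n} threads: score={score:.0f}  scaling={score / baseline:.2f}x")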
jack
There is some contradictory information regarding this, even from AMD itself. One time they say both threads can utilize both FMACs; another time they say it's only possible when executing 256-bit AVX.
Because of separate register files and L1Ds? Given the additional logic needed to implement such a feature, it would be better to just drop CMT for SMT.
Expropriation of the caches alone wouldn't give as much improvement as we see here...
Quote:
Core 2 isn't SMT enabled. You have 2 cores, 2 threads with a shared cache. If you disable one core, the other will have access to the whole cache and offer better single-threaded performance. This is what we see with the 4m/4c approach on BD.
IMHO not even the expropriation of the front-end and back-end would. Expropriating the whole FPU (if that's possible), perhaps. But that only accounts for certain cases.
I don't think it's so much a failure; it apparently works for Cray and server uses, and I think the problem could and will be worked out with revisions. There's a cache invalidation issue, so 18 cycles can be wasted over and over when both clusters/cores/whatever are loaded. There's a rumor of a patch in testing from MS for Win7 that could fix the problem; if true, that would be great, as it should boost performance 10-20% for multithreaded loads (or multiple single-threaded loads).
The idea is good; the execution has its quirks/flaws. This isn't something that only happens to AMD; Intel has made these kinds of errors in the past. I have seen P4s/Xeons with an error that, when patched via microcode, lost 40% performance, and I have seen Intel chipsets that would bug out under heavy memory loads and had BIOS and driver "fixes" that crippled them (I've been doing this a long time). Intel recently put out a SB-E that has bugged (read: broken) hardware virtualization; expect the next batch/run (stepping) to be fixed. I expect AMD to do the same. If they get either the OS patch or a hardware fix (or better, both), that could really help with these issues :)
And again, I don't care if the Linux community thinks the proposed patch is a bad idea or whatever; I read the articles, and the possible problem Linus talks about is with legacy programs, in other words programs people should update or replace!
Well, I'm waiting on 2nd-gen 'dozer unless somebody gives me one as a gift :)
Could someone run the Sandra cache/memory benches and some others with 4 modules / 4 cores? Then we'll see if it's cache thrashing or not.
I would just like to point out that this thread has nearly twice as many views as the official BD reviews thread in the news section.
Links to this thread were posted in many other forums; people got excited thinking disabling cores would give you the gain that going from 2m/4c to 4m/4c gives you, I think...
THG tested the Windows 8 fix for this:
http://www.tomshardware.com/reviews/...x,3043-23.html
I'm thinking we are seeing turbo help out, not IPC bonuses.
WoW has 1 major thread that will run about 90% of a core, then a thread that takes up about 50%, and then a third that takes up about 20%. So basically it can run on as few as 2 cores and not see a performance hit when going down from 3 (or only a very small one).
Makes sense. It would be interesting to test it on Windows 8, using the default setting vs. 1 thread per module, also with turbo on and off under the different conditions, and to investigate more with single-threaded software...
DGLee: Was the turbo disabled here? I think it's important to know, because if it wasn't, the all-cores turbo could have kicked in thanks to the lower power draw and affected the results, benefiting the "sole" threads and making it look as if the sharing of some resources had more of an impact than it really does.
Windows 7 is already handling things like this for Intel processors with HT, using real cores first and logical cores later.
However, according to AMD there are situations where you don't even want this behavior.
Take a look at the first two pictures at THG:
http://www.tomshardware.co.uk/fx-815...-32295-23.html
Because of the shared L1 cache it indeed makes sense that in some cases it can be faster to use the whole module instead of splitting things up and utilizing two modules partially. This means that the scheduler has to be more intelligent, though: it's not enough to just assign each new task to a new core as it does now; instead it must be able to guess which tasks should be grouped onto one module and which should be split over two (or more) modules.
I'm no coder but I can imagine easier projects than making the scheduler aware of such a complex problem.
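Purely as an illustration of the grouping idea (a toy heuristic, nothing like how a real OS scheduler is implemented), the decision could look something like this: threads known to share data get packed onto one module, while independent threads are spread across modules.
Code:
MODULES = 4
CORES_PER_MODULE = 2

def assign(threads):
    """threads: list of (name, shares_data_with) pairs -> name -> logical CPU."""
    placement, next_module = {}, 0
    for name, partner in threads:
        if partner in placement:
            # Partner already placed: use the sibling core in the same module
            # so the two threads share the module's caches.
            module, _ = placement[partner]
            placement[name] = (module, 1)
        else:
            # Independent thread: give it the first core of a fresh module.
            placement[name] = (next_module % MODULES, 0)
            next_module += 1
    return {n: m * CORES_PER_MODULE + c for n, (m, c) in placement.items()}

# Example: {'producer': 0, 'consumer': 1, 'render': 2}
print(assign([("producer", None), ("consumer", "producer"), ("render", None)]))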
Jack, hardware.fr already tried something along those lines with 4m/4t and 2m/4t, both with Turbo on. In the first case the maximum turbo for all 4 "threads" is 3.9 GHz, since all modules are running. In the second case it's 4.2 GHz across 2 modules (4 threads). The ~7% difference in turbo clock is not nearly enough to make up for the sharing losses, as can be seen here:
http://www.hardware.fr/articles/842-...windows-8.html
Attachment 121226
4m/4t is 26% faster(!) than 2m/4t at a fixed 3.6 GHz, and 15% faster when both are running the maximum turbo they are allowed. Now comes the power draw story.
If you look at the power draw you will see the faster config is 20% more power hungry, and I suspect this is the reason why AMD didn't configure the core priorities that way. I think when PD arrives, power draw will go down enough to schedule the threads the faster way and still get good power numbers. Still, with the present BD core, for 20% more power you gain 26% more performance this way, not a bad tradeoff. If GloFo would get their act together and make it possible for AMD to produce a 3.6 GHz 5-module PD core with this thread affinity capability, it could very well be significantly more powerful than Thuban, even in ST at a fixed clock, and noticeably more powerful than BD in both ST and MT, with Turbo both on and off.
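For what it's worth, the numbers quoted above work out roughly like this (a back-of-the-envelope check, using only the figures from these posts):
Code:
turbo_gain = 4.2 / 3.9 - 1           # 2m/4t turbo vs 4m/4t turbo
perf_gain, power_gain = 1.26, 1.20   # 4m/4t vs packed config at fixed 3.6 GHz
print(f"turbo clock advantage: {turbo_gain:+.1%}")                  # ~ +7.7%
print(f"perf per watt change:  {perf_gain / power_gain - 1:+.1%}")  # ~ +5.0%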
By the way,great thread DGLee :)