@Thread starter: Good comparison. Could you also show Thuban numbers at 4C and the same clock? Also, is turbo disabled on all? (It would be more accurate that way.)
PS: where is the thank you button?
Va fail, dh'oine.
"I am going to hunt down people who have strong opinions on subjects they dont understand " - Dogbert
Always rooting for the underdog ...
Quite possible, and I hope so. There shouldn't be this much impact and then the lack of it... It's CMT, not SMT, after all.
Although, it could also be that by deactivating every second integer cluster, power consumption went down and the all-cores turbo could kick in, contributing significantly to the results. So, I wonder if turbo mode was disabled or not.
EDIT: here is a similar test (Google-translated): http://translate.google.com/translat...-2%3Fstart%3D5
No, you can't give one thread the whole module's resources with CMT.
Do you mean disabling the second thread on one core? Yes, with SMT the first thread would gain a lot. If you take a Core 2 and disable one core, rest assured, the performance of that one thread will be better than running the same thread with both cores active.
You're right that the first core doesn't gain access to the second integer cluster, but it does gain access to the full FP section. The floating-point part of the module is split in half when both cores need it, each core using half of its resources. When only a single core is active, it has access to all of it.
Rig 1:
ASUS P8Z77-V
Intel i5 3570K @ 4.75GHz
16GB of Team Xtreme DDR-2666 RAM (11-13-13-35-2T)
Nvidia GTX 670 4GB SLI
Rig 2:
Asus Sabertooth 990FX
AMD FX-8350 @ 5.6GHz
16GB of Mushkin DDR-1866 RAM (8-9-8-26-1T)
AMD 6950 with 6970 bios flash
Yamakasi Catleap 2B overclocked to 120Hz refresh rate
Audio-GD FUN DAC unit w/ AD797BRZ opamps
Sennheiser PC350 headset w/ hero mod
Could you test with heavy-load affinity on c0,c2,c4,c6 and light loads on c1,c3,c5,c7?
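For anyone who wants to try that pinning scheme: on Windows you can launch a process restricted to specific cores with `start /affinity <hexmask>` (on Linux, `taskset <hexmask>`), where the mask has one bit per logical core. A small sketch of building those masks; the assumption here is that the even-numbered logical cores are one per module, which is how the OS appears to enumerate BD cores:

```python
# Build a CPU-affinity bitmask from a list of logical core indices.
# Usage afterwards (Windows): start /affinity 55 your_app.exe
#                  (Linux):   taskset 0x55 your_app

def affinity_mask(cores):
    """One bit per logical core: core N sets bit N of the mask."""
    mask = 0
    for c in cores:
        mask |= 1 << c
    return mask

heavy = affinity_mask([0, 2, 4, 6])   # first core of each module (assumed numbering)
light = affinity_mask([1, 3, 5, 7])   # second core of each module

print(hex(heavy))  # 0x55
print(hex(light))  # 0xaa
```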
Why not? What stops the shared decoder from feeding instructions from the same thread to both integer clusters? It would be the magical "reverse hyperthreading" everyone talks about.
Core 2 isn't SMT-enabled. You have 2 cores, 2 threads with a shared cache. If you disable one core, the other will have access to the whole cache and offer better single-threaded performance. This is what we see with the 4m/4c approach on BD. Inside the module you have 1 thread which isn't fighting for shared resources, so performance is better. Overall throughput is lower; it's basically an exercise.
2m/4c < 4m/4c < 4m/8c in throughput, although the highest per-thread performance is with 4m/4c.
2500k @ 4900mhz - Asus Maxiums IV Gene Z - Swiftech Apogee LP
GTX 680 @ +170 (1267mhz) / +300 (3305mhz) - EK 680 FC EN/Acteal
Swiftech MCR320 Drive @ 1300rpms - 3x GT 1850s @ 1150rpms
XS Build Log for: My Latest Custom Case
So I just glanced over your data and have not fully digested it yet, but all the benches you ran are multithreaded. You have not shown anything that explains single-threaded performance; rather, what you have done is forced a situation where you have 4 available contexts: in one case 4 threads are scheduled across 2 modules (a sharing situation) and in the other case 4 threads are spread over 4 modules (a non-sharing situation). What you are showing is the performance hit taken when resources in the front end are shared (i.e. cache, TLBs, BTBs, etc.). The results are exactly what I would have expected.
What is more interesting about your data is that you can now test AMD's claim of 180%, i.e. the 1.8x scaling factor of a module vs. two distinct cores.
Here is an experiment to do: repeat this, but with just one application (Fritz Chess), because this app allows you to specify the number of threads spawned.
Do it for 1, 2, 3, and 4 threads. Then turn everything on (all modules, clusters, cores, whatever you want to call them) and do the same runs for 1 through 8 threads, and plot the scaling vs. thread count for the three different configurations.
You will find that the fritz run for 1 thread will not be different regardless of how you configure the modules, clusters, cores (again, call them whatever you want to call them).
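The scaling runs described above could be scripted. Here is a rough sketch; a generic CPU-bound `burn` loop is my stand-in for a Fritz worker (an assumption, not the real benchmark), and it uses processes rather than Python threads so the measurement isn't serialized by the interpreter:

```python
import time
from multiprocessing import Pool

def burn(n):
    """CPU-bound stand-in for one Fritz worker: n dummy iterations."""
    acc = 0
    for i in range(n):
        acc += i * i
    return acc

def throughput(workers, work=1_000_000):
    """Total iterations/second with `workers` parallel processes.

    Rough by design: Pool startup cost is included in the timing.
    """
    t0 = time.perf_counter()
    with Pool(workers) as p:
        p.map(burn, [work] * workers)
    return workers * work / (time.perf_counter() - t0)

if __name__ == "__main__":
    base = throughput(1)
    # Scaling factor vs. worker count; rerun under 2m/4c, 4m/4c and 4m/8c
    # BIOS configs and compare the curves.
    for n in (1, 2, 3, 4):
        print(f"{n} workers: scaling {throughput(n) / base:.2f}x")
```

Comparing the 1-worker result across configs should confirm Jack's point that a single thread is unaffected by how the modules are configured.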
jack
One hundred years from now
It won't matter
What kind of car I drove,
What kind of house I lived in,
How much money I had in the bank,
Nor what my clothes looked like...
But the world may be a little better
Because I was important in the life of a child.
-- from "Within My Power" by Forest Witcraft
There is some contradictory information regarding this, even from AMD itself. One time they say both threads can utilize both FMACs; another time they say that's only when executing 256-bit AVX.
Because of separate register files and L1Ds? Given the additional logic needed to implement such a feature, it would be better to just drop CMT for SMT.
Expropriation of the caches alone wouldn't give as much improvement as we see here...
IMHO, not even expropriation of the front-end and back-end would. Of the whole FPU (if that's possible), perhaps. But that accounts for certain cases only.
JF-AMD / Hans de Vries / informal posting: IPC increases!!!!!!! How many times did I tell you!!!
terrace215 post: IPC decreases, The more I post the more it decreases.
terrace215 post: IPC decreases, The more I post the more it decreases.
terrace215 post: IPC decreases, The more I post the more it decreases.
.....}
until (12th October 2011)
AMD FX-8350 (1237 PGN) | Asus Crosshair V Formula (bios 1703) | G.Skill 2133 CL9 @ 2230 9-11-10 | Sapphire HD 6870 | Samsung 830 128Gb SSD / 2 WD 1Tb Black SATA3 storage | Corsair TX750 PSU
Watercooled ST 120.3 & TC 120.1 / MCP35X XSPC Top / Apogee HD Block | WIN7 64 Bit HP | Corsair 800D Obsidian Case
First Computer: Commodore Vic 20 (circa 1981).
I don't think it's so much a failure; it apparently works for Cray and server uses. I think the problem could and will be worked out with revisions. There's a cache invalidation issue, so 18 cycles can be wasted over and over when both clusters/cores/whatever are loaded. There's a rumor of a patch in testing from MS for Win7 that could fix the problem; if true, that would be great, as it should boost performance 10-20% for multithreaded loads (or multiple single-threaded loads).
The idea is good; the execution has its quirks/flaws. This isn't something that only happens to AMD; Intel has made these kinds of errors in the past. I have seen P4s/Xeons with an error that, when patched via microcode, lost 40% performance. I have seen Intel chipsets that would bug out under heavy memory loads and got BIOS and driver "fixes" that crippled them (I've been doing this a long time). Intel recently put out an SB-E with bugged (see: broken) hardware virtualization; expect the next batch/run (stepping) to be fixed. I expect AMD to do the same: if they get either the OS patch or a hardware fix (or better, both), that could really help with these issues.
And again, I don't care if the Linux community thinks the proposed patch is a bad idea or whatever. I read the articles, and the possible problem Linus talks about is with legacy programs, in other words programs people should update or replace!!!!
Well, I'm waiting on 2nd-gen 'dozer, unless somebody gives me one as a gift.
Could someone run the Sandra cache and memory benches, and the other ones, with 4 modules / 4 cores? Then we'll see if it's cache thrashing or not.
I would just like to point out that this thread has nearly twice as many views as the official BD reviews thread in the news section.
Links to this thread were posted on many other forums; people got excited thinking disabling cores would give you the gain that going from 2m/4c to 4m/4c gives you, I think....
THG tested the Windows 8 fix for this:
http://www.tomshardware.com/reviews/...x,3043-23.html
I'm thinking we are seeing turbo help out, not IPC bonuses.
WoW has 1 major thread that will use about 90% of a core, then a thread that takes up about 50%, and then a third that takes up maybe 20%. So basically it can run on as few as 2 cores and not see a perf hit when going down from 3 (or very little of one).
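Those utilization figures (rough estimates, not measured data) add up the way you'd expect; since the heaviest thread needs less than a full core, the other two can share a second core:

```python
import math

# Rough per-thread core utilization for WoW, as estimated above (assumptions).
loads = [0.90, 0.50, 0.20]

total = sum(loads)               # ~1.6 "cores" of aggregate work
cores_needed = math.ceil(total)  # lower bound; packing works since max(loads) < 1.0

print(round(total, 2), cores_needed)  # 1.6 cores of load -> fits on 2 cores
```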
Makes sense. It would be interesting to test it on Windows 8, using the default setting vs. 1 thread per module, also with turbo on and off under the different conditions, and to investigate more with single-threaded software...
FX-8350(1249PGT) @ 4.7ghz 1.452v, Swiftech H220x
Asus Crosshair Formula 5 Am3+ bios v1703
G.skill Trident X (2x4gb) ~1200mhz @ 10-12-12-31-46-2T @ 1.66v
MSI 7950 TwinFrozr *1100/1500* Cat.14.9
OCZ ZX 850w psu
Lian-Li Lancool K62
Samsung 830 128g
2 x 1TB Samsung SpinpointF3, 2T Samsung
Win7 Home 64bit
My Rig
DGLee: Was turbo disabled here? I think it's important to know: if it wasn't, the all-cores turbo could have kicked in because of the lower power draw, affecting the results by benefiting the "solo" threads, making it look as if the sharing of some resources had more of an impact than it really does.
Windows 7 is already handling things like this for Intel processors with HT, using real cores first and logical cores later.
However, according to AMD there are situations where you don't even want this behavior.
Take a look at the first two pictures at THG:
http://www.tomshardware.co.uk/fx-815...-32295-23.html
Because of the shared L1 cache, it indeed makes sense that in some cases it can be faster to use the whole module instead of splitting things up and utilizing two modules partially. This means the scheduler has to be more intelligent, though: it's not enough to just assign each new task to a new core like now; instead it must be able to guess which tasks should be grouped onto one module and which should be split over two (or more) modules.
I'm no coder, but I can imagine easier projects than making the scheduler aware of such a complex problem.
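As a toy illustration of what "module-aware" placement means (the (0,1)/(2,3)/... sibling numbering is an assumption, and a real scheduler decides this dynamically per workload), here is a sketch of the two placement policies:

```python
# Toy model of a 4-module BD chip: 2 sibling cores per module.
MODULES = [(0, 1), (2, 3), (4, 5), (6, 7)]

def place_grouped(n_threads):
    """Pack threads pairwise onto modules (good when threads share data/cache)."""
    cores = [c for module in MODULES for c in module]
    return cores[:n_threads]

def place_spread(n_threads):
    """One thread per module first (each thread gets a whole module's resources)."""
    first = [m[0] for m in MODULES]
    second = [m[1] for m in MODULES]
    return (first + second)[:n_threads]

print(place_grouped(4))  # [0, 1, 2, 3] -> two modules fully loaded, two parked
print(place_spread(4))   # [0, 2, 4, 6] -> one thread per module, nothing shared
```

The Windows 8 fix discussed in the thread is essentially about choosing between these two policies sensibly instead of treating all eight cores as interchangeable.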
Power Rig: Core i7-5930K, ASRock X99 Extreme6/3.1, 16GB G.Skill DDR4-2400, Asus Strix GTX980 OC
Time Sink: Core i7-5775C, ASRock Z97E-ITX/ac, 16GB AMD DDR3-2133, Silverstone PT-09 w/ 120W Power Brick
HTPC: Athlon 5350, ASRock AM1H-ITX, 4GB DDR3, Supermicro SC-101i
Jack, hardware.fr already tried something along those lines with 4m/4t and 2m/4t, both with Turbo on. In the first case the maximum turbo for all 4 threads is 3.9GHz, since all modules are running. In the second case it's 4.2GHz across 2 modules (4 threads). The ~7% difference in turbo clock is not nearly enough to make up for the sharing losses, as can be seen here:
http://www.hardware.fr/articles/842-...windows-8.html
4m/4t is 26% faster(!) than 2m/4t at a fixed 3.6GHz, and 15% faster when both are running the maximum turbo modes allowed. Now comes the power draw story.
If you look at the power draw, you will see the faster config is 20% more power hungry, and I suspect this is the reason why AMD didn't configure the core priorities that way. I think when PD arrives, power draw will go down enough to schedule the threads the faster way and still get good power numbers. Still, with the present BD core, for 20% more power you gain 26% more performance this way; not a bad tradeoff. If GloFo would get their act together and make it possible for AMD to produce a 3.6GHz 5-module PD core with this thread-affinity capability, it could very well be significantly more powerful than Thuban, even in ST at a fixed clock, and noticeably more powerful than BD in both ST and MT, with Turbo both on and off.
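Quick sanity check on those numbers (using the hardware.fr figures quoted above):

```python
# Figures from the post: 3.9 vs 4.2GHz turbo ceilings, and
# 26% more performance for 20% more power with the spread placement.
turbo_gain = 4.2 / 3.9 - 1          # 2-module turbo vs all-module turbo
perf_per_watt_ratio = 1.26 / 1.20   # spread vs grouped, perf gain over power cost

print(f"turbo headroom: {turbo_gain:.1%}")            # ~7.7%, the "~7%" above
print(f"perf/watt ratio: {perf_per_watt_ratio:.2f}")  # ~1.05 -> still a net win
```

So even charging the full 20% power cost against it, the spread placement comes out roughly 5% ahead in perf/watt, which supports the "not a bad tradeoff" conclusion.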
By the way, great thread, DGLee.