Page 3 of 11
Results 51 to 75 of 267

Thread: AMD FX "Bulldozer" Review - (4) !exclusive! Excuse for 1-Threaded Perf.

  1. #51
    Xtreme Member
    Join Date
    Mar 2009
    Location
    Unknown
    Posts
    266
    @Thread starter: Good comparison. Could you also show Thuban numbers with 4 cores at the same clock? Also, is turbo disabled on all of them? (It would be more accurate that way.)

    PS: Where is the thank-you button?
    Va fail, dh'oine.

    "I am going to hunt down people who have strong opinions on subjects they dont understand " - Dogbert

    Always rooting for the underdog ...

  2. #52
    Xtreme Member
    Join Date
    Nov 2007
    Posts
    103
    Quote Originally Posted by Particle View Post
    I appreciate your efforts. However, it's likely that you're largely seeing the performance hit of the cache thrashing issue recently discovered since disabling a core in each module would prevent that contention.
    Quite possible, and I hope so. There shouldn't be this much impact and then the lack of it... It's CMT, not SMT, after all.

    Although it could also be that deactivating every second integer cluster lowered power consumption, so the all-cores turbo could kick in and contribute significantly to the results. So I wonder whether the turbo modes were disabled or not.

    EDIT: here is a similar test (Googlish): http://translate.google.com/translat...-2%3Fstart%3D5

    Quote Originally Posted by savantu View Post
    I'm a bit puzzled by what's the news here. You've basically proven an axiom: in any CPU where you have resource sharing among 2 threads, running only one (thus giving it all the resources) will make it run better.
    No, you can't give it the whole resources on CMT.

    If you take a Core 2 and disable 1 core, rest assured, the performance of that one thread will be better than running the same thread with both cores active.
    Do you mean disabling the second thread on one core? Yes, with SMT the first thread will gain a lot.
    Last edited by dess; 10-13-2011 at 05:02 AM.

  3. #53
    Xtreme Addict
    Join Date
    Oct 2006
    Posts
    2,141
    Quote Originally Posted by dess View Post
    No, you can't give it the whole resources on CMT.
    You're right that the first core doesn't gain access to the second integer core, but it does gain access to the full FP section. The floating-point part of the module is split in half when both cores need it, and each core uses half of its resources. When only a single core is active, it has access to all of it.
    Rig 1:
    ASUS P8Z77-V
    Intel i5 3570K @ 4.75GHz
    16GB of Team Xtreme DDR-2666 RAM (11-13-13-35-2T)
    Nvidia GTX 670 4GB SLI

    Rig 2:
    Asus Sabertooth 990FX
    AMD FX-8350 @ 5.6GHz
    16GB of Mushkin DDR-1866 RAM (8-9-8-26-1T)
    AMD 6950 with 6970 bios flash

    Yamakasi Catleap 2B overclocked to 120Hz refresh rate
    Audio-GD FUN DAC unit w/ AD797BRZ opamps
    Sennheiser PC350 headset w/ hero mod

  4. #54
    Xtreme Enthusiast
    Join Date
    Nov 2009
    Posts
    526
    Could you test with heavy loads at affinity c0,c2,c4,c6 and light loads at affinity c1,c3,c5,c7?
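
    A minimal sketch of how those two affinity sets could be written as bitmasks, assuming the OS numbers the cores so that c0/c1 share module 0, c2/c3 share module 1, and so on (that numbering is an assumption, and `affinity_mask` is only an illustrative helper name):

```python
def affinity_mask(cores):
    """Return an integer bitmask with one bit set per logical core index."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


# Assumed layout: even indices are the first core of each module,
# odd indices are the second core of the same module.
heavy = affinity_mask([0, 2, 4, 6])
light = affinity_mask([1, 3, 5, 7])

print(hex(heavy), hex(light))  # 0x55 0xaa
```

    On Windows a hex mask like this can be handed to `start /affinity`, e.g. `start /affinity 55 app.exe` (`app.exe` being a placeholder for the benchmark).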

  5. #55
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by dess View Post
    ..

    No, you can't give it the whole resources on CMT.
    Why not? What stops the shared decoder from feeding instructions from the same thread to both integer clusters? It would be the magical "reverse hyperthreading" everyone talks about.

    Do you mean disabling the second thread on one core? Yes, with SMT the first thread will earn much.
    Core 2 isn't SMT-enabled. You have 2 cores, 2 threads with a shared cache. If you disable one core, the other will have access to the whole cache and offer better single-threaded performance. This is what we see with the 4m/4c approach on BD. Inside the module you have 1 thread which isn't fighting for shared resources => perf. is better. Overall throughput is lower; it's basically an exercise.
    2m/4c < 4m/4c < 4m/8c, although the highest per-thread performance is with 4m/4c.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  6. #56
    I am Xtreme
    Join Date
    Dec 2007
    Posts
    7,750
    thanks!
    we can see a few games were at the same speed while others got 5% faster with disabled cores.
    2500k @ 4900mhz - Asus Maxiums IV Gene Z - Swiftech Apogee LP
    GTX 680 @ +170 (1267mhz) / +300 (3305mhz) - EK 680 FC EN/Acteal
    Swiftech MCR320 Drive @ 1300rpms - 3x GT 1850s @ 1150rpms
    XS Build Log for: My Latest Custom Case

  7. #57
    Xtreme Mentor
    Join Date
    Mar 2006
    Posts
    2,978
    So I just glanced over your data and have not fully digested it yet, but all the benches you ran are multithreaded. You have not shown anything that explains single-threaded performance; rather, you have forced a situation where you have 4 available contexts: in one case 4 threads are scheduled across 2 modules (a sharing situation), and in the other 4 threads are spread over 4 modules (a non-shared situation). What you are showing is the performance hit taken when resources in the front end are shared (e.g. cache, TLBs, BTBs, etc.). The results are exactly what I would have expected.

    What is more interesting about your data is that you can now check AMD's claim of 180%, i.e. that 1.8 scaling factor of a module vs two distinct cores.

    Here is an experiment to do... repeat this, but with just one application (Fritz Chess), because this app allows you to specify the number of threads spawned.

    Do it for 1, 2, 3, and 4 threads. Then turn everything on (all modules, clusters, cores, whatever you want to call them), and do the same runs for 1, 2, 3, 4, 5, 6, 7 and 8 threads, and plot the scaling vs thread count for the three different configurations.

    You will find that the fritz run for 1 thread will not be different regardless of how you configure the modules, clusters, cores (again, call them whatever you want to call them).
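
    For anyone running that experiment, here is a rough sketch of how the scaling numbers could be worked out from the scores. The helper names and the example scores are placeholders I made up for illustration, not measured results:

```python
def scaling(scores):
    """Map thread count -> speedup relative to the 1-thread score."""
    base = scores[1]
    return {threads: score / base for threads, score in sorted(scores.items())}


def module_scaling(one_thread_score, two_threads_one_module_score):
    """Throughput of two threads sharing one module, relative to one thread."""
    return two_threads_one_module_score / one_thread_score


# Placeholder scores only (NOT benchmark data), chosen to mirror a 1.8x claim.
example_scores = {1: 100.0, 2: 180.0}

print(scaling(example_scores))        # {1: 1.0, 2: 1.8}
print(module_scaling(100.0, 180.0))   # 1.8
```

    Feeding in the real Fritz scores for each configuration gives the three curves to plot against thread count.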

    jack
    One hundred years from now It won't matter
    What kind of car I drove What kind of house I lived in
    How much money I had in the bank Nor what my cloths looked like.... But The world may be a little better Because, I was important In the life of a child.
    -- from "Within My Power" by Forest Witcraft

  8. #58
    Xtreme Member
    Join Date
    Nov 2007
    Posts
    103
    Quote Originally Posted by EniGmA1987 View Post
    You're right that the first core doesn't gain access to the second integer core, but it does gain access to the full FP section. The floating-point part of the module is split in half when both cores need it, and each core uses half of its resources. When only a single core is active, it has access to all of it.
    There is some contradictory information regarding this, even from AMD itself. One time they say both threads can utilize both FMACs, another time they say that's only when executing 256-bit AVX.

    Quote Originally Posted by savantu View Post
    Why not? What stops the shared decoder from feeding instructions from the same thread to both integer clusters? It would be the magical "reverse hyperthreading" everyone talks about.
    Because of the separate register files and L1Ds? Given the additional logic needed to implement such a feature, it would be better to just drop CMT for SMT.

    Core 2 isn't SMT-enabled. You have 2 cores, 2 threads with a shared cache. If you disable one core, the other will have access to the whole cache and offer better single-threaded performance. This is what we see with the 4m/4c approach on BD.
    Having the caches to itself wouldn't give as much improvement as we see here...

    IMHO not even having the whole front end and back end to itself would. Having the whole FPU (if that's possible), perhaps. But that only accounts for certain cases.

  9. #59
    Registered User
    Join Date
    Aug 2011
    Posts
    73
    Quote Originally Posted by Manicdan View Post
    thanks!
    we can see a few games were at the same speed while others got 5% faster with disabled cores.
    yes, still not enough to match phenom II at the same clocks, if you look at their review

    looks like I'm keeping my X6 1055 for a while
    JF-AMD / Hans de Vries / informal posting: IPC increases!!!!!!! How many times did I tell you!!!

    terrace215 post: IPC decreases, The more I post the more it decreases.
    terrace215 post: IPC decreases, The more I post the more it decreases.
    terrace215 post: IPC decreases, The more I post the more it decreases.
    .....}
    until (12th October 2011)

  10. #60
    I am Xtreme
    Join Date
    Dec 2007
    Posts
    7,750
    Quote Originally Posted by Brice MJ View Post
    yes, still not enough to match phenom II at the same clocks, if you look at their review

    looks like I'm keeping my X6 1055 for a while
    yup, and it's a shame too; a 4M/4C at 5GHz+ with the same IPC as PII would have been much more respectable.

  11. #61
    Xtreme Mentor
    Join Date
    Dec 2007
    Location
    State of Confusion, USA
    Posts
    2,513
    Quote Originally Posted by bondhahnmrt85 View Post
    Maybe need to patch for running core priority.

    Now : Core 0 --> Core 1 --> Core 2 --> Core 3 --> Core 4 --> Core 5 --> Core 6 --> Core 7
    Right priority : Core 0 --> Core 2 --> Core 4 --> Core 6 --> Core 1 --> Core 3 --> Core 5 --> Core 7

    CMIIW
    I don't have a lot of knowledge about the inner workings of a CPU, but to a layman this sounds brilliant and should be easy enough to implement.
    Thoughts?
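
    Just to illustrate the ordering, here is a toy sketch that spits out the "right priority" list, assuming cores 2k and 2k+1 sit in the same module. `module_first_order` is a made-up name, and this is not how any real OS scheduler is written:

```python
def module_first_order(num_modules, cores_per_module=2):
    """List core indices so every module gets one thread before any gets a second."""
    order = []
    for slot in range(cores_per_module):       # 0 = first core in a module, 1 = second
        for module in range(num_modules):
            order.append(module * cores_per_module + slot)
    return order


print(module_first_order(4))  # [0, 2, 4, 6, 1, 3, 5, 7]
```

    The list is the easy part; the harder part is the scheduler deciding when two threads actually belong on the same module.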
    AMD FX-8350 (1237 PGN) | Asus Crosshair V Formula (bios 1703) | G.Skill 2133 CL9 @ 2230 9-11-10 | Sapphire HD 6870 | Samsung 830 128Gb SSD / 2 WD 1Tb Black SATA3 storage | Corsair TX750 PSU
    Watercooled ST 120.3 & TC 120.1 / MCP35X XSPC Top / Apogee HD Block | WIN7 64 Bit HP | Corsair 800D Obsidian Case

    First Computer: Commodore Vic 20 (circa 1981).

  12. #62
    Registered User
    Join Date
    Apr 2010
    Posts
    57
    Quote Originally Posted by savantu View Post
    I'm a bit puzzled by what's the news here. You've basically proven an axiom: in any CPU where you have resource sharing among 2 threads, running only one (thus giving it all the resources) will make it run better.
    If you take a Core 2 and disable 1 core, rest assured, the performance of that one thread will be better than running the same thread with both cores active.

    The whole point of AMD's approach is to avoid exactly that: don't make a fat core (what you're suggesting), but skinny ones and lots of them. On desktop, as BD has proven, this is a failure.

    AMD could have created BD as a 4 core with each module being transformed in a fat core, but that's SB reloaded.
    I don't think it's so much a failure; it apparently works for Cray and server uses. I think the problem could and will be worked out with revisions. There's a cache invalidation issue, so 18 cycles can be wasted over and over when both clusters/cores/wtfe are loaded. There's a rumor of a patch in testing from MS for Win7 that could fix the problem; if true, that would be great, as it should boost perf 10-20% for multi-threaded loads (or multiple single-threaded loads).

    The idea is good, the execution has its quirks/flaws. This isn't something that only happens to AMD; Intel has made these kinds of errors in the past. I have seen P4s/Xeons with an error that, when patched via microcode, lost 40% perf. I have seen Intel chipsets that would bug out under heavy memory loads and had BIOS and driver "fixes" that crippled them (been doing this a long time). Intel recently put out an SB-E that has bugged (read: broken) hardware virtualization; expect the next batch/run (stepping) to be fixed. I expect AMD to do the same. If they get either the OS patch or a hardware fix (or better, both), that could really help with these issues.

    And again, I don't care if the Linux community thinks the proposed patch is a bad idea or whatever. I read the articles, and the possible problem Linus talks about is with legacy programs, in other words programs people should update or replace!!!!

    Well, I'm waiting on 2nd-gen Dozer unless somebody gives me one as a gift.

  13. #63
    Xtreme Mentor
    Join Date
    May 2008
    Location
    cleveland ohio
    Posts
    2,879
    Could someone run the Sandra cache/memory benches and the other ones with 4 modules/4 cores? Then we'll see if it's cache thrashing or not.
    HAVE NO FEAR!
    "AMD fallen angel"
    Quote Originally Posted by Gamekiller View Post
    You didn't get the memo? 1 hour 'Fugger time' is equal to 12 hours of regular time.

  14. #64
    I am Xtreme
    Join Date
    Dec 2007
    Posts
    7,750
    i would just like to point out that this thread has nearly twice as many views as the official BD reviews thread in the news section.

  15. #65
    Banned Movieman...
    Join Date
    May 2009
    Location
    illinois
    Posts
    1,809
    Quote Originally Posted by Manicdan View Post
    i would just like to point out that this thread has nearly twice as many views as the official BD reviews thread in the news section.
    i think people are afraid to post in it.

  16. #66
    Xtreme Member
    Join Date
    Jan 2004
    Posts
    393
    Quote Originally Posted by Manicdan View Post
    i would just like to point out that this thread has nearly twice as many views as the official BD reviews thread in the news section.
    links to this thread were posted in many other forums, people got excited thinking disabling cores would give you the gain that going from 2m/4c to 4m/4c gives you, I think....

    THG tested the Windows 8 fix for this:
    http://www.tomshardware.com/reviews/...x,3043-23.html

  17. #67
    I am Xtreme
    Join Date
    Dec 2007
    Posts
    7,750
    Quote Originally Posted by Spectrobozo View Post
    THG tested the Windows 8 fix for this:
    http://www.tomshardware.com/reviews/...x,3043-23.html
    That's not the same as limiting it to one core per module though, is it? I think it's actually the opposite: it sends them all to a few modules so it can shut off the others and turbo better.

  18. #68
    Xtreme Member
    Join Date
    Jan 2004
    Posts
    393
    Quote Originally Posted by Manicdan View Post
    That's not the same as limiting it to one core per module though, is it? I think it's actually the opposite: it sends them all to a few modules so it can shut off the others and turbo better.
    You are right, but I think they are achieving a similar result by making better use of the resources. The framerate in WoW (single or dual thread?) increased by a good margin (but is that related just to this, or to other Windows or driver changes?).

  19. #69
    I am Xtreme
    Join Date
    Dec 2007
    Posts
    7,750
    Quote Originally Posted by Spectrobozo View Post
    You are right, but I think they are achieving a similar result by making better use of the resources. The framerate in WoW (single or dual thread?) increased by a good margin (but is that related just to this, or to other Windows or driver changes?).
    I'm thinking we are seeing turbo help out, not IPC bonuses.
    WoW is 1 major thread that will use about 90% of a core, then a thread that takes up about 50%, and then a third that takes up like 20%. So basically it can be run on as few as 2 cores and not see a perf hit when going down from 3 (or very little of a hit).

  20. #70
    Xtreme Member
    Join Date
    Jan 2004
    Posts
    393
    makes sense, it would be interesting to test it on Windows 8, using the default setting vs 1t per module, also turbo on and off with the different conditions, and investigate more with a single thread software...

  21. #71
    Xtreme Mentor
    Join Date
    Feb 2007
    Location
    West hartford, CT
    Posts
    2,804
    Quote Originally Posted by Manicdan View Post
    i would just like to point out that this thread has nearly twice as many views as the official BD reviews thread in the news section.
    i posted it in the comments section at techreport

    wanna see if they can talk about in their next podcast and see if Scott will do some testing on it
    FX-8350(1249PGT) @ 4.7ghz 1.452v, Swiftech H220x
    Asus Crosshair Formula 5 Am3+ bios v1703
    G.skill Trident X (2x4gb) ~1200mhz @ 10-12-12-31-46-2T @ 1.66v
    MSI 7950 TwinFrozr *1100/1500* Cat.14.9
    OCZ ZX 850w psu
    Lian-Li Lancool K62
    Samsung 830 128g
    2 x 1TB Samsung SpinpointF3, 2T Samsung
    Win7 Home 64bit
    My Rig

  22. #72
    Xtreme Mentor
    Join Date
    Feb 2007
    Location
    West hartford, CT
    Posts
    2,804
    Quote Originally Posted by Spectrobozo View Post

    THG tested the Windows 8 fix for this:
    http://www.tomshardware.com/reviews/...x,3043-23.html
    we need an "AMD bulldozer optimizer" driver for win7 stat! lol

  23. #73
    Xtreme Member
    Join Date
    Nov 2007
    Posts
    103
    DGLee: Was the turbo disabled here? I think it's important to know, because if it wasn't, the all-cores turbo could have kicked in thanks to the lower power draw and affected the results, benefiting the "sole" threads and making the sharing of some resources look like more of an impact than it really is.

  24. #74
    Registered User
    Join Date
    Oct 2005
    Location
    Austria
    Posts
    68
    Quote Originally Posted by bondhahnmrt85 View Post
    Maybe need to patch for running core priority.

    Now : Core 0 --> Core 1 --> Core 2 --> Core 3 --> Core 4 --> Core 5 --> Core 6 --> Core 7
    Right priority : Core 0 --> Core 2 --> Core 4 --> Core 6 --> Core 1 --> Core 3 --> Core 5 --> Core 7

    CMIIW
    Windows 7 is already handling things like this for Intel processors with HT, using real cores first and logical cores later.

    However, according to AMD there are situations where you don't even want this behavior.
    Take a look at the first two pictures at THG:
    http://www.tomshardware.co.uk/fx-815...-32295-23.html

    Because of the shared L1 cache it does make sense that in some cases it can be faster to use a whole module instead of splitting things up and partially utilizing two modules. This means the scheduler has to be more intelligent, though: it's not enough to just assign each new task to a new core like now. Instead it must be able to guess which tasks should be grouped onto one module and which should be split over two (or more) modules.

    I'm no coder but I can imagine easier projects than making the scheduler aware of such a complex problem.
    Power Rig: Core i7-5930K, ASRock X99 Extreme6/3.1, 16GB G.Skill DDR4-2400, Asus Strix GTX980 OC
    Time Sink: Core i7-5775C, ASRock Z97E-ITX/ac, 16GB AMD DDR3-2133, Silverstone PT-09 w/ 120W Power Brick
    HTPC: Athlon 5350, ASRock AM1H-ITX, 4GB DDR3, Supermicro SC-101i

  25. #75
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    Quote Originally Posted by JumpingJack View Post
    So I just glanced over your data and have not fully digested it yet, but all the benches you ran are multithreaded. You have not shown anything that explains single-threaded performance; rather, you have forced a situation where you have 4 available contexts: in one case 4 threads are scheduled across 2 modules (a sharing situation), and in the other 4 threads are spread over 4 modules (a non-shared situation). What you are showing is the performance hit taken when resources in the front end are shared (e.g. cache, TLBs, BTBs, etc.). The results are exactly what I would have expected.

    What is more interesting about your data is that you can now check AMD's claim of 180%, i.e. that 1.8 scaling factor of a module vs two distinct cores.

    Here is an experiment to do... repeat this, but with just one application (Fritz Chess), because this app allows you to specify the number of threads spawned.

    Do it for 1, 2, 3, and 4 threads. Then turn everything on (all modules, clusters, cores, whatever you want to call them), and do the same runs for 1, 2, 3, 4, 5, 6, 7 and 8 threads, and plot the scaling vs thread count for the three different configurations.

    You will find that the fritz run for 1 thread will not be different regardless of how you configure the modules, clusters, cores (again, call them whatever you want to call them).

    jack
    Jack, hardware.fr already tried something along those lines with 4m/4t and 2m/4t, both with Turbo on. In the 1st case the maximum turbo for all 4 "threads" is 3.9GHz, since all modules are running. In the second case it's 4.2GHz across 2 modules (4 threads). The % difference in Turbo clock (~7%) is not nearly enough to make up for the sharing losses, as can be seen here:
    http://www.hardware.fr/articles/842-...windows-8.html
    IMG0033836.gif

    4m/4t is 26% faster(!) than 2m/4t at a fixed 3.6GHz and 15% faster when both are running at their maximum allowed Turbo modes. Now comes the power draw story.
    If you look at the power draw you will see the faster config is 20% more power hungry, and I suspect this is the reason why AMD didn't configure the core priorities that way. I think when PD arrives, power draw will go down sufficiently to schedule the threads the faster way and still get good power numbers. Still, with the present BD core, for 20% more power you gain 26% more performance this way, not a bad tradeoff. If GloFo got their act together and made it possible for AMD to produce a 3.6GHz 5-module PD core with this thread affinity capability, this thing could very well be significantly more powerful than Thuban, even in ST at fixed clock, and noticeably more powerful than BD in both ST and MT with Turbo both on and off.
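
    The arithmetic behind those percentages, as a quick sketch that uses only the figures quoted in this post (pairing the 26% fixed-clock gain with the 20% power figure, as the post does; the variable names are just for illustration):

```python
turbo_4m = 3.9   # GHz, max turbo quoted for 4m/4t (all modules running)
turbo_2m = 4.2   # GHz, max turbo quoted for 2m/4t

# Clock advantage of the 2-module config, as a fraction.
clock_gain = turbo_2m / turbo_4m - 1

perf_ratio = 1.26    # 4m/4t vs 2m/4t at fixed 3.6GHz, as quoted
power_ratio = 1.20   # power draw of the faster config, as quoted

# Performance per watt of 4m/4t relative to 2m/4t.
perf_per_watt = perf_ratio / power_ratio

print(f"clock advantage of 2m/4t: {clock_gain:.1%}")                 # 7.7%
print(f"perf/W of 4m/4t relative to 2m/4t: {perf_per_watt:.2f}")     # 1.05
```

    So on these numbers the spread-out config comes out about 5% ahead per watt.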

    By the way,great thread DGLee
