Page 4 of 11
Results 76 to 100 of 267

Thread: AMD FX "Bulldozer" Review - (4) !exclusive! Excuse for 1-Threaded Perf.

  1. #76
    Xtreme 3D Team
    Join Date
    Jan 2009
    Location
    Ohio
    Posts
    8,499
    Quote Originally Posted by pumero View Post
    Windows 7 is already handling things like this for Intel processors with HT, using real cores first and logical cores later.

    However, according to AMD there are situations where you don't even want this behavior.
    Take a look at the first two pictures at THG:
    http://www.tomshardware.co.uk/fx-815...-32295-23.html

    Because of the shared L1-Cache it makes indeed sense that in some cases it can be faster to use the whole module instead of splitting things up and utilize two modules partially. This means that the scheduler has to be more intelligent though, as it's not enough to just assign each new task to a new core like now, instead it must be able to guess which tasks should be grouped to one module and which should be split over two (more) modules.

    I'm no coder but I can imagine easier projects than making the scheduler aware of such a complex problem.
    There is no real or logical core in BD.
    There are clusters, simple as that.
    When you disable a cluster in BIOS, you do the same thing as AMD's diagram.

    AMD's diagram
    Core 0 - shared
    Core 1 - one cluster
    Core 2 - shared
    Core 3 - one cluster (uses all resources for 1 thread)

    What we're doing
    Core 0 - one cluster
    Core 1 - disabled
    Core 2 - one cluster
    Core 3 - disabled

  2. #77
    Registered User
    Join Date
    Oct 2005
    Location
    Austria
    Posts
    68
    I'm aware of that and I never said that it works like that on BD.
    At the moment Windows sees the processor as having 8 real cores and assigns the tasks accordingly but doesn't care (know) about the whole module thing.
    Power Rig: Core i7-5930K, ASRock X99 Extreme6/3.1, 16GB G.Skill DDR4-2400, Asus Strix GTX980 OC
    Time Sink: Core i7-5775C, ASRock Z97E-ITX/ac, 16GB AMD DDR3-2133, Silverstone PT-09 w/ 120W Power Brick
    HTPC: Athlon 5350, ASRock AM1H-ITX, 4GB DDR3, Supermicro SC-101i

  3. #78
    Xtreme Member
    Join Date
    Jan 2004
    Posts
    393
    Quote Originally Posted by informal View Post
    Jack, hardware.fr already tried something along those lines with 4m/4t and 2m/4t, both with Turbo on. In the 1st case the maximum turbo for all 4 "threads" is 3.9GHz since all modules are running. In the second case it's 4.2GHz across 2 modules (4 threads). The % difference in Turbo clock (~7%) is not nearly enough to make up for the sharing losses, as can be seen here:
    http://www.hardware.fr/articles/842-...windows-8.html
    Attachment 121226

    4m/4t is 26% faster(!) than 2m/4t at a fixed 3.6GHz and 15% faster when both are running the maximum Turbo modes allowed. Now comes the power draw story.
    If you look at the power draw you will see the faster config is 20% more power hungry, and I suspect this is the reason why AMD didn't configure the core priorities that way. I think when PD arrives, power draw will go down sufficiently to schedule the threads the faster way and still get good power numbers. Still, with the present BD core, for 20% more power you gain 26% more performance this way, not a bad tradeoff. If GloFo would get their act together and make it possible for AMD to produce a 3.6GHz 5-module PD core with this thread affinity capability, this thing could very well be significantly more powerful than Thuban, even in ST at fixed clock, and noticeably more powerful than BD in both ST and MT with Turbo both on and off.

    By the way, great thread, DGLee

    interesting,

    so on the graphic 4t and 8t are at the default CPU settings (4m/8c), only with the software set to use 4 or 8 threads?

    the increase from the default settings with turbo off, to the 4m/4c was quite good, 13%, with turbo on more like 7-8%?


    there is a lot going on to determine which is the optimal performance/power usage in how you handle the modules, threads, turbo...
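To put rough numbers on that tradeoff, here is a small sketch in Python that uses only the figures quoted above (3.9GHz vs 4.2GHz turbo, and 26% more performance for 20% more power at fixed 3.6GHz). The function names are made up for illustration, and the arithmetic assumes the quoted percentages are exact.

```python
def relative_gain(new: float, old: float) -> float:
    """Fractional gain of `new` over `old` (0.10 means 10% higher)."""
    return new / old - 1.0


def perf_per_watt_gain(perf_gain: float, power_gain: float) -> float:
    """Change in performance per watt for fractional perf and power increases."""
    return (1.0 + perf_gain) / (1.0 + power_gain) - 1.0


if __name__ == "__main__":
    # Turbo clock advantage of the 2m/4t setup (4.2GHz) over 4m/4t (3.9GHz).
    print(f"turbo clock advantage: {relative_gain(4.2, 3.9):.1%}")
    # 26% more performance for 20% more power, as quoted for fixed 3.6GHz.
    print(f"perf/W change: {perf_per_watt_gain(0.26, 0.20):.1%}")
```

Under those assumptions the clock advantage works out to about 7.7%, and spreading the threads over four modules still comes out roughly 5% ahead in performance per watt despite the higher power draw.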

  4. #79
    I am Xtreme
    Join Date
    Jul 2007
    Location
    Austria
    Posts
    5,485
    Quote Originally Posted by informal View Post
    Jack, hardware.fr already tried something along those lines with 4m/4t and 2m/4t, both with Turbo on. In the 1st case the maximum turbo for all 4 "threads" is 3.9GHz since all modules are running. In the second case it's 4.2GHz across 2 modules (4 threads). The % difference in Turbo clock (~7%) is not nearly enough to make up for the sharing losses, as can be seen here:
    http://www.hardware.fr/articles/842-...windows-8.html
    Attachment 121226

    4m/4t is 26% faster(!) than 2m/4t at a fixed 3.6GHz and 15% faster when both are running the maximum Turbo modes allowed. Now comes the power draw story.
    If you look at the power draw you will see the faster config is 20% more power hungry, and I suspect this is the reason why AMD didn't configure the core priorities that way. I think when PD arrives, power draw will go down sufficiently to schedule the threads the faster way and still get good power numbers. Still, with the present BD core, for 20% more power you gain 26% more performance this way, not a bad tradeoff. If GloFo would get their act together and make it possible for AMD to produce a 3.6GHz 5-module PD core with this thread affinity capability, this thing could very well be significantly more powerful than Thuban, even in ST at fixed clock, and noticeably more powerful than BD in both ST and MT with Turbo both on and off.

    By the way, great thread, DGLee
    Yes, 4C/4M is 26% faster than 4C/2M, but I guess Jack meant what happens when you compare fixed threads (aka 4C/4M) with OS scheduling. And when I look at your posted graph, it's actually not that bad. I assume that 4T means 4 threads on the fully enabled 8-core CPU.

    So that means you lose 7-13% compared to the ideal condition where each module runs only one thread. Not much of a difference; it's surprisingly close to the numbers AMD posted with the improved Win8 scheduler.

    Plus, that doesn't show how well one module compares to one P2 core.

  5. #80
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    P2 scales almost perfectly to 2 threads in most workloads. Bulldozer doesn't. My point is that many workloads in benchmark suites are lightly threaded, and this 4m/4t mode, if it could be switched on by a patch (or a special BIOS hook via AOD, for instance), could raise the average score Zambezi gets by roughly 10-12%. This would easily put Zambezi over Thuban in almost all workloads (1-8 threads, Turbo enabled of course), since you would now have 15-20% better results than before in 1-4 thread workloads.

    As for single-core performance, maybe the OP can run some tests with affinity set to only one core via Task Manager. We can then compare the results with Phenom II.
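For anyone who wants to script such affinity tests instead of clicking through Task Manager, here is a minimal Python sketch that builds the affinity bitmasks. It assumes the OS numbers the two cores of each module consecutively (0-1, 2-3, ...), which this thread does not confirm, and the function names are hypothetical.

```python
def single_core_mask(core: int) -> int:
    """Bitmask that pins a process to exactly one core."""
    return 1 << core


def one_core_per_module_mask(modules: int = 4, cores_per_module: int = 2) -> int:
    """Bitmask selecting only the first core of each module.

    Assumption: sibling cores are numbered consecutively (0-1, 2-3, ...),
    so the first core of module m is core m * cores_per_module.
    """
    mask = 0
    for module in range(modules):
        mask |= 1 << (module * cores_per_module)
    return mask


if __name__ == "__main__":
    print(hex(single_core_mask(0)))          # one core only
    print(hex(one_core_per_module_mask()))   # cores 0, 2, 4 and 6
```

On Windows the resulting hex value can be passed to `start /affinity`, for example `start /affinity 55 benchmark.exe` for the one-core-per-module case (the executable name is a placeholder). This only approximates the BIOS 4m/4t setting, since the second core in each module stays enabled and other processes can still land on it.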

  6. #81
    I am Xtreme
    Join Date
    Jul 2007
    Location
    Austria
    Posts
    5,485
    You only get away with that if you get higher clocks, and even then it's hard... a 3.3GHz Phenom beats a 3.9GHz BD in single-threaded apps... and only pulls slightly ahead in MT apps due to the better Turbo implementation.

    Just look at these 2 pages of the AnandTech review:
    http://www.anandtech.com/show/4955/t...x8150-tested/4
    http://www.anandtech.com/show/4955/t...x8150-tested/7

    If they can't increase the ST performance above P2, or provide the clock speed to outperform P2 in ST and maintain that clock speed across 1-6 threads, they're going to lose to P2.

    Sure, it depends on the apps, but if we go by the CB R11.5 results, BD needs to hold at least 3.9-4.0GHz constantly up to 6 threads to be faster than a P2. It doesn't matter if you pin the threads to specific modules: when one module is already slower than one core of P2, you can't expect that 2 slower modules will beat 2 faster cores...
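The clock-versus-IPC argument can be made concrete with a little arithmetic. The Python sketch below uses the 3.3GHz and 3.9GHz figures from this post and assumes performance scales linearly with clock, which is a simplification. The 0.80 per-clock ratio in the last call is a made-up illustration, not a measured value, and the function names are hypothetical.

```python
def clock_advantage(fast_ghz: float, slow_ghz: float) -> float:
    """Fractional clock advantage of the faster-clocked chip."""
    return fast_ghz / slow_ghz - 1.0


def per_clock_ceiling(winner_ghz: float, loser_ghz: float) -> float:
    """Upper bound on the loser's per-clock performance relative to the winner.

    If the lower-clocked chip still wins, the higher-clocked chip must do
    less than winner_ghz / loser_ghz of the work per clock.
    """
    return winner_ghz / loser_ghz


def break_even_clock(rival_ghz: float, per_clock_ratio: float) -> float:
    """Clock needed to match a rival, given per-clock performance relative to it."""
    return rival_ghz / per_clock_ratio


if __name__ == "__main__":
    print(f"BD clock advantage: {clock_advantage(3.9, 3.3):.1%}")
    print(f"BD per-clock ceiling vs P2: {per_clock_ceiling(3.3, 3.9):.1%}")
    # Hypothetical: a chip doing 80% of P2's work per clock.
    print(f"break-even clock: {break_even_clock(3.3, 0.80):.3f} GHz")
```

So a 3.9GHz part has about an 18% clock advantage over a 3.3GHz one, and if it still loses, its per-clock throughput in that app is below roughly 85% of the rival's.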

  7. #82
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    Cinebench is only one test... Why are you using it as the end-all, be-all scenario?

  8. #83
    I am Xtreme
    Join Date
    Jul 2007
    Location
    Austria
    Posts
    5,485
    No, read what I wrote... I said it depends on the apps. There are apps where BD is very close to or even faster than P2 in ST, wins there due to its clock speed or core advantage, and naturally also wins in MT. The problem is the situations where ST performance lags too far behind to be fixed by the clock speed (a gap larger than 18%), and it gets painfully obvious if the workload is only "slightly" threaded (fewer than 4 or 6 threads, depending on which P2 you compare it to).

    There are hardly any reviews out there that use e.g. LAME or iTunes in their test suite anymore, but there are still millions of people out there who use these apps (LAME is 2T afair, iTunes is even only ST).

  9. #84
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    Yeah, IPC needs to be improved, that is for sure. I hope they hit the 10% listed for Piledriver on the FX Next slide. There are definitely some bottlenecks, especially for legacy (old) code, primarily in the front end and L1 cache. If they manage to fix some of these issues and if GloFo improves their 32nm process, we would welcome Piledriver with more favorable reviews.

  10. #85
    Xtreme Mentor
    Join Date
    Feb 2009
    Location
    Bangkok,Thailand (DamHot)
    Posts
    2,693
    fun core disable
    Intel Core i5 6600K + ASRock Z170 OC Formula + Galax HOF 4000 (8GBx2) + Antec 1200W OC Version
    EK SupremeHF + BlackIce GTX360 + Swiftech 655 + XSPC ResTop
    Macbook Pro 15" Late 2011 (i7 2760QM + HD 6770M)
    Samsung Galaxy Note 10.1 (2014) , Huawei Nexus 6P
    [history system]80286 80386 80486 Cyrix K5 Pentium133 Pentium II Duron1G Athlon1G E2180 E3300 E5300 E7200 E8200 E8400 E8500 E8600 Q9550 QX6800 X3-720BE i7-920 i3-530 i5-750 Semp140@x2 955BE X4-B55 Q6600 i5-2500K i7-2600K X4-B60 X6-1055T FX-8120 i7-4790K

  11. #86
    Xtreme Enthusiast
    Join Date
    Sep 2006
    Posts
    881
    Good work. Let's hope AMD will be able to correct this.
    99.999999% of the world listens to autotuned pop singers. If you are part of the 0.000001% who listen to the future goddess of music Hatsune Miku, copy and paste this into your sig.

  12. #87
    Xtreme Mentor
    Join Date
    May 2008
    Location
    cleveland ohio
    Posts
    2,879
    http://vr-zone.com/articles/amd-fx-8...-/13694-6.html

    the advance shows something.

    Is Sandy Bridge really 25-30% faster than Nehalem?

    I've seen an i7 875K score 1.21 in single thread with turbo at 3.6GHz, vs the 2600K at 3.8GHz, which scores 1.52.
    HAVE NO FEAR!
    "AMD fallen angel"
    Quote Originally Posted by Gamekiller View Post
    You didn't get the memo? 1 hour 'Fugger time' is equal to 12 hours of regular time.

  13. #88
    Xtreme Legend
    Join Date
    Nov 2003
    Location
    Helsinki, Finland
    Posts
    1,692
    I did some testing with the "native" 4CU/4C configuration a while ago, and saw a significant performance increase over the 2CU/4C configuration.
    When the compute units are set to single-core mode, the BIOS versions available back then did not change the per-core L3 partition/allocation size accordingly (from 1MB to 2MB). I will retest this as soon as the new BIOS I've requested is available.

  14. #89
    Xtreme 3D Team
    Join Date
    Jan 2009
    Location
    Ohio
    Posts
    8,499
    Quote Originally Posted by informal View Post
    Yeah, IPC needs to be improved, that is for sure. I hope they hit the 10% listed for Piledriver on the FX Next slide. There are definitely some bottlenecks, especially for legacy (old) code, primarily in the front end and L1 cache. If they manage to fix some of these issues and if GloFo improves their 32nm process, we would welcome Piledriver with more favorable reviews.
    I am wondering if that 10% increase comes from a 1.25% increase in performance over each thread (The way AMD has been marketing, I wouldn't doubt it at all)

    AMD needs a 45-50% gain to compete with intel in single thread, at the same clock scaling and unless Intel designs Netburst 2 it will only get worse.
    The IPC vs core speed gap went from 25% and 15% clocks to 50% and 5% clocks...not what I was expecting at least.

    I spoke for so long about how competitive STARS is from a performance per mm2 perspective vs Nehalem...and to be honest, it is also very competitive from a performance per watt perspective as well. They should have shrunk X6, adopted the tweaked and refined Llano "STARS" core with its 5% IPC boost, doubled the L2 cache (to give 1MB per core), doubled L3 to 12MB...refined the memory controller like they did with BD...increased core speed to 3.7/3.9 Turbo/4.2 ST Turbo or so.

    That would be a competitive CPU, would it not?
    Last edited by BeepBeep2; 10-13-2011 at 05:57 PM.

  15. #90
    Xtreme Mentor
    Join Date
    May 2008
    Location
    cleveland ohio
    Posts
    2,879
    Quote Originally Posted by BeepBeep2 View Post
    I am wondering if that 10% increase comes from a 1.25% increase in performance over each thread (The way AMD has been marketing, I wouldn't doubt it at all)

    AMD needs a 45-50% gain to compete with intel in single thread, at the same clock scaling and unless Intel designs Netburst 2 it will only get worse.
    The IPC vs core speed gap went from 25% and 15% clocks to 50% and 5% clocks...not what I was expecting at least.

    I spoke for so long about how competitive STARS is from a performance per mm2 perspective vs Nehalem...and to be honest, it is also very competitive from a performance per watt perspective as well. They should have shrunk X6, adopted the tweaked and refined Llano "STARS" core with its 5% IPC boost, doubled the L2 cache (to give 1MB per core), doubled L3 to 12MB...refined the memory controller like they did with BD...increased core speed to 3.7/3.9 Turbo/4.2 ST Turbo or so.

    That would be a competitive CPU, would it not?
    You don't need the L3 cache at all then. It usually only adds a 2-5% increase.
    HAVE NO FEAR!
    "AMD fallen angel"
    Quote Originally Posted by Gamekiller View Post
    You didn't get the memo? 1 hour 'Fugger time' is equal to 12 hours of regular time.

  16. #91
    Xtreme 3D Team
    Join Date
    Jan 2009
    Location
    Ohio
    Posts
    8,499
    Quote Originally Posted by demonkevy666 View Post
    You don't need the L3 cache at all then. It usually only adds a 2-5% increase.
    5% IPC from refined core, 2-5% from L3, 2-5% from L2...
    Percentages start adding up after a while...best case scenario that is 15% IPC...
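As a quick check on how those percentages stack, here is a short Python sketch. It assumes the three gains are independent and multiply together, which is an idealization (real gains from cache and core tweaks can overlap), and `stacked_gain` is a made-up helper name.

```python
from math import prod


def stacked_gain(gains):
    """Combined fractional gain when independent gains multiply together."""
    return prod(1.0 + g for g in gains) - 1.0


if __name__ == "__main__":
    best = stacked_gain([0.05, 0.05, 0.05])   # refined core, L3, L2 at the top of the range
    worst = stacked_gain([0.05, 0.02, 0.02])  # refined core, L3, L2 at the bottom of the range
    print(f"best case: {best:.1%}, worst case: {worst:.1%}")
```

Compounded, the range comes out at roughly 9.2% to 15.8%, a touch above the simple sums of 9% and 15%.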

    If they could have gotten 6 STARS cores to do 4.6 GHz OC'ed (so a 400 MHz increase) while gaining ~10% IPC through Llano's refined core + 2x L2 and L3...

    I wouldn't doubt for a second it would be praised.
    Last edited by BeepBeep2; 10-13-2011 at 06:53 PM.

  17. #92
    Xtreme Enthusiast
    Join Date
    Feb 2008
    Location
    NYC
    Posts
    567
    wrong thread. sorry.
    Last edited by zoson; 10-13-2011 at 07:15 PM.
    Core i7 990x @ 4665MHz 30x155.5 | ASUS Rampage 3 Extreme 1601 Modded BIOS | 24GB (6x4GB) Mushkin Redline 999057 @ 1866MHz 8-8-8-24-1T
    2x MSI N770-2GD5/OC SLI Custom BIOS @ 1228/7464 | Samsung 840 EVO 1TB | 4x 3TB WD Red Raid 5 | Corsair RM1000 | 2x Dell SP2309W 2048x1152
    H2O Cooled | EK - Supreme HF Full Gold - FB RE3 | Swiftech - MCP35x2 - MCRes Micro v2 | HWLabs - 2x GTX 120 - GT Stealth 120
    7x Gentle Typhoon AP-0A 2150RPM | 1x Enermax Magma UC-MA12 1500RPM | Lian Li PC-A10B | 5GHz Gulftown

  18. #93
    Registered User
    Join Date
    Dec 2008
    Posts
    12
    Would it be possible to test an intermediate situation: 6C/6T or, if you prefer, 6C/6M? So enable all clusters in 2 modules and half the clusters in the other 2 modules. It would be nice to see the impact of this, so you can compare between the 4C/2M, 4C/4M, etc. options and see how that scales...

  19. #94
    Xtreme Legend
    Join Date
    Nov 2003
    Location
    Helsinki, Finland
    Posts
    1,692
    Quote Originally Posted by The PyroPath View Post
    Would it be possible to test an intermediate situation: 6C/6T or, if you prefer, 6C/6M? So enable all clusters in 2 modules and half the clusters in the other 2 modules. It would be nice to see the impact of this, so you can compare between the 4C/2M, 4C/4M, etc. options and see how that scales...
    Yes..
    The only trouble is that there are physically only four compute units ("modules") in a Zambezi node.
    We would need to have a Valencia for this test.

  20. #95
    Xtreme Member
    Join Date
    Jan 2007
    Location
    Argentina
    Posts
    412
    Quote Originally Posted by BeepBeep2 View Post
    I am wondering if that 10% increase comes from a 1.25% increase in performance over each thread (The way AMD has been marketing, I wouldn't doubt it at all)

    AMD needs a 45-50% gain to compete with intel in single thread, at the same clock scaling and unless Intel designs Netburst 2 it will only get worse.
    The IPC vs core speed gap went from 25% and 15% clocks to 50% and 5% clocks...not what I was expecting at least.

    I spoke for so long about how competitive STARS is from a performance per mm2 perspective vs Nehalem...and to be honest, it is also very competitive from a performance per watt perspective as well. They should have shrunk X6, adopted the tweaked and refined Llano "STARS" core with its 5% IPC boost, doubled the L2 cache (to give 1MB per core), doubled L3 to 12MB...refined the memory controller like they did with BD...increased core speed to 3.7/3.9 Turbo/4.2 ST Turbo or so.

    That would be a competitive CPU, would it not?

    Completely agree. I said they should have come with something like "Phenom II X8", but you are right, even a Stars X6 at 32nm HKMG, 3.6/4.2Ghz with improved Memory Controller would be nicer than this, and I would buy it.

    In this scenario I'm waiting for i7-3820 to decide between that and 2600K...
    Main: Windows 10 Core i7 5820K @ 4500Mhz, Corsair H100i, 32GB DDR4-2800, eVGA GTX980 Ti, Kingston SSDNow 240GB, Crucial C300 64GB Cache + WD 1.5TB Green, Asus X99-A/USB3.1
    ESXi Server 6.5 Xeon E5 2670, 64GB DDR3-1600, 1TB, Intel DX79SR, 4xIntel 1Gbps
    ESXi Server 6.0 Xeon E5 2650L v3, 64GB DDR4-2400, 1TB, Asrock X99 Xtreme4, 4xIntel 1Gbps
    FreeNAS 9.10 x64 Xeon X3430 , 32GB DDR3-1600, 3x(3x1TB) WD Blue, Intel S3420GPRX, 4xIntel 1Gbps

  21. #96
    Registered User
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    63
    How about with the 4 modules: can you disable half the clusters in 2 of the modules? Then could you make the two modules with half their clusters disabled the first 2 primary cores for ST and lightly MT apps?


    Boost ST and lightly threaded apps while maintaining good heavily threaded performance, maybe?


    The Stilt, in this combo, could it change the allocated cache on the fly, so that with only 2 threads it could double it?

  22. #97
    Registered User
    Join Date
    Dec 2008
    Posts
    12
    Quote Originally Posted by EvilBlitz View Post
    How about with the 4 modules: can you disable half the clusters in 2 of the modules? Then could you make the two modules with half their clusters disabled the first 2 primary cores for ST and lightly MT apps?


    Boost ST and lightly threaded apps while maintaining good heavily threaded performance, maybe?


    The Stilt, in this combo, could it change the allocated cache on the fly, so that with only 2 threads it could double it?
    That was kind of what I meant...

  23. #98
    Xtreme Cruncher
    Join Date
    Apr 2008
    Location
    Ohio
    Posts
    3,119
    With AOD, you can use profiles to set single-threaded applications to run only on a specific core, then run a series of tests and see how they do with the said core configurations in the BIOS.
    Last edited by charged3800z24; 10-14-2011 at 11:43 AM.
    ~1~
    AMD Ryzen 9 3900X
    GigaByte X570 AORUS LITE
    Trident-Z 3200 CL14 16GB
    AMD Radeon VII
    ~2~
    AMD Ryzen ThreadRipper 2950x
    Asus Prime X399-A
    GSkill Flare-X 3200mhz, CAS14, 64GB
    AMD RX 5700 XT

  24. #99
    Xtreme Member
    Join Date
    Nov 2007
    Posts
    103
    Quote Originally Posted by pumero View Post
    Windows 7 is already handling things like this for Intel processors with HT, using real cores first and logical cores later.
    That's fine (except your wording is inaccurate). Somehow we should trick it into using this method for BD as well...

    However, according to AMD there are situations where you don't even want this behavior.
    It depends on whether the penalty of forcing those closely related threads to communicate through L3 (instead of L2) is more or less than the gain from not sharing resources. It seems most applications only benefit from it:

    img0033832.gif

    So, there could be a little patch that simply enables SMT-style scheduling in Win7, which it already supports (if true)...

    Quoted from the article:
    According to AMD, Windows 8 will more intelligently align threads so that, when they can benefit from sharing a module, they will. The implication is that when two threads can be consolidated onto one module (despite the fact that they’re forced to share resources), putting an entire module to sleep and potentially enabling a higher p-state (a faster Turbo Core setting) outweighs any performance penalty tied to sharing.
    And so the default behaviour will be separation (contrary to what JF said all along)? It would be just stupid if not... Of course, power consumption is higher because more modules are active, but here we can also see that with turbo enabled the energy efficiency is really the same...

    Well, unless there is a fix coming (HW or SW or both) that largely reduces the penalty of sharing resources. The current numbers are much worse (anywhere between 95% and 160%, with one case of 180%) than what they've advertised (180% across the board), so one can think there is some flaw somewhere here as well. (And there is indeed the case of L1D thrashing, which they claim is responsible for only a 3% decrease.)

    Quote Originally Posted by BeepBeep2 View Post
    When you disable a cluster in BIOS, you do the same thing as AMD's diagram.
    What diagram? Do you mean this? Which part of it?

    What we're doing
    Core 0 - one cluster
    Core 1 - disabled
    Core 2 - one cluster
    Core 3 - disabled
    Do you mean, if we disable every other "core" in the BIOS? Then no, you will get this:
    Core (Module) 0 - one cluster
    Core (Module) 1 - one cluster
    Core (Module) 2 - one cluster
    Core (Module) 3 - one cluster

    ps. perhaps the title of the thread should be changed to "Thread separation vs. turbo", or something like that, to be more meaningful.
    Last edited by dess; 10-14-2011 at 05:40 PM.

  25. #100
    Xtreme Legend
    Join Date
    Nov 2003
    Location
    Helsinki, Finland
    Posts
    1,692
    Indeed, there should be an update coming for Windows which optimizes Turbo Core functionality on Zambezi.
    Currently Windows (7 at least) is throwing the load from core to core, which sometimes neutralizes the effect of the Turbo Core feature.
    This is because the load is not being run on the currently boosted core(s).

    But guys... please...

    When talking about Zambezi please use the correct terms to avoid any further confusion.

    A Zambezi node consists of: Four compute units and eight cores.
    Each compute unit contains two cores.

    In some of the slides a compute unit was called a module; however, that's not the official term.
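With that terminology, the layout can be written down as a small Python sketch: four compute units, two cores each. It assumes core numbers map to compute units in consecutive pairs (cores 0-1 in unit 0, cores 2-3 in unit 1, and so on), which is an assumption and not something stated in this thread, and the helper names are hypothetical.

```python
from collections import Counter

CORES_PER_UNIT = 2  # each compute unit contains two cores


def compute_unit_of(core: int) -> int:
    """Compute unit a core belongs to, assuming consecutive pairing."""
    return core // CORES_PER_UNIT


def describe_placement(busy_cores):
    """Count active compute units and how many of them run two threads."""
    load = Counter(compute_unit_of(core) for core in busy_cores)
    return {
        "active_units": len(load),
        "shared_units": sum(1 for threads in load.values() if threads > 1),
    }


if __name__ == "__main__":
    print(describe_placement([0, 2, 4, 6]))  # 4 threads, one per compute unit
    print(describe_placement([0, 1, 2, 3]))  # 4 threads packed into two compute units
```

The first placement corresponds to the 4CU/4C case discussed in this thread (four active units, none shared), the second to 2CU/4C (two active units, both shared).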
