Actually, Intel does a separate SRAM cell design for their L3 caches that's much denser. AMD simply re-uses the SRAM cells from its L2 design for the L3.
Guys, 2B never made sense in the first place when you do the rough sums; 1.2B sounds closer, but still a bit too low IMO:
These figures may be slightly off, but they're close enough to give an idea of how wrong 2B sounds.
4-core Deneb:
6MB L3: 458M
2MB L2: 152M
4 cores: 140M
CPU-NB + misc: ~8M
Total: ~758M
6-core Thuban:
6MB L3: 458M
3MB L2: 228M
6 cores: 210M
CPU-NB + misc: ~8M
Total: ~904M
4-module Bulldozer:
Module transistor count based on AMD's pre-release slide stating 268M transistors for one module including its 2MB L2 cache.
8MB L3 cache: ~610M
8MB L2 cache: ~610M
4 modules: ~240M (at ~60M each)
CPU-NB + misc: ~8M
Total: ~1.46B
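If anyone wants to sanity-check those rough sums, here's a minimal sketch (the per-block figures are just the estimates listed above, in millions of transistors, not official numbers):
Code:
# Rough transistor-count sanity check using the per-block estimates above
# (all values are rough guesses in millions of transistors, not official figures).
deneb     = {"6MB L3": 458, "2MB L2": 152, "4 cores": 140, "CPU-NB/misc": 8}
thuban    = {"6MB L3": 458, "3MB L2": 228, "6 cores": 210, "CPU-NB/misc": 8}
bulldozer = {"8MB L3": 610, "8MB L2": 610, "4 modules": 240, "CPU-NB/misc": 8}

for name, blocks in [("Deneb", deneb), ("Thuban", thuban), ("Bulldozer", bulldozer)]:
    print(f"{name}: ~{sum(blocks.values())}M transistors")  # ~758M, ~904M, ~1468M (~1.46B)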
Never, ever use performance slides from the manufacturer in a review... it will mostly backfire on you!!
I wouldn't call that a catastrophe, just horrible performance. AMD needs to abandon this architecture, and fast.
What a rubbish article... The guy acknowledges that it's faster than the 12C MC and Xeon, BUT... he then says it's "not fast enough" since it has 33% more cores and scores a bit lower than that: "only" 27/32% faster in SPECjbb2005/SAP. What happened to Ars Technica? Don't bother with the third page of the "article".
Well, everyone still clings to the "33% more cores, 50% more performance" claim... that was touted all over the internet for months like gospel... and he has a point... How would a K10 with 2 more cores on 32nm have done? Personally, I think not much worse.
This article (Ars Technica) is not bad at all, but it just said:
Did Anandtech ever say this?
Quote:
AMD faces an uphill struggle just to compete with its own old chips—let alone with Intel.
Quote:
So if performance/watt is your first priority, we think the current Xeons are your best option.
From heise.de, or in English:
Quote:
If performance/dollar is your first priority, we think the Opteron 6276 is an attractive alternative.
In Linpack GFLOPS, Opteron 6276 vs Xeon 5680: 205~239 GFLOPS vs 144 GFLOPS.
With AMD's Open64 compiler vs Intel Composer 2011 SP1: an integer comparison of 454 to 349, and 337 to 246 in floating point.
Also 502 MFLOPS/watt (6276) compared with 311 MFLOPS/watt (5680).
The comparison simply shows how FMA can double your FP throughput. FYI, Intel claims AVX-enabled 8-core SB Xeons will get a 2.1x improvement in Linpack over the current high-end Xeons. That would mean ~300 GFLOPS, completely changing the situation.
Never mind the fact that the 6282SE will not be the top model forever. Whenever Intel launches the new 8C SB-E that scores 300 GFLOPS in Linpack, AMD will be refreshing their lineup. We can expect a 2.8GHz stock model, so it's roughly 2.8/2.3 = 1.21, or 21% faster than what the 6276 gets in Linpack (around 289 GFLOPS). That is just a tad (~3%) behind Intel's projected performance with AVX enabled on their highest(?)-end model. The price difference between the two chips will be huge, though.
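Just to make that arithmetic explicit, here's a tiny sketch of the naive linear scaling behind those numbers (the 239 and 144 GFLOPS figures and the 2.1x AVX claim are taken from the posts above; the 2.8GHz part is purely hypothetical):
Code:
# Naive linear frequency scaling of the quoted Linpack figures.
opteron_6276_gflops = 239      # best Linpack figure quoted for the 2.3GHz 6276
xeon_5680_gflops    = 144      # Linpack figure quoted for the Xeon 5680

scaled_2_8ghz   = opteron_6276_gflops * 2.8 / 2.3   # hypothetical 2.8GHz refresh
intel_avx_claim = xeon_5680_gflops * 2.1            # Intel's claimed 2.1x gain with AVX

print(f"2.8GHz Opteron estimate:   ~{scaled_2_8ghz:.0f} GFLOPS")    # ~291
print(f"8C SB Xeon with AVX claim: ~{intel_avx_claim:.0f} GFLOPS")  # ~302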
You assume the process will improve significantly in 2-3 months. The 6282SE is a 140W chip; pumping the stock frequency up another 200MHz could be an issue without a new stepping.
Intel was never top dog in Linpack. MC pushed a lot more GFLOPS at a significantly lower $/GFLOPS. Looking at the HPC wins, I'd say price is less of a factor than assumed, otherwise Xeon wouldn't dominate. It would be interesting to see how 16 really fat SB cores (assuming 2P nodes) will do compared with 32 skinnier BD cores in HPC codes (except Linpack, which is a best case for both).
Quote:
We can expect a 2.8GHz stock model, so it's roughly 2.8/2.3 = 1.21, or 21% faster than what the 6276 gets in Linpack (around 289 GFLOPS). That is just a tad (~3%) behind Intel's projected performance with AVX enabled on their highest(?)-end model. The price difference between the two chips will be huge, though.
Well, the guy who knows about GloFo stuff (rich_wargo @ the SA forum) hints at an improved process node in Q1. So maybe they will fix the yield and clock/power issues that obviously plague both Llano and Bulldozer. They managed to launch a 16C/8M 2.6GHz chip within the max TDP bracket on G34, on this crappy process. So I expect another speed bump in Q1. 100MHz is too low for a speed bump, so the next step is 2.8GHz. This chip would put AMD in a good position in SPEC rate tests (both integer and FP throughput). It would be a good duel to watch in HPC workloads: 4P 8C SB-EP @ 3GHz @ 150W vs 2.8GHz 8M/16C Opteron @ 140W.
C'mon, rich knows nada. And I doubt the process is solely to blame. BD is massive, and its high-speed nature could mean it's just Prescott reloaded: no matter how good the process is/was, it can't make BD/Prescott shine. Intel's 90nm was outstanding by any metric, and Dothan fully showed that. However, that couldn't save Prescott's bacon. I have the impression something similar is going on here: the process is reasonably OK, yields are poorer than planned due to intrinsic things like gate-first, BUT BD and Llano aren't first-class engineering jobs.
And with the relationship getting really sour, GF probably doesn't give a damn about AMD's issues with 32nm and simply waits for the pay-only-for-good-die deal to end. GF is taking huge losses, and part of the blame lies with the design, which they have no influence over.
And their other customers care more about 28nm bulk than 32nm SOI HKMG. The last yield figures put 28nm at 1-2 good dies per wafer. They must be dancing in the aisles at GF.
Edit: just found something to reinforce my point that the process is acceptable:
http://www.eetimes.com/electronics-n...benefits-TSMC-
Quote:
Meanwhile, Globalfoundries said it would not comment on its customer's foundry selection process or on their products unless they did so first. The spokesman also said problems with Llano had been specific to that product and that yields for AMD's 32/28nm Bulldozer products were on target and not affecting AMD's ability to meet customer commitments.
“We are still the only foundry producing HKMG products that can be purchased in stores now,” the Globalfoundries spokesman said, noting that the fab expected to ship “far more” HKMG volume in 2011 than all other foundries combined.
There are some serious issues with the process. Anand mentioned it, along with a few other informed people. Yes, maybe the engineering has some issues as well, since they tried something very different, so hopefully it will be fixed in later revisions. But the process definitely is not "reasonably OK".
The NetBurst-based Prescott design used a lot of high-speed dynamic logic, which is not only faster (as required for an aggressive 8 FO4 (IIRC) frequency goal) but uses much more power and more transistors. BD is a static CMOS design, using faster logic styles only for individual speed paths.
A look into the BD/Llano ISSCC papers (incl. the L3 shmoo plot) should indicate how they expected the designs to behave on the 32nm process.
There are many areas on the die which seem to be empty and might just contain wires and repeaters.
And as already said, there are different types of transistors with different specs and sizes.
IIRC, Llano contains ~1B transistors.
AMD also works with macro blocks containing specific logic circuits. These might cause slightly less efficient placement while being size-optimized in themselves.
It's official now. AMD contacted AT:
http://www.anandtech.com/show/5176/a...unt-12b-not-2b
Quote:
This is a bit unusual. I got an email from AMD PR this week asking me to correct the Bulldozer transistor count in our Sandy Bridge E review. The incorrect number, provided to me (and other reviewers) by AMD PR around 3 months ago was 2 billion transistors. The actual transistor count for Bulldozer is apparently 1.2 billion transistors...
Well, it kinda was, but the website that claimed they were contacted by AMD never really posted what AMD said. Apparently AMD contacted several websites, and AT was the only one to post anything substantial.
The funny thing is we still didn't get an explanation for the 2B figure...
@ Lightman
That was funny indeed :D
http://www.anandtech.com/show/5176/a...unt-12b-not-2b
Quote:
This is a bit unusual. I got an email from AMD PR this week asking me to correct the Bulldozer transistor count in our Sandy Bridge E review. The incorrect number, provided to me (and other reviewers) by AMD PR around 3 months ago was 2 billion transistors. The actual transistor count for Bulldozer is apparently 1.2 billion transistors. I don't have an explanation as to why the original number was wrong, just that the new number has been triple checked by my contact and is indeed right. The total die area for a 4-module/8-core Bulldozer remains correct at 315mm2.
The only thing with the number of transistors that bugs me is why it took them so long to realise they'd given out bad information. I mean seriously, is their QA so bad that even their counting/marketing is faulty?
The people in a big company might do very good work, but this might get lost or reduced in quality due to processes and low-quality decisions further up in the hierarchy.
I saw a review where the reviewer had a (retail) FX that had 24MB of cache rather than 16MB. Could another 8MB of L3 be those missing transistors? I wouldn't know, so just asking the question.
Could it also be that the server variants have the full complement of cache (since their clocks are lower, it fits the TDP), but the retail parts cannot fit inside the TDP with all that cache, so some of it has been disabled?
The transistors might still be there but just not used in retail parts.
The marketing dept then failed to rationalise the server transistor count (used) against the retail transistor count (used), which then compounded the 'so many transistors for so little performance' argument? If so, no wonder they got culled.
Just some musings from me here.
BSN* is reporting that AMD's CFO Thomas Seifert was allegedly let go yesterday... If true, then it's one of those management decisions that make zero sense (like Killebrew, Moorhead, David Hoff, Rick Bergman, Dirk Meyer, etc.). AMD's debt right now is around $1.8B, while back in 2009, when he was appointed to the position, it was ~$7B.
Since Interlagos uses two of the same dies as the desktop variant and the die shot has been shown to the public, there would have been comments if there were more than the known 8MB of L3.
BTW, the transistor density of the modules (213M on 30.9 sqmm) is rather high (6.89M/sqmm) and the 1.2B number also seems to be a bit too low.
The internet is full of people (besides google cache) creating quick copies:
http://investorvillage.com/smbd.asp?...g&mid=11217202
Since AMD did not want to submit SPECint/SPECfp scores, Intel did it instead, just to rub salt in the wounds:
SPECint_base2006/SPECfp_base2006 (autoparallel=yes)
i7-2700k (3.5/3.9 GHz) 45.5 / 56.1
FX-8150 (3.6/4.2 GHz) 20.8 / 25.7
X6-1100T (3.3/3.7 GHz) 25.0 / 32.2
http://www.spec.org/cpu2006/results/res2011q4/
In the most widely used and accepted industry-standard benchmark, this clearly shows how BD has significantly lower IPC than K10.5 despite a 300-500MHz clock advantage.
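To put a rough number on "significantly lower IPC", here's a per-clock comparison of the scores above (assuming the largely single-threaded runs sit near the listed turbo clocks; that assumption is mine, not from the submissions):
Code:
# Rough per-GHz comparison of the SPECint_base2006 scores quoted above,
# assuming single-threaded subtests run near the listed turbo clocks.
scores = {                      # (SPECint_base2006, assumed turbo GHz)
    "FX-8150":  (20.8, 4.2),
    "X6-1100T": (25.0, 3.7),
}
for cpu, (score, ghz) in scores.items():
    print(f"{cpu}: {score / ghz:.2f} points/GHz")
# The FX-8150 ends up roughly 25-30% behind the X6 per clock on these numbers.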
wowa.. I love bulldozers and this cpu is giving them a bad name, also making it harder for me to find awesome bulldozer media on the internet :p
Autoparallel FAIL for ICC, actually. It doesn't seem to work well for more than 4 cores (the i7-3960X loses significantly to the i7-2700K in lots of tests). They definitely should have run it with OMP_NUM_THREADS=4 for the FX-8150 and set core affinity to the "first core in each module". But it is Intel, and I don't think they have any intention of making AMD's processor look good.
And comparing SSE3 code for AMD and AVX code for Intel is totally irrelevant.
It is SPECint, more or less single-threaded. Autoparallel cannot offer significant speedups.
Why? AVX and SSE have the same throughput on BD since it did not spend any transistors to optimize for AVX. The only question is whether using FMA would have made a difference.
Quote:
And comparing SSE3 code for AMD and AVX code for Intel is totally irrelevant.
It offers both speedups (e.g. libquantum) and slowdowns (e.g. h264ref), depending on the subtest and hardware. To limit slowdowns, Intel set the number of threads to the number of cores. For example, the slowdown of the 3960X compared to the 2700K in h264ref perfectly correlates with the fact that the FX-8150 loses the most in this subtest compared to the 1100T.
What? You know there is a reason for 3-operand instructions. If you don't understand it from a developer's viewpoint, at least you can rely on tests comparing SSE and AVX versions of x264 or other software.
And besides this, ICC doesn't allow SSE instructions above SSE3 on AMD hardware, so it is AVX (which is a full superset of the SSE instructions) for Intel vs SSE without SSSE3, SSE4.1 and SSE4.2 for AMD.
That's mostly the compilers cracking libquantum (known for a long time). And what's the problem with setting the number of threads to the number of cores? It's the fairest approach.
The difference between 128-bit SSE and 128-bit AVX is extremely limited even on SB. Constantly referring to speed-ups of AVX (256-bit) vs. SSE on SB and extrapolating that to BD is faulty; SB executes 256-bit instructions in a single cycle, BD breaks them into 2x128-bit ones. I'll restate my point: BD's AVX speedups are limited because it wasn't designed to perform, due to time pressure, just to be compatible. The difference between 128-bit SSE and 128/256-bit AVX on BD is going to be in the noise region (actually 256-bit AVX is discouraged since it incurs penalties in the splitting and recombining phases).
Quote:
What? You know there is a reason for 3-operand instructions. If you don't understand it from a developer's viewpoint, at least you can rely on tests comparing SSE and AVX versions of x264 or other software.
And besides this, ICC doesn't allow SSE instructions above SSE3 on AMD hardware, so it is AVX (which is a full superset of the SSE instructions) for Intel vs SSE without SSSE3, SSE4.1 and SSE4.2 for AMD.
There is a risk of slowing an application down if autoparallelization is done on hardware that has a lot of shared resources between cores, due to the overhead that additional threads add. And for the record, I do not blame Intel for running the test that way, as they are not supposed to know how to get the best performance out of AMD's processor. I just say that a better result on the 8150 could have been achieved if autoparallelization were limited to 4 threads and thread affinity distributed across the modules (see the sketch at the end of this post).
You are wrong. The 3-operand SSE5 instructions that later appeared as 3-operand AVX versions of SSE instructions reduce the number of registers needed in code, reduce latencies by removing unnecessary MOVs, and reduce code size. All of this allows higher utilization of the FPU's functional units and increases performance without any need to increase the theoretical throughput of those units. x264 is a pretty good example of such increased performance: vector-integer throughput is unchanged on SB, and yet there is quite a substantial boost in performance on both SB and the FX-8150.
And as I said before, ICC not only doesn't allow building AVX code for BD, it also doesn't allow using a large part of the SSEx instruction set.
If you want to compare single-threaded performance between the FX-8150, i7-2700 and Phenom II X6 using ICC and SPECint, then you should turn off autoparallelization, build for an SSE3 target and then compare the resulting performance. Everything else is just marketing.
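For what it's worth, a minimal sketch of the 4-threads-one-per-module idea from earlier in this post (the benchmark binary name is a placeholder, and the assumption that cores 0, 2, 4 and 6 are the first core of each FX-8150 module is mine):
Code:
# Illustrative only: run an auto-parallelized binary with 4 OpenMP threads,
# confined to one core per Bulldozer module (cores 0,2,4,6 are assumed to be
# the first core of each module on an FX-8150).
import os
import subprocess

env = dict(os.environ, OMP_NUM_THREADS="4")   # cap the auto-parallel thread count

# taskset confines the process (and all of its threads) to the listed cores;
# "./spec_subtest_binary" is just a placeholder name.
subprocess.run(["taskset", "-c", "0,2,4,6", "./spec_subtest_binary"],
               env=env, check=True)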
The Open64 compiler produces up to 25% faster code than Intel's latest version 12 compilers, even though the intentionally crippled results submitted by Intel were run on a ~40% higher-clocked Bulldozer...
Open64 4.2.5.2 Compiler suite: (SPEC results submitted by Dell)
2.6 GHz Bulldozer: SPEC_int_rate 134, SPEC_FP_rate 100
Intel Studio XE 12.0.3.176 compilers: (SPEC results submitted by Intel)
3.6 GHz Bulldozer: SPEC_int_rate 115, SPEC_FP_rate 79.8
http://www.spec.org/cpu2006/results/res2011q4/
Hans
I'm sorry, but I couldn't find your Dell result. It looks to me like you looked for the highest scores and divided by 4 to get the same number of cores. Example:
For FP, you got the 100 from here: http://www.spec.org/cpu2006/results/...107-18771.html
For INT, you got the 134 from here: http://www.spec.org/cpu2006/results/...107-18768.html
Also, you quoted peak values instead of base values:
Dell Opteron 6276 2.6GHz SPECint_rate/fp_rate: 117 / 93
Intel-submitted FX 3.6GHz SPECint_rate/fp_rate: 106 / 79
Are the systems really comparable? A desktop system with 8GB RAM running Windows 7 vs. a 128GB server running Red Hat Linux 6.1?
Apart from the HW differences, it is kind of expected that AMD's in-house compiler (?) produces better results than ICC, which is probably oblivious to BD's existence. Without knowing BD's caveats and new instruction sets (XOP, FMA), the scores are not unexpected.
I tested Bulldozer yesterday for some of my real-world h264 compression duties... The X6 is way better. I suspect task scheduling and process/core assignment have a lot to do with how big the difference is, but I have no such problem with Intel's HT whatsoever.
I think AMD would have been better off adding two cores to the X6 and improving it a bit. This 'work in progress' stuff is not good.. :(
Andreas Stiller (c't magazine) wrote in his article that the Intel 12.1 compilers create ~25% faster code (SPECfp_rate2006) compared to 12.0 while still using SSE3. AVX256 doesn't help much. AVX128 might show better performance by using the 3-operand format (although FP moves are free). FMA4 is not being used, as everyone would expect.
On i7-2600K the 12.1 compilers create ~9% faster code in SPECfp vs. 12.0:
Intel Compiler 12.1 results for i7-2600K
Intel Compiler 12.0 results for i7-2600K
Patching the GenuineIntel string and the processor family in the SPECint executables resulted in a 45% boost in libquantum and ~20% in Xalancbmk, according to him.
The dual quad-core Opteron 6204s, I think (posted on SemiAccurate), do blow the Zambezi score out of the water in the SPEC rate scores, though.
About 50% more in int_rate and close to 100% more in fp_rate (there is no non-rate score for those).
That gives it a 50% advantage in SPECfp_rate compared to the 2700K while equalling the SPECint_rate of the same 2700K.
So it clearly does have an impact (or the submitted Zambezi score is crippled... or the Opteron score is rigged...). :)
The numbers seem off too:
2x 1 module/2 cores,
but then also 2x 8MB L3.
That would mean it's an MCM of two chips with only 1 of 4 modules enabled each.
I said that it is different.
Edit/Addendum:
6200 series: G34 socket with 4 memory channels.
L3 is given as 16MB on this site:
http://www.amd.com/de/products/serve...l-numbers.aspx
So this means 2 dies.
It looks like it doesn't even have turbo mode. So it might be there for some specific high-frequency trading tasks, like pattern matching over tons of data (tick data). Memory throughput is as important as latency there.