As a joke:
"We designed 2B CPU but had to disable 800M of them due to bugs/thermal issues. When we fix all problems, possibly in BDv2 or BDv3 chips size will stay almost the same but transistor count will return to 2B ;)"
@ Lightman
That was funny indeed :D
http://www.anandtech.com/show/5176/a...unt-12b-not-2b
Quote:
This is a bit unusual. I got an email from AMD PR this week asking me to correct the Bulldozer transistor count in our Sandy Bridge E review. The incorrect number, provided to me (and other reviewers) by AMD PR around 3 months ago was 2 billion transistors. The actual transistor count for Bulldozer is apparently 1.2 billion transistors. I don't have an explanation as to why the original number was wrong, just that the new number has been triple checked by my contact and is indeed right. The total die area for a 4-module/8-core Bulldozer remains correct at 315mm2.
The only thing about the transistor count that bugs me is why it took them so long to realise they'd given out bad information. I mean, seriously, is their QA so bad that even their counting/marketing is faulty?
The people in a big company might do very good work, but that work can get lost or degraded by processes and low-quality decisions further up the hierarchy.
I saw a review where the reviewer had a retail FX that reported 24MB of cache rather than 16MB. Could another 8MB of L3 account for those missing transistors? I wouldn't know, so I'm just asking the question.
Could it also be that the server variants have the full complement of cache because their lower clocks fit the TDP, while the retail parts can't stay inside the TDP with all that cache enabled, so some of it has been disabled?
The transistors might still be there but just not used in retail parts.
Has the marketing dept then failed to distinguish between the server transistor count (all present) and the retail transistor count (actually enabled), which then compounded the 'so many transistors for so little performance' argument? If so, no wonder they got culled.
Just some musings from me here.
BSN* is reporting that AMD's CFO Thomas Seifert has allegedly been let go yesterday... If true, then it's one of those management decisions that make zero sense (like Killebrew, Moorhead, David Hoff, Rick Bergman, Dirk Meyer, etc.). AMD's debt right now is around 1.8B, while back in 2009, when he was appointed to the position, it was ~7B.
Since Interlagos uses two of the same dies as the desktop variant and the die shot has been shown to the public, there would have been comments if there were more than the known 8MB of L3.
BTW, the transistor density of the modules (213M in 30.9 mm²) is rather high (6.89M/mm²), and the 1.2B number also seems a bit too low.
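A quick back-of-the-envelope check on those figures (my own arithmetic, assuming the 315 mm² die size quoted earlier in the thread):

```
module density:  213M / 30.9 mm²  ≈ 6.89 M/mm²
die average:     1.2B / 315 mm²   ≈ 3.81 M/mm²
```

Since SRAM is normally denser than logic, a die-wide average barely half the module density does look suspicious, though uncore and I/O areas can legitimately pull the average down.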
Besides Google's cache, the internet is full of people creating quick copies:
http://investorvillage.com/smbd.asp?...g&mid=11217202
Since AMD did not want to submit SPECint/FP scores, Intel did it instead, just to rub salt in the wound:
SPECint_base2006/SPECfp_base2006 (autoparallel=yes)
i7-2700k (3.5/3.9 GHz) 45.5 / 56.1
FX-8150 (3.6/4.2 GHz) 20.8 / 25.7
X6-1100T (3.3/3.7 GHz) 25.0 / 32.2
http://www.spec.org/cpu2006/results/res2011q4/
The most widely used and accepted industry-standard benchmark clearly shows that BD has significantly lower IPC than K10.5, despite a 300-500MHz clock advantage.
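Normalizing those scores per clock (a rough sketch: it assumes the single-threaded runs sustain their full turbo frequencies, which the SPEC reports don't guarantee):

```
SPECint per GHz:  FX-8150  20.8 / 4.2 ≈ 4.95
                  X6-1100T 25.0 / 3.7 ≈ 6.76   (~36% higher)
SPECfp  per GHz:  FX-8150  25.7 / 4.2 ≈ 6.12
                  X6-1100T 32.2 / 3.7 ≈ 8.70   (~42% higher)
```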
wowa.. I love bulldozers and this cpu is giving them a bad name, also making it harder for me to find awesome bulldozer media on the internet :p
Autoparallel FAIL for ICC, actually. It doesn't seem to work well for more than 4 cores (the i7-3960X loses significantly to the i7-2700K in lots of tests). They should definitely run it with OMP_NUM_THREADS=4 for the FX-8150 and set core affinity so each thread lands on the first core of a module. But it is Intel, and I don't think they have any intention of making an AMD processor look good.
And comparing SSE3 code for AMD and AVX code for Intel is totally irrelevant.
It is SPECint, which is more or less single-threaded. Autoparallel cannot offer significant speedups.
Why? AVX and SSE have the same throughput on BD, since AMD did not spend any transistors optimizing for AVX. The only question is whether using FMA would have made a difference.
Quote:
And comparing SSE3 code for AMD and AVX code for Intel is totally irrelevant.
It offers both speedups (e.g. libquantum) and slowdowns (e.g. h264ref), depending on the subtest and hardware. To limit slowdowns, Intel set the number of threads to the number of cores. For example, the slowdown of the 3960X compared to the 2700K in h264ref correlates perfectly with the fact that the FX-8150 loses the most in this subtest compared to the 1100T.
What? You know there is a reason for 3-operand instructions. If you don't understand it from a developer's viewpoint, you can at least rely on tests comparing SSE and AVX versions of x264 or other software.
And besides this, ICC doesn't allow SSE instructions above SSE3 on AMD hardware, so it is AVX (a full superset of SSE) for Intel vs. SSE without SSSE3, SSE4.1 and SSE4.2 for AMD.
That's mostly the compilers cracking libquantum (known for a long time). And what's the problem with setting the number of threads to the number of cores? That's the fairest approach.
The difference between 128-bit SSE and 128-bit AVX is extremely limited even on SB. Constantly referring to speedups of 256-bit AVX vs. SSE on SB and extrapolating that to BD is faulty: SB executes 256-bit instructions in a single cycle, while BD breaks them into 2x128-bit ones. I'll restate my point: BD's AVX speedups are limited because, under time pressure, it wasn't designed to perform with AVX, just to be compatible with it. The difference between 128-bit SSE and 128/256-bit AVX on BD is going to be in the noise region (actually 256-bit AVX is discouraged, since it incurs penalties in the splitting and recombining phases).
Quote:
What? You know there is a reason for 3-operand instructions. If you don't understand it from a developer's viewpoint, you can at least rely on tests comparing SSE and AVX versions of x264 or other software.
And besides this, ICC doesn't allow SSE instructions above SSE3 on AMD hardware, so it is AVX (a full superset of SSE) for Intel vs. SSE without SSSE3, SSE4.1 and SSE4.2 for AMD.
There is a risk of slowing an application down if autoparallelization is done on hardware that shares a lot of resources between cores, due to the overhead the additional threads add. And for the record, I do not blame Intel for running the test that way; they are not supposed to know how to get the best performance from AMD's processor. I'm just saying that a better result on the 8150 could be achieved if autoparallelization were limited to 4 threads, with thread affinity distributed across the modules.
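A minimal sketch of what that launch could look like (assuming GCC's OpenMP runtime, and assuming the kernel enumerates the first core of each FX-8150 module as CPUs 0, 2, 4 and 6 — the actual numbering depends on the kernel; `./benchmark` is a stand-in for the real SPEC binary):

```shell
# One auto-parallel thread per module instead of one per core.
export OMP_NUM_THREADS=4
# Pin those threads so no module runs two of them (GNU OpenMP syntax).
export GOMP_CPU_AFFINITY="0 2 4 6"
# For non-OpenMP launchers, taskset achieves the same pinning:
#   taskset -c 0,2,4,6 ./benchmark
echo "$OMP_NUM_THREADS threads on CPUs $GOMP_CPU_AFFINITY"
```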
You are wrong. The 3-operand SSE5 instructions that later appeared as 3-operand AVX versions of SSE instructions reduce the number of registers needed in code, reduce latencies by removing unnecessary MOVs, and reduce code size. All of this allows higher utilization of the FPU's functional units and increases performance without any increase in the theoretical throughput of those units. x264 is a pretty good example of such a gain: vector-int throughput is the same on SB, and yet there is quite a substantial boost in performance on both SB and the FX-8150.
And as I said before, not only does ICC not allow building AVX code for BD, it also doesn't allow use of a large part of the SSEx instruction set.
If you want to compare single-threaded performance between the FX-8150, i7-2700 and Phenom II X6 using ICC and SPECint, then you should turn off autoparallelization, build for an SSE3 target, and then compare the resulting performance. Everything else is just marketing.
The Open64 compiler produces up to 25% faster code than Intel's latest version 12 compilers, even though the intentionally crippled results submitted by Intel ran on a 40%-higher-clocked Bulldozer:
Open64 4.2.5.2 compiler suite (SPEC results submitted by Dell):
2.6 GHz Bulldozer: SPECint_rate 134, SPECfp_rate 100
Intel Studio XE 12.0.3.176 compilers (SPEC results submitted by Intel):
3.6 GHz Bulldozer: SPECint_rate 115, SPECfp_rate 79.8
http://www.spec.org/cpu2006/results/res2011q4/
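Putting rough numbers on that comparison (my arithmetic from the figures above; note that rate results also depend on the core and socket counts of the submitted systems, which aren't shown here):

```
SPECint_rate: 134 / 115  ≈ 1.17   (Open64 ~17% ahead)
SPECfp_rate:  100 / 79.8 ≈ 1.25   (Open64 ~25% ahead)
clock ratio:  3.6 / 2.6  ≈ 1.38   (the ICC run clocked ~38% higher)
```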
Hans