As a joke:
"We designed 2B CPU but had to disable 800M of them due to bugs/thermal issues. When we fix all problems, possibly in BDv2 or BDv3 chips size will stay almost the same but transistor count will return to 2B ;)"
@ Lightman
That was funny indeed :D
http://www.anandtech.com/show/5176/a...unt-12b-not-2b
Quote:
This is a bit unusual. I got an email from AMD PR this week asking me to correct the Bulldozer transistor count in our Sandy Bridge E review. The incorrect number, provided to me (and other reviewers) by AMD PR around 3 months ago was 2 billion transistors. The actual transistor count for Bulldozer is apparently 1.2 billion transistors. I don't have an explanation as to why the original number was wrong, just that the new number has been triple checked by my contact and is indeed right. The total die area for a 4-module/8-core Bulldozer remains correct at 315mm2.
The only thing about the transistor count that bugs me is why it took them so long to realise they'd given out bad information. I mean, seriously, is their QA so bad that even their counting/marketing is faulty?
The people in a big company might do very good work, but that work can get lost or degraded by processes and low-quality decisions further up the hierarchy.
I saw a review where the reviewer had a retail FX that reported 24MB of cache rather than 16MB. Could another 8MB of L3 account for those missing transistors? I wouldn't know, so I'm just asking the question.
Could it also be that the server variants have the full complement of cache because their lower clocks fit the TDP, while the retail parts can't stay inside the TDP with all that cache enabled, so some of it has been disabled?
The transistors might still be there but just not used in retail parts.
Has the marketing dept then failed to distinguish between the server transistor count (all present) and the retail transistor count (actually enabled), which then compounded the 'so many transistors for so little performance' argument? If so, no wonder they got culled.
Just some musings from me here.
BSN* is reporting that AMD's CFO Thomas Seifert has allegedly been let go yesterday... If true, then it's one of those management decisions that make zero sense (like Killebrew, Moorhead, David Hoff, Rick Bergman, Dirk Meyer, etc.). AMD's debt right now is around 1.8B, while back in 2009, when he was appointed to the position, it was ~7B.
Since Interlagos uses two of the same dies as the desktop variant and the die shot has been shown to the public, there would have been comments if there were more than the known 8MB of L3.
BTW, the transistor density of the modules (213M in 30.9 mm²) is rather high (6.89M/mm²), and the 1.2B number also seems a bit too low.
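A quick back-of-the-envelope check on those figures (my own arithmetic, assuming the 315 mm² die size quoted earlier in the thread):

```
module density:  213M / 30.9 mm²  ≈ 6.89 M/mm²
die average:     1.2B / 315 mm²   ≈ 3.81 M/mm²
```

Since SRAM is normally denser than logic, a die-wide average barely half the module density does look suspicious, though uncore and I/O areas can legitimately pull the average down.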
Besides Google's cache, the internet is full of people creating quick copies:
http://investorvillage.com/smbd.asp?...g&mid=11217202
Since AMD did not want to submit SPECint/FP scores, Intel did it instead, just to rub salt in the wound:
SPECint_base2006/SPECfp_base2006 (autoparallel=yes)
i7-2700k (3.5/3.9 GHz) 45.5 / 56.1
FX-8150 (3.6/4.2 GHz) 20.8 / 25.7
X6-1100T (3.3/3.7 GHz) 25.0 / 32.2
http://www.spec.org/cpu2006/results/res2011q4/
The most widely used and accepted industry-standard benchmark clearly shows that BD has significantly lower IPC than K10.5, despite a 300-500MHz clock advantage.
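Normalizing those scores per clock (a rough sketch: it assumes the single-threaded runs sustain their full turbo frequencies, which the SPEC reports don't guarantee):

```
SPECint per GHz:  FX-8150  20.8 / 4.2 ≈ 4.95
                  X6-1100T 25.0 / 3.7 ≈ 6.76   (~36% higher)
SPECfp  per GHz:  FX-8150  25.7 / 4.2 ≈ 6.12
                  X6-1100T 32.2 / 3.7 ≈ 8.70   (~42% higher)
```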
wowa.. I love bulldozers and this cpu is giving them a bad name, also making it harder for me to find awesome bulldozer media on the internet :p
Autoparallel FAIL for ICC, actually. It doesn't seem to work well for more than 4 cores (the i7-3960X loses significantly to the i7-2700K in lots of tests). They should definitely run it with OMP_NUM_THREADS=4 for the FX-8150 and set core affinity so each thread lands on the first core of a module. But it is Intel, and I don't think they have any intention of making an AMD processor look good.
And comparing SSE3 code for AMD and AVX code for Intel is totally irrelevant.
It is SPECint, which is more or less single-threaded. Autoparallel cannot offer significant speedups.
Why? AVX and SSE have the same throughput on BD, since AMD did not spend any transistors optimizing for AVX. The only question is whether using FMA would have made a difference.
Quote:
And comparing SSE3 code for AMD and AVX code for Intel is totally irrelevant.
It offers both speedups (e.g. libquantum) and slowdowns (e.g. h264ref), depending on the subtest and hardware. To limit slowdowns, Intel set the number of threads to the number of cores. For example, the slowdown of the 3960X compared to the 2700K in h264ref correlates perfectly with the fact that the FX-8150 loses the most in this subtest compared to the 1100T.
What? You know there is a reason for 3-operand instructions. If you don't understand it from a developer's viewpoint, you can at least rely on tests comparing SSE and AVX versions of x264 or other software.
And besides this, ICC doesn't allow SSE instructions above SSE3 on AMD hardware, so it is AVX (a full superset of SSE) for Intel vs. SSE without SSSE3, SSE4.1 and SSE4.2 for AMD.
That's mostly the compilers cracking libquantum (known for a long time). And what's the problem with setting the number of threads to the number of cores? That's the fairest approach.
The difference between 128-bit SSE and 128-bit AVX is extremely limited even on SB. Constantly referring to speedups of 256-bit AVX vs. SSE on SB and extrapolating that to BD is faulty: SB executes 256-bit instructions in a single cycle, while BD breaks them into 2x128-bit ones. I'll restate my point: BD's AVX speedups are limited because, under time pressure, it wasn't designed to perform with AVX, just to be compatible with it. The difference between 128-bit SSE and 128/256-bit AVX on BD is going to be in the noise region (actually 256-bit AVX is discouraged, since it incurs penalties in the splitting and recombining phases).
Quote:
What? You know there is a reason for 3-operand instructions. If you don't understand it from a developer's viewpoint, you can at least rely on tests comparing SSE and AVX versions of x264 or other software.
And besides this, ICC doesn't allow SSE instructions above SSE3 on AMD hardware, so it is AVX (a full superset of SSE) for Intel vs. SSE without SSSE3, SSE4.1 and SSE4.2 for AMD.
There is a risk of slowing an application down if autoparallelization is done on hardware that shares a lot of resources between cores, due to the overhead the additional threads add. And for the record, I do not blame Intel for running the test that way; they are not supposed to know how to get the best performance from AMD's processor. I'm just saying that a better result on the 8150 could be achieved if autoparallelization were limited to 4 threads, with thread affinity distributed across the modules.
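A minimal sketch of what that launch could look like (assuming GCC's OpenMP runtime, and assuming the kernel enumerates the first core of each FX-8150 module as CPUs 0, 2, 4 and 6 — the actual numbering depends on the kernel; `./benchmark` is a stand-in for the real SPEC binary):

```shell
# One auto-parallel thread per module instead of one per core.
export OMP_NUM_THREADS=4
# Pin those threads so no module runs two of them (GNU OpenMP syntax).
export GOMP_CPU_AFFINITY="0 2 4 6"
# For non-OpenMP launchers, taskset achieves the same pinning:
#   taskset -c 0,2,4,6 ./benchmark
echo "$OMP_NUM_THREADS threads on CPUs $GOMP_CPU_AFFINITY"
```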
You are wrong. The 3-operand SSE5 instructions that later appeared as 3-operand AVX versions of SSE instructions reduce the number of registers needed in code, reduce latencies by removing unnecessary MOVs, and reduce code size. All of this allows higher utilization of the FPU's functional units and increases performance without any increase in the theoretical throughput of those units. x264 is a pretty good example of such a gain: vector-int throughput is the same on SB, and yet there is quite a substantial boost in performance on both SB and the FX-8150.
And as I said before, not only does ICC not allow building AVX code for BD, it also doesn't allow use of a large part of the SSEx instruction set.
If you want to compare single-threaded performance between the FX-8150, i7-2700 and Phenom II X6 using ICC and SPECint, then you should turn off autoparallelization, build for an SSE3 target, and then compare the resulting performance. Everything else is just marketing.
The Open64 compiler produces up to 25% faster code than Intel's latest version 12 compilers, even though the intentionally crippled results submitted by Intel ran on a 40%-higher-clocked Bulldozer:
Open64 4.2.5.2 compiler suite (SPEC results submitted by Dell):
2.6 GHz Bulldozer: SPECint_rate 134, SPECfp_rate 100
Intel Studio XE 12.0.3.176 compilers (SPEC results submitted by Intel):
3.6 GHz Bulldozer: SPECint_rate 115, SPECfp_rate 79.8
http://www.spec.org/cpu2006/results/res2011q4/
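Putting rough numbers on that comparison (my arithmetic from the figures above; note that rate results also depend on the core and socket counts of the submitted systems, which aren't shown here):

```
SPECint_rate: 134 / 115  ≈ 1.17   (Open64 ~17% ahead)
SPECfp_rate:  100 / 79.8 ≈ 1.25   (Open64 ~25% ahead)
clock ratio:  3.6 / 2.6  ≈ 1.38   (the ICC run clocked ~38% higher)
```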
Hans