Actually, Intel does a separate SRAM cell design for their L3 caches that's much denser. AMD simply re-uses the SRAM cells from its L2 design for the L3.
Guys, 2B never made sense in the first place when you do the rough sums; 1.2B sounds closer, but still a bit too low IMO:
These figures may be slightly off, but they're close enough to give an idea of how wrong 2B sounds.
4-core Deneb:
6MB L3: 458M
2MB L2: 152M
4 cores: 140M
CPU-NB + misc: ~8M
Total: ~758M
6-core Thuban:
6MB L3: 458M
3MB L2: 228M
6 cores: 210M
CPU-NB + misc: ~8M
Total: ~904M
4-module Bulldozer:
Module transistor count based on AMD's pre-release slide stating 268M transistors for one module including its 2MB L2 cache.
8MB L3 cache: ~610M
8MB L2 cache: ~610M
4 modules: ~240M (at ~60M each)
CPU-NB + misc: ~8M
Total: ~1.46B
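If anyone wants to sanity-check those rough sums, here's a minimal sketch (the per-block figures are just the estimates listed above, in millions of transistors, not official numbers):
Code:
# Rough transistor-count sanity check using the per-block estimates above
# (all values are rough guesses in millions of transistors, not official figures).
deneb     = {"6MB L3": 458, "2MB L2": 152, "4 cores": 140, "CPU-NB/misc": 8}
thuban    = {"6MB L3": 458, "3MB L2": 228, "6 cores": 210, "CPU-NB/misc": 8}
bulldozer = {"8MB L3": 610, "8MB L2": 610, "4 modules": 240, "CPU-NB/misc": 8}

for name, blocks in [("Deneb", deneb), ("Thuban", thuban), ("Bulldozer", bulldozer)]:
    print(f"{name}: ~{sum(blocks.values())}M transistors")  # ~758M, ~904M, ~1468M (~1.46B)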
Never, ever use performance slides from the manufacturer in a review... it will mostly backfire on you!!
I wouldn't call that a catastrophe, just horrible performance. AMD needs to abandon this architecture, and fast.
What a rubbish article... The guy acknowledges that it's faster than the 12C MC and Xeon, BUT... he then says it's "not fast enough" since it has 33% more cores and scores a bit lower than that: "only" 27/32% faster in SPECjbb2005/SAP. What happened to Ars Technica? Don't bother with the third page of the "article".
Well, everyone still clings to the "33% more cores, 50% more performance" claim... that was touted all over the internet for months like gospel... and he has a point... How would a K10 with 2 more cores on 32nm have done? Personally, I think not much worse.
This article (Ars Technica) is not bad at all, but it just said:
Did Anandtech ever say this?
Quote:
AMD faces an uphill struggle just to compete with its own old chips—let alone with Intel.
Quote:
So if performance/watt is your first priority, we think the current Xeons are your best option.
From heise.de, or in English:
Quote:
If performance/dollar is your first priority, we think the Opteron 6276 is an attractive alternative.
In Linpack GFLOPS, Opteron 6276 vs Xeon 5680: 205~239 GFLOPS vs 144 GFLOPS.
With AMD's Open64 compiler vs Intel Composer 2011 SP1: an integer comparison of 454 to 349, and 337 to 246 in floating point.
Also 502 MFLOPS/watt (6276) compared with 311 MFLOPS/watt (5680).
The comparison simply shows how FMA can double your FP throughput. FYI, Intel claims AVX-enabled 8-core SB Xeons will get a 2.1x improvement in Linpack over the current high-end Xeons. That would mean ~300 GFLOPS, completely changing the situation.
Never mind the fact that the 6282SE will not be the top model forever. Whenever Intel launches the new 8C SB-E that scores 300 GFLOPS in Linpack, AMD will be refreshing their lineup. We can expect a 2.8GHz stock model, so it's roughly 2.8/2.3 = 1.21, or 21% faster than what the 6276 gets in Linpack (around 289 GFLOPS). That is just a tad (~3%) behind Intel's projected performance with AVX enabled on their highest(?)-end model. The price difference between the two chips will be huge, though.
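Just to make that arithmetic explicit, here's a tiny sketch of the naive linear scaling behind those numbers (the 239 and 144 GFLOPS figures and the 2.1x AVX claim are taken from the posts above; the 2.8GHz part is purely hypothetical):
Code:
# Naive linear frequency scaling of the quoted Linpack figures.
opteron_6276_gflops = 239      # best Linpack figure quoted for the 2.3GHz 6276
xeon_5680_gflops    = 144      # Linpack figure quoted for the Xeon 5680

scaled_2_8ghz   = opteron_6276_gflops * 2.8 / 2.3   # hypothetical 2.8GHz refresh
intel_avx_claim = xeon_5680_gflops * 2.1            # Intel's claimed 2.1x gain with AVX

print(f"2.8GHz Opteron estimate:   ~{scaled_2_8ghz:.0f} GFLOPS")    # ~291
print(f"8C SB Xeon with AVX claim: ~{intel_avx_claim:.0f} GFLOPS")  # ~302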
You assume the process will improve significantly in 2-3 months. The 6282SE is a 140W chip; pumping the stock frequency up another 200MHz could be an issue without a new stepping.
Intel was never top dog in Linpack. MC pushed a lot more GFLOPS at a significantly lower $/GFLOPS. Looking at the HPC wins, I'd say price is less of a factor than assumed, otherwise Xeon wouldn't dominate. It would be interesting to see how 16 really fat SB cores (assuming 2P nodes) will do compared with 32 skinnier BD cores in HPC codes (except Linpack, which is a best case for both).
Quote:
We can expect a 2.8GHz stock model, so it's roughly 2.8/2.3 = 1.21, or 21% faster than what the 6276 gets in Linpack (around 289 GFLOPS). That is just a tad (~3%) behind Intel's projected performance with AVX enabled on their highest(?)-end model. The price difference between the two chips will be huge, though.
Well, the guy who knows about GloFo stuff (rich_wargo @ the SA forum) hints at an improved process node in Q1. So maybe they will fix the yield and clock/power issues that obviously plague both Llano and Bulldozer. They managed to launch a 16C/8M 2.6GHz chip within the max TDP bracket on G34, on this crappy process. So I expect another speed bump in Q1. 100MHz is too low for a speed bump, so the next step is 2.8GHz. This chip would put AMD in a good position in SPEC rate tests (both integer and FP throughput). It would be a good duel to watch in HPC workloads: 4P 8C SB-EP @ 3GHz @ 150W vs 2.8GHz 8M/16C Opteron @ 140W.
C'mon, rich knows nada. And I doubt the process is solely to blame. BD is massive, and its high-speed nature could mean it's just Prescott reloaded: no matter how good the process is/was, it can't make BD/Prescott shine. Intel's 90nm was outstanding by any metric, and Dothan fully showed that. However, that couldn't save Prescott's bacon. I have the impression something similar is going on here: the process is reasonably OK, yields are poorer than planned due to intrinsic things like gate-first, BUT BD and Llano aren't first-class engineering jobs.
And with the relationship getting really sour, GF probably doesn't give a damn about AMD's issues with 32nm and simply waits for the pay-only-for-good-die deal to end. GF is taking huge losses, and part of the blame lies with the design, which they have no influence over.
And their other customers care more about 28nm bulk than 32nm SOI HKMG. The last yield figures put 28nm at 1-2 good dies per wafer. They must be dancing in the aisles at GF.
Edit: just found something to reinforce my point that the process is acceptable:
http://www.eetimes.com/electronics-n...benefits-TSMC-
Quote:
Meanwhile, Globalfoundries said it would not comment on its customer's foundry selection process or on their products unless they did so first. The spokesman also said problems with Llano had been specific to that product and that yields for AMD's 32/28nm Bulldozer products were on target and not affecting AMD's ability to meet customer commitments.
“We are still the only foundry producing HKMG products that can be purchased in stores now,” the Globalfoundries spokesman said, noting that the fab expected to ship “far more” HKMG volume in 2011 than all other foundries combined.
There are some serious issues with the process. Anand mentioned it, along with a few other informed people. Yes, maybe the engineering has some issues as well, since they tried something very different, so hopefully it will be fixed in later revisions. But the process definitely is not "reasonably OK".
The NetBurst-based Prescott design used a lot of high-speed dynamic logic, which is not only faster (as required for an aggressive 8 FO4 (IIRC) frequency goal) but uses much more power and more transistors. BD is a static CMOS design, using faster logic styles only for individual speed paths.
A look into the BD/Llano ISSCC papers (incl. the L3 shmoo plot) should indicate how they expected the designs to behave on the 32nm process.
There are many areas on the die which seem to be empty and might just contain wires and repeaters.
And as already said, there are different types of transistors with different specs and sizes.
IIRC, Llano contains ~1B transistors.
AMD also works with macro blocks containing specific logic circuits. These might cause slightly less efficient placement while being size-optimized in themselves.
It's official now. AMD contacted AT:
http://www.anandtech.com/show/5176/a...unt-12b-not-2b
Quote:
This is a bit unusual. I got an email from AMD PR this week asking me to correct the Bulldozer transistor count in our Sandy Bridge E review. The incorrect number, provided to me (and other reviewers) by AMD PR around 3 months ago was 2 billion transistors. The actual transistor count for Bulldozer is apparently 1.2 billion transistors...
Well, it kinda was, but the website that claimed they were contacted by AMD never really posted what AMD said. Apparently AMD contacted several websites, and AT was the only one to post anything substantial.
The funny thing is we still didn't get an explanation for the 2B figure...
@ Lightman
That was funny indeed :D
http://www.anandtech.com/show/5176/a...unt-12b-not-2b
Quote:
This is a bit unusual. I got an email from AMD PR this week asking me to correct the Bulldozer transistor count in our Sandy Bridge E review. The incorrect number, provided to me (and other reviewers) by AMD PR around 3 months ago was 2 billion transistors. The actual transistor count for Bulldozer is apparently 1.2 billion transistors. I don't have an explanation as to why the original number was wrong, just that the new number has been triple checked by my contact and is indeed right. The total die area for a 4-module/8-core Bulldozer remains correct at 315mm2.
The only thing with the number of transistors that bugs me is why it took them so long to realise they'd given out bad information. I mean seriously, is their QA so bad that even their counting/marketing is faulty?
The people in a big company might do very good work, but this might get lost or reduced in quality due to processes and low-quality decisions further up in the hierarchy.
I saw a review where the reviewer had a (retail) FX that had 24MB of cache rather than 16MB. Could another 8MB of L3 be those missing transistors? I wouldn't know, so just asking the question.
Could it also be that the server variants have the full complement of cache (since their clocks are lower, it fits the TDP), but the retail parts cannot fit inside the TDP with all that cache, so some of it has been disabled?
The transistors might still be there but just not used in retail parts.
The marketing dept then failed to rationalise the server transistor count (used) against the retail transistor count (used), which then compounded the 'so many transistors for so little performance' argument? If so, no wonder they got culled.
Just some musings from me here.
BSN* is reporting that AMD's CFO Thomas Seifert was allegedly let go yesterday... If true, then it's one of those management decisions that make zero sense (like Killebrew, Moorhead, David Hoff, Rick Bergman, Dirk Meyer, etc.). AMD's debt right now is around $1.8B, while back in 2009, when he was appointed to the position, it was ~$7B.
Since Interlagos uses two of the same dies as the desktop variant and the die shot has been shown to the public, there would have been comments if there were more than the known 8MB of L3.
BTW, the transistor density of the modules (213M on 30.9 sqmm) is rather high (6.89M/sqmm) and the 1.2B number also seems to be a bit too low.
The internet is full of people (besides google cache) creating quick copies:
http://investorvillage.com/smbd.asp?...g&mid=11217202
Since AMD did not want to submit SPECint/SPECfp scores, Intel did it instead, just to rub salt in the wounds:
SPECint_base2006/SPECfp_base2006 (autoparallel=yes)
i7-2700k (3.5/3.9 GHz) 45.5 / 56.1
FX-8150 (3.6/4.2 GHz) 20.8 / 25.7
X6-1100T (3.3/3.7 GHz) 25.0 / 32.2
http://www.spec.org/cpu2006/results/res2011q4/
In the most widely used and accepted industry-standard benchmark, this clearly shows how BD has significantly lower IPC than K10.5 despite a 300-500MHz clock advantage.
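To put a rough number on "significantly lower IPC", here's a per-clock comparison of the scores above (assuming the largely single-threaded runs sit near the listed turbo clocks; that assumption is mine, not from the submissions):
Code:
# Rough per-GHz comparison of the SPECint_base2006 scores quoted above,
# assuming single-threaded subtests run near the listed turbo clocks.
scores = {                      # (SPECint_base2006, assumed turbo GHz)
    "FX-8150":  (20.8, 4.2),
    "X6-1100T": (25.0, 3.7),
}
for cpu, (score, ghz) in scores.items():
    print(f"{cpu}: {score / ghz:.2f} points/GHz")
# The FX-8150 ends up roughly 25-30% behind the X6 per clock on these numbers.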
wowa.. I love bulldozers and this cpu is giving them a bad name, also making it harder for me to find awesome bulldozer media on the internet :p
Autoparallel FAIL for ICC, actually. It doesn't seem to work well for more than 4 cores (the i7-3960X loses significantly to the i7-2700K in lots of tests). They definitely should have run it with OMP_NUM_THREADS=4 for the FX-8150 and set core affinity to the "first core in each module". But it is Intel, and I don't think they have any intention of making AMD's processor look good.
And comparing SSE3 code for AMD and AVX code for Intel is totally irrelevant.
It is SPECint, more or less single-threaded. Autoparallel cannot offer significant speedups.
Why? AVX and SSE have the same throughput on BD since it did not spend any transistors to optimize for AVX. The only question is whether using FMA would have made a difference.
Quote:
And comparing SSE3 code for AMD and AVX code for Intel is totally irrelevant.
It offers both speedups (e.g. libquantum) and slowdowns (e.g. h264ref), depending on the subtest and hardware. To limit slowdowns, Intel set the number of threads to the number of cores. For example, the slowdown of the 3960X compared to the 2700K in h264ref perfectly correlates with the fact that the FX-8150 loses the most in this subtest compared to the 1100T.
What? You know there is a reason for 3-operand instructions. If you don't understand it from a developer's viewpoint, at least you can rely on tests comparing SSE and AVX versions of x264 or other software.
And besides this, ICC doesn't allow SSE instructions above SSE3 on AMD hardware, so it is AVX (which is a full superset of the SSE instructions) for Intel vs SSE without SSSE3, SSE4.1 and SSE4.2 for AMD.
That's mostly the compilers cracking libquantum (known for a long time). And what's the problem with setting the number of threads to the number of cores? It's the fairest approach.
The difference between 128-bit SSE and 128-bit AVX is extremely limited even on SB. Constantly referring to speed-ups of AVX (256-bit) vs. SSE on SB and extrapolating that to BD is faulty; SB executes 256-bit instructions in a single cycle, BD breaks them into 2x128-bit ones. I'll restate my point: BD's AVX speedups are limited because it wasn't designed to perform, due to time pressure, just to be compatible. The difference between 128-bit SSE and 128/256-bit AVX on BD is going to be in the noise region (actually 256-bit AVX is discouraged since it incurs penalties in the splitting and recombining phases).
Quote:
What? You know there is a reason for 3-operand instructions. If you don't understand it from a developer's viewpoint, at least you can rely on tests comparing SSE and AVX versions of x264 or other software.
And besides this, ICC doesn't allow SSE instructions above SSE3 on AMD hardware, so it is AVX (which is a full superset of the SSE instructions) for Intel vs SSE without SSSE3, SSE4.1 and SSE4.2 for AMD.
There is a risk of slowing an application down if autoparallelization is done on hardware that has a lot of shared resources between cores, due to the overhead that additional threads add. And for the record, I do not blame Intel for running the test that way, as they are not supposed to know how to get the best performance out of AMD's processor. I just say that a better result on the 8150 could have been achieved if autoparallelization were limited to 4 threads and thread affinity distributed across the modules (see the sketch at the end of this post).
You are wrong. The 3-operand SSE5 instructions that later appeared as 3-operand AVX versions of SSE instructions reduce the number of registers needed in code, reduce latencies by removing unnecessary MOVs, and reduce code size. All of this allows higher utilization of the FPU's functional units and increases performance without any need to increase the theoretical throughput of those units. x264 is a pretty good example of such increased performance: vector-integer throughput is unchanged on SB, and yet there is quite a substantial boost in performance on both SB and the FX-8150.
And as I said before, ICC not only doesn't allow building AVX code for BD, it also doesn't allow using a large part of the SSEx instruction set.
If you want to compare single-threaded performance between the FX-8150, i7-2700 and Phenom II X6 using ICC and SPECint, then you should turn off autoparallelization, build for an SSE3 target and then compare the resulting performance. Everything else is just marketing.
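For what it's worth, a minimal sketch of the 4-threads-one-per-module idea from earlier in this post (the benchmark binary name is a placeholder, and the assumption that cores 0, 2, 4 and 6 are the first core of each FX-8150 module is mine):
Code:
# Illustrative only: run an auto-parallelized binary with 4 OpenMP threads,
# confined to one core per Bulldozer module (cores 0,2,4,6 are assumed to be
# the first core of each module on an FX-8150).
import os
import subprocess

env = dict(os.environ, OMP_NUM_THREADS="4")   # cap the auto-parallel thread count

# taskset confines the process (and all of its threads) to the listed cores;
# "./spec_subtest_binary" is just a placeholder name.
subprocess.run(["taskset", "-c", "0,2,4,6", "./spec_subtest_binary"],
               env=env, check=True)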
The Open64 compiler produces up to 25% faster code than Intel's latest version 12 compilers, even though the intentionally crippled results submitted by Intel were run on a ~40% higher-clocked Bulldozer...
Open64 4.2.5.2 Compiler suite: (SPEC results submitted by Dell)
2.6 GHz Bulldozer: SPEC_int_rate 134, SPEC_FP_rate 100
Intel Studio XE 12.0.3.176 compilers: (SPEC results submitted by Intel)
3.6 GHz Bulldozer: SPEC_int_rate 115, SPEC_FP_rate 79.8
http://www.spec.org/cpu2006/results/res2011q4/
Hans
I'm sorry, but I couldn't find your Dell result. It looks to me like you looked for the highest scores and divided by 4 to get the same number of cores. Example:
For FP, you got the 100 from here: http://www.spec.org/cpu2006/results/...107-18771.html
For INT, you got the 134 from here: http://www.spec.org/cpu2006/results/...107-18768.html
Also, you quoted peak values instead of base values:
Dell Opteron 6276 2.6GHz SPECint_rate/fp_rate: 117 / 93
Intel-submitted FX 3.6GHz SPECint_rate/fp_rate: 106 / 79
Are the systems really comparable? A desktop system with 8GB RAM running Windows 7 vs. a 128GB server running Red Hat Linux 6.1?
Apart from the HW differences, it is kind of expected that AMD's in-house compiler (?) produces better results than ICC, which is probably oblivious to BD's existence. Without knowing BD's caveats and new instruction sets (XOP, FMA), the scores are not unexpected.
I tested Bulldozer yesterday for some of my real-world h264 compression duties... The X6 is way better. I suspect task scheduling and process/core assignment have a lot to do with how big the difference is, but I have no such problem with Intel's HT whatsoever.
I think AMD would have been better off adding two cores to the X6 and improving it a bit. This 'work in progress' stuff is not good.. :(
Andreas Stiller (c't magazine) wrote in his article that the Intel 12.1 compilers create ~25% faster code (SPECfp_rate2006) compared to 12.0 while still using SSE3. AVX256 doesn't help much. AVX128 might show better performance by using the 3-operand format (although FP moves are free). FMA4 is not being used, as everyone would expect.
On i7-2600K the 12.1 compilers create ~9% faster code in SPECfp vs. 12.0:
Intel Compiler 12.1 results for i7-2600K
Intel Compiler 12.0 results for i7-2600K
Patching the GenuineIntel string and the processor family in the SPECint executables resulted in a 45% boost in libquantum and ~20% in Xalancbmk, according to him.
The dual quad-core Opteron 6204s, I think (posted on SemiAccurate), do blow the Zambezi score out of the water in the SPEC rate scores, though.
About 50% more in int_rate and close to 100% more in fp_rate (there is no non-rate score for those).
That gives it a 50% advantage in SPECfp_rate compared to the 2700K while equalling the SPECint_rate of the same 2700K.
So it clearly does have an impact (or the submitted Zambezi score is crippled... or the Opteron score is rigged...). :)
The numbers seem off too:
2x 1 module/2 cores,
but then also 2x 8MB L3.
That would mean it's an MCM of two chips with only 1 of 4 modules enabled each.
I said that it is different.
Edit/Addendum:
6200 series: G34 socket with 4 memory channels.
L3 is given as 16MB on this site:
http://www.amd.com/de/products/serve...l-numbers.aspx
So this means 2 dies.
It looks like it doesn't even have turbo mode. So it might be there for some specific high-frequency trading tasks, like pattern matching over tons of data (tick data). Memory throughput is as important as latency there.