http://www.hpcwire.com/hpcwire/2011-...ig_supers.html
Fiery, the developer of the popular AIDA64, said that the whole BD microarchitecture is crappy and that the 8-core Zambezi can't even compete with the 6-core Thuban in some cases.
He wouldn't be surprised if Zambezi is never launched. It looks like they have an Interlagos sample.
link
From OBR's test results, which I've always believed were genuine, Bulldozer doesn't perform that badly; it should beat Thuban in most cases.
Oliverda, he is comparing it to SB-E, not SB. Did you think some pseudo 8-core (more like 4-core + HT) could compete with a $500+ 6-core (12-thread) chip? I wouldn't mind, but it was unrealistic from the start.
here is something interesting:
http://forums.anandtech.com/showpost...25&postcount=2
You can see the difference, so if the AIDA author has another Interlagos as you said, it may perform like this one from September rather than the model tested in July, so the reality may differ quite a bit.
By the way: I didn't read everything in the original (I was too lazy and it's quite a hard language to read), but it didn't look like he knows any final numbers. He is just saying BD won't be a messiah destroying Intel's lineup with sheer power.
That's what I thought and said at #2133. The 8120 was scheduled for 2012 Q1, so why is it suddenly ahead of schedule and with lots of problems? They're accelerating the respins, and the spec is changing more frequently than before. I'm afraid AMD is playing a renaming game with us, and that the final naming scheme and spec will differ. Please, everyone, don't focus only on benchmarks.
http://www.bug.hr/_cache/bcf715ee2a6...8f8ed6aa70.jpg
This VR-Zone test is "sh1t"... No real performance, because in wPrime it is slower than an Athlon X4 with lower clocks, and in R11.5 it is slower than an X6 1055T.
Someone deleted my thread in VRFORUMS (English version), saying that they need to clarify with AMD what's wrong with the result. Maybe some L3 prefetching and CnQ microcode issue in the BIOS?
see the screen shot.
http://i54.tinypic.com/23keoap.jpg
The issue with the previous stepping seems related to power and only power, not the internal µarch.
Anyway, I just got my hands on a final retail CPU with the retail box. The box art is similar to the ones leaked some months ago, except they changed the colors and replaced the LGA CPU with a PGA one on the front. As for performance, nothing has changed. So, if this CPU doesn't have "shipping performance", the fix will come post-launch. And that would suck a lot.
Jack, it's rather simple. If you run a single thread that is SIMD heavy on a module, all FPU resources (2x FMAC) will be dedicated to one core. Then, if you run the same test but MT, across all 4 modules, all 8 cores will share 4 FlexFPs (4 256-bit units in total). You will have scaling from ST to MT akin to scaling from a single core to a QC, only this time your single-thread results SHOULD be very high, as you are using one double-sized, very powerful FPU (the whole FlexFP within a module). Scaling is better than pure 4x since SMT now works within the FlexFP, as the 2 cores per module share 2 128-bit FMACs, and this improves performance additionally.
The problem is, however, that one FlexFP is somehow slower than one Thuban core.
@xsecret
Well, xsecret, VR-Zone just ran the same chip on 2 different motherboards. Guess what: on different boards, the same chip performed with an 80% delta... So it seems it's a firmware problem.
Well, all those "performance is not final! not shipping performance! don't trust leaks!" claims now seem like bullshat and a way for AMD to prevent leaks. I have had the same performance with the B2 chip for weeks and, more importantly, the benchmarks published under NDA by AMD itself are really close to that.
That smells really really bad ... :shakes:
Ignore all those "tests", it's just pure bull:banana::banana::banana::banana:. Compared to Thuban, the single-thread performance is higher and so is the multithread. Of course it's not slower.
The FPU is doubled up from the K10 generation. Intel also doubled the FPU with Sandy Bridge, but with the same core count. Also, I would call Bulldozer a 4-core design with 8-thread execution capability, so the four 256-bit FPUs match the CPU well.
AVX will only use 256-bit mode, so in all other situations, the flex-fp should shine, and of course with a lot of optimization.
Bulldozer will rock, just wait for the final release. Those tests are all fake or run with an earlier ES CPU, and are not representative.
Well I don't know what AMD is showing in their NDA docs but if this is the final performance then it's pretty bad indeed. They barely match their previous six core product.
All I know is that the chip VR-Zone ran on 2 different boards performed very differently. This could be due to different power delivery to the CPU, which in turn just limits it clock-wise and forces it to throttle down. The "better" of the two results posted (the one in their forums and not the one on the homepage) shows the 3.1GHz 8C to be roughly 25% slower than Thuban in wPrime. Still makes no sense if you ask me. The latest SiSoft leak shows Interlagos having at least 30% better results per clock and per core than MC in SIMD-heavy workloads. I don't know whether they optimized for FMA, but if they did just AVX then the results don't change much where Bulldozer is concerned (roughly 10% higher versus SSE2/3).
@JPQY
Link. The homepage results, which were even worse with the same CPU (on a different motherboard), are now gone.
FlexFP can do 2x128-bit FMUL or 2x128-bit FADD or 1x256-bit FADD or FMUL per cycle. Simultaneously it can execute up to 2x128-bit FP SIMD + 2 int/ALU SIMD instructions per cycle.
Throughput of 128-bit SSE or AVX FADD-only or FMUL-only instructions is 2x higher for a module than for an SB core. Combined 128-bit FADD + FMUL has the same throughput as an SB core, and combined 256-bit AVX FADD and FMUL has half the throughput of an SB core.
With FMA4 and XOP, using one 256-bit instruction, the two 128-bit FlexFP units can execute a 256-bit FADD and a 256-bit FMUL per cycle, the same as an SB core with AVX.
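The per-cycle figures claimed in the post above can be tabulated with a small sketch. The issue rates are taken from the post, not measured; the names `CASES`, `dp_flops` and `ratios` are made up for illustration, and double-precision (64-bit) lanes are assumed.

```python
from fractions import Fraction

DP_LANE_BITS = 64  # assumption: double-precision lanes


def dp_flops(instr_per_cycle, width_bits, flops_per_lane=1):
    """Peak DP flops per cycle for a given issue rate and vector width."""
    return instr_per_cycle * (width_bits // DP_LANE_BITS) * flops_per_lane


# (label, Bulldozer module flops/cycle, Sandy Bridge core flops/cycle)
# Issue rates are the ones claimed in the post, not measurements.
CASES = [
    ("128-bit FADD only or FMUL only", dp_flops(2, 128), dp_flops(1, 128)),
    ("128-bit FADD + FMUL mix", dp_flops(2, 128), dp_flops(2, 128)),
    ("256-bit AVX FADD + FMUL mix", dp_flops(1, 256), dp_flops(2, 256)),
    ("256-bit FMA4 vs AVX FADD + FMUL",
     dp_flops(1, 256, flops_per_lane=2), dp_flops(2, 256)),
]


def ratios():
    """Module throughput relative to one SB core, per case."""
    return [(label, Fraction(bd, sb)) for label, bd, sb in CASES]


for label, ratio in ratios():
    print(f"{label}: module / SB core = {ratio}")
```

This only compares theoretical peaks; it says nothing about whether the front end can keep the units fed.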
Per core, with FMA4 one 128-bit FlexFP has the same throughput as a K10 core's FPU. Throughput of FADD-only or FMUL-only instructions is also the same, but in combination a K10 core can execute up to two 128-bit SSEx instructions. In various combinations the K10 FPU is slower than one 128-bit FlexFP, but in extreme conditions it is faster.
The same code could run faster with FMA4 than with AVX because one instruction is issued for two different arithmetic operations, and it is executed on a single unit. Rounding is also more accurate with FMA4, and there is less register pressure because FMA4 is a four-operand instruction set. On BD, FMA4-optimised code could be much faster than AVX.
It can if the other unit is not busy, because the FlexFP is shared. It can also do two additional bitwise ALU FP operations with the two other units (pipe 2 and pipe 3). But if the units are fully utilised, one core can use only one FMAC and one MMX unit.
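The rounding point can be illustrated without FMA hardware. This is only a sketch: the fused operation is emulated with exact rational arithmetic and a single final rounding, and the function names are hypothetical. A separate multiply and add rounds twice; a fused multiply-add rounds once.

```python
from fractions import Fraction


def mul_then_add(a, b, c):
    # Two roundings: the product is rounded to a double, then the sum is.
    return a * b + c


def fused_mul_add(a, b, c):
    # Emulated fused operation: exact rational arithmetic, then one rounding
    # when converting back to a double.
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# The exact product is 1 - 2**-60, which does not fit in a double and rounds
# to 1.0, so the unfused result loses the small term entirely.
print(mul_then_add(a, b, c))
print(fused_mul_add(a, b, c))
```

With these inputs the unfused version returns 0.0 while the emulated fused version keeps the exact answer, -2**-60.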
FlexFP should shine in 256-bit mode with XOP and FMA instructions.
Quote:
AVX will only use 256-bit mode, so in all other situations, the flex-fp should shine, and of course with a lot of optimization.
:D
Quote:
Bulldozer will rock, just wait for the final release. Those tests are all fake or run with an earlier ES CPU, and are not representative.
AMD is trapped by its own bullshat. If you compare an FX-6100 (6-core @ 3.3 GHz according to AMD marketing terminology) with a 1100T (6-core @ 3.3 GHz), the 1100T will, of course, be faster. Why? Because the FX-6100 is a 3-core CPU w/ CMT and µarch tweaking. It just can't compete with a real 6-core CPU, even one with an old arch. You can translate the case to Intel to understand how stupid it is. If you compare a Core i5 750 (4 cores, 4 threads, 2.66 GHz, 1st gen arch) with a Core i3 2120T (2 cores with SMT, 4 threads, so "4 cores" by AMD terminology, 2.60 GHz, 2nd gen arch), the i5 750 will, of course, be much faster in an SMT environment, despite the presence of 4 threads in both cases.
The only problem with the SMT analogy is that in Intel's case you have the same number of execution units being shared by 2x the thread count. In AMD's case you have a dedicated hardware part in the silicon that takes care of each thread. So your "3C 6T CMT" chip in reality does have 6 integer cores and 6 FMAC units. None of those are shared between 2 threads (yes, I said 6 FMAC units, not 3 FlexFP units; there is a difference). So to sum it up: in AMD's case each thread has dedicated hardware (int and FP) in the chip which is equal to a core (more or less, who cares if the scaling is 1.8x), while in Intel's case each core is shared between the 2 (weak) threads.
Now that the difference is clearly defined, it really doesn't matter much for AMD that they have dedicated HW running each thread IF it does so at subpar performance compared to their previous design (Thuban). It will be regarded as subpar even to K10, let alone SB or Nehalem.
informal
First of all, the integer cluster and FPU are not the only things that make a core what it is; without the rest you can't do anything.
And as for your dedicated hardware parts:
HT uses the integer units because their ALUs aren't utilized to the max.
BD reduced the number of ALUs but gave each core a dedicated integer cluster because it doesn't use HT.
If you want to use AVX then you can't say it has a dedicated FMAC per core, only per module, so your point about dedicated hardware is flawed.
If you want a real 8-core and not just a 4-core + CMT (cluster-based multithreading), get an 8-module Interlagos and deactivate the second integer cluster in every module. That would be a true 8-core and not this hybrid.
freeloader, basically you are right. If it has the performance then I wouldn't care even if they name it based on the number of ALUs.
You just can't compare the hypothetical maximum bandwidth of a compute unit with a real, complete core. The term "core" as used by AMD is nonsense. What's a core? A core includes dispatch, prefetch, decoding and execution units. A core is not a cluster of integer ALUs/AGUs or an ALU alone. You have only 4 FP schedulers for an "8-core" Bulldozer, not 8. You have 4 I-caches, not 8. 4 blocks of L2 cache, not 8. The hypothetical maximum throughput of a unit is great information, but it is not representative of actual, real performance. If you don't feed the engine with a good decoder/dispatcher, the FP units will suck. They tried to copy the CMT architecture built by Alpha (Alpha never called that a "core") but they failed to implement it correctly due to .... well, I should shut up now :) That said, a big problem in the BD µarch will come from the integer unit.
Even in AVX you have dedicated HW per thread. In 128-bit AVX mode you have one FMAC doing one 128-bit AVX instruction. In 256-bit mode, when both cores have scheduled 256-bit FP instructions on the FP co-processor (the so-called FlexFP), you have one 256-bit instruction being done in 2 cycles for each core, that's all. So my logic is not flawed. Threads in Bulldozer have dedicated HW doing the work. But, like I said before, it won't matter much if the performance is not there. It will be regarded as a slow 8-core chip, something like 8 K7-level cores.
Is the X6 1100T faster than the i7 2600K? The 2600K has only 4 real cores, but the X6 with 6 real cores isn't faster than the 2600K. Your comparison with SMT isn't correct.
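The sharing described in the post above can be written as a toy throughput model. It assumes the module's two 128-bit FMACs are split evenly between the cores that are issuing FP work and ignores decode and scheduler limits; `cycles_per_instruction` is a hypothetical helper, not anything from AMD documentation.

```python
from fractions import Fraction

# Two 128-bit FMACs per module, as described in the posts.
MODULE_FP_BITS_PER_CYCLE = 2 * 128


def cycles_per_instruction(width_bits, active_cores):
    """Cycles each core needs per FP instruction when `active_cores` cores
    of one module all issue instructions of `width_bits` (toy model)."""
    if width_bits not in (128, 256):
        raise ValueError("width_bits must be 128 or 256")
    if active_cores not in (1, 2):
        raise ValueError("a module has 1 or 2 active cores")
    return Fraction(width_bits * active_cores, MODULE_FP_BITS_PER_CYCLE)


for width in (128, 256):
    for cores in (1, 2):
        cpi = cycles_per_instruction(width, cores)
        print(f"{width}-bit, {cores} active core(s): {cpi} cycle(s) per instruction")
```

Under these assumptions, two cores issuing 256-bit instructions each complete one every 2 cycles, while a lone core issuing 128-bit instructions can complete two per cycle.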
On average FX6100 should be significantly faster than 1100T, but may be slower in Prime95 or SSE128 optimised Linpack. However, with proper FMA optimisation FX6100 can do 24 DP FLOPS per cycle, same as Thuban with SSE128, but more flexible.
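The 24 DP FLOPS per cycle figure can be checked with simple arithmetic. The sketch assumes the FX-6100 has 3 modules with two 128-bit FMACs each, that an FMA counts as 2 flops per lane, and that each Thuban core has one 128-bit FADD pipe and one 128-bit FMUL pipe; the function name is made up for illustration.

```python
def peak_dp_flops_per_cycle(units, pipes_per_unit, width_bits=128, flops_per_lane=1):
    """Theoretical peak double-precision flops per cycle for the whole chip."""
    lanes = width_bits // 64  # 64-bit double-precision lanes per pipe
    return units * pipes_per_unit * lanes * flops_per_lane


# Assumption: 3 modules x 2 FMACs, each FMA counted as multiply + add.
fx6100 = peak_dp_flops_per_cycle(units=3, pipes_per_unit=2, flops_per_lane=2)

# Assumption: 6 cores x (1 FADD pipe + 1 FMUL pipe), one flop per lane.
thuban = peak_dp_flops_per_cycle(units=6, pipes_per_unit=2)

print(fx6100, thuban)
```

Both come out at 24, but the FX-6100 only reaches it with FMA code, whereas Thuban needs a balanced mix of adds and multiplies.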
I can tell you one thing. Even if we consider that pure integer speed maybe goes down a bit versus K10, those FP/SIMD numbers from the FX8120 really do seem odd and out of place. That's why I think something is going on with the platform itself. According to Interlagos' SiSoft Multimedia numbers, we are left with around 32% better FP performance than MC, per core and per clock. This has not been seen in other Zambezi leaks. On the contrary, we have seen 30%+ lower performance per core than Deneb/Thuban. I think that those Interlagos numbers are with AVX in mind, but again, AVX brings very little to Bulldozer performance since you have a fixed number of FP resources, which is practically the same in legacy and AVX workloads (peak flops are the same). I also doubt that SiSoft uses FMA for the Multimedia benchmark where Interlagos is concerned (just "plain" AVX).