AMD Zambezi news, info, fans !


09-09-2011, 10:30 PM
jimbo75

Quote:

Originally Posted by CrazyNutz

I did take that into consideration. However, the upgrade path to a 256-bit FPU should be a very good upgrade, since we know most organizations using a Cray will most likely benefit from it for their scientific research.
Also, we've only seen AMD handing over one box of CPUs to Cray. This may only be for a single Cray customer; how many others did not choose this upgrade?

http://www.hpcwire.com/hpcwire/2011-...ig_supers.html
09-10-2011, 12:04 AM
Oliverda

Fiery, the developer of the popular AIDA64, said that the whole BD microarchitecture is crappy and that the 8-core Zambezi can't even compete with the 6-core Thuban in some cases.
He wouldn't be surprised if Zambezi is never launched. It looks like they have an Interlagos sample.

link
09-10-2011, 12:48 AM
AKM

Going by OBR's test results, which I've always believed were genuine, Bulldozer doesn't perform that badly; it should beat Thuban in most cases.
09-10-2011, 12:53 AM
TESKATLIPOKA

Oliverda, he is comparing it to SB-E and not SB. Did you think some pseudo 8-core (more like 4 cores + HT) could compete with a $500+ chip with 6 cores (12 threads)? I wouldn't mind, but it was unrealistic from the start.

here is something interesting:
http://forums.anandtech.com/showpost...25&postcount=2

You can see the difference, so if the AIDA author has another Interlagos, as you said, it could perform like this one from September compared to the model tested in July, so the reality may differ quite a bit.

By the way, I didn't read everything in the original; I was too lazy and it's quite a hard language to read. But it didn't look like he knows any final numbers. He is just saying BD won't be a messiah destroying Intel's lineup with sheer power.
09-10-2011, 02:11 AM
undone

Quote:

Originally Posted by haylui

bro take a look at this
http://chinese.vr-zone.com/index.php...8120-09092011/

fx8120 benchmark by VR-ZONE

That's what I thought and said at #2133. The 8120 is scheduled for 2012 Q1, so why is it suddenly ahead of schedule and with lots of problems? They're accelerating the respins and changing the specs more frequently than before. I'm afraid AMD is playing a renaming game with us, and that the final naming scheme and specs will differ. Please, everyone, don't focus only on benchmarks.

http://www.bug.hr/_cache/bcf715ee2a6...8f8ed6aa70.jpg
09-10-2011, 02:12 AM
FlanK3r

This VR-Zone test is "sh1t"... It shows no real performance, because in wPrime it is slower than an Athlon X4 running at much lower clocks, and in R11.5 it is slower than an X6 1055T.
09-10-2011, 02:13 AM
haylui

Quote:

Originally Posted by informal

You mean this is shipping performance? B2 stepping, even if it's 6-7% slower or even 15% slower, sucks badly since it is slower than or barely equal to the 1100T. Rumored price from AMD themselves: $300. Rumored price from one dude who has them listed on his own site: $260. Today's 1100T price: $190 (it will go down after Zambezi launches). If, as you say, price reflects performance, then you will have 1100T performance (±10-15%) at a 30+% higher price. Is this logical?

In the link above (VR-Zone), just one glance at the C10 64-bit single-core test tells you something is off. You have a single Bulldozer core using the 256-bit FPU for itself and running at 4 GHz. It gets 3769 pts with some of the features turned off in the BIOS (the best result they managed). Now take a look at a single Thuban core running at 3.7 GHz in the same benchmark. It scores 4103 pts. That is 17% faster than what Zambezi would get at 3.7 GHz and still faster (8%) than what Zambezi gets at 4 GHz. This is the brand new, double-sized, improved, SMT-capable FlexFP, with free reg-reg moves (a no-cost instruction according to AMD) and a million other improvements versus K10? Yeah, call me crazy, but I don't think so.

Someone deleted my thread in VRFORUMS (English version), saying that they need to clarify with AMD what's wrong with the result. Maybe some L3 prefetching and CnQ microcode issue in the BIOS?
See the screenshot.
http://i54.tinypic.com/23keoap.jpg
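For anyone who wants to check the per-clock arithmetic in the quoted post, here is a minimal sketch. The scores, clocks and prices are the ones quoted above; the linear clock scaling and the helper names (`scale_to_clock`, `percent_more`) are assumptions made only for illustration.

```python
# Figures quoted in the post: Cinebench R10 64-bit single-core scores.
ZAMBEZI_SCORE, ZAMBEZI_GHZ = 3769, 4.0
THUBAN_SCORE, THUBAN_GHZ = 4103, 3.7


def scale_to_clock(score, from_ghz, to_ghz):
    """Rescale a score to another clock, assuming perfectly linear clock scaling."""
    return score * to_ghz / from_ghz


def percent_more(a, b):
    """How many percent larger a is than b."""
    return (a / b - 1.0) * 100.0


# Zambezi's score if it ran at Thuban's 3.7 GHz (assumption: linear scaling).
zambezi_at_thuban_clock = scale_to_clock(ZAMBEZI_SCORE, ZAMBEZI_GHZ, THUBAN_GHZ)

lead_same_clock = percent_more(THUBAN_SCORE, zambezi_at_thuban_clock)
lead_as_tested = percent_more(THUBAN_SCORE, ZAMBEZI_SCORE)

# Rumored Zambezi prices versus the quoted 1100T street price.
premium_low = percent_more(260, 190)
premium_high = percent_more(300, 190)

print(f"Thuban lead at equal clock: {lead_same_clock:.1f}%")
print(f"Thuban lead as tested:      {lead_as_tested:.1f}%")
print(f"Price premium: {premium_low:.1f}% to {premium_high:.1f}%")
```

Under these assumptions the leads come out just under 18% and 9%, in line with the 17% and 8% quoted, and both rumored prices sit above the "30+%" premium mentioned.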
09-10-2011, 02:37 AM
xsecret

The issue with previous stepping seems related to power and only power, not internal µarch.

Anyway, I just got my hands on a final, retail CPU with the retail box. The box art is similar to the ones leaked some months ago, except they changed the colors and replaced the LGA CPU with a PGA one on the front. As for performance, nothing has changed. So, if this CPU doesn't have "shipping performance", the fix will come post-launch. And that would suck a lot.
09-10-2011, 02:42 AM
informal

Quote:

Originally Posted by JumpingJack

I don't quite follow, but I am likely not thinking about it right. First question, has CB been recompiled for AVX? Second question, if not, then CB will take advantage of 8 128 bit FPUs will it not?

Jack, it's rather simple. If you run a single thread that is SIMD heavy on a module, all FPU resources (2x FMAC) will be dedicated to one core. Then, if you run the same test MT across all 4 modules, all 8 cores will share 4 FlexFPs (4 256-bit units in total). You will have scaling from ST to MT akin to scaling from a single core to a QC, only this time your single-thread results SHOULD be very high, as you are using one double-sized, very powerful FPU (the whole FlexFP within a module). Scaling is better than pure 4x, since now SMT works within the FlexFP, as the 2 cores per module share 2 128-bit FMACs, and this improves performance additionally.
The problem, however, is that one FlexFP is somehow slower than one Thuban core.
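A rough way to picture the ST-to-MT scaling argument above is to count how many FMACs each thread can reach. This is only a sketch: the 2-FMACs-per-module figure is taken from the post, the even split between two threads is an assumption, and the function names are made up for illustration. It gives the peak-resource ratio only, so it cannot show the "better than pure 4x" effect, which depends on how well one thread actually fills both FMACs.

```python
from fractions import Fraction

FMACS_PER_MODULE = 2  # two 128-bit FMACs per FlexFP, as described in the post


def fmacs_per_thread(threads_on_module):
    """FMACs available to each thread if a module's FlexFP is split evenly (assumption)."""
    return Fraction(FMACS_PER_MODULE, threads_on_module)


def peak_fp_scaling(modules, threads_per_module=2):
    """FMACs in use with every thread loaded, relative to one lone thread owning a module."""
    single_thread = fmacs_per_thread(1)
    all_threads = modules * threads_per_module * fmacs_per_thread(threads_per_module)
    return all_threads / single_thread


print(fmacs_per_thread(1))  # 2: a lone thread gets the whole FlexFP
print(fmacs_per_thread(2))  # 1: two busy threads get one FMAC each
print(peak_fp_scaling(4))   # 4: 8 threads on 4 modules vs one thread
```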

@xsecret
Well, VR-Zone just ran the same chip on 2 different motherboards, xsecret. Guess what: on different boards, the same chip performed with an 80% delta... So it seems to be a firmware problem.
09-10-2011, 03:10 AM
JPQY

Quote:

Originally Posted by informal

@xsecret
Well, VR-Zone just ran the same chip on 2 different motherboards, xsecret. Guess what: on different boards, the same chip performed with an 80% delta... So it seems to be a firmware problem.

Hi informal, do you have a link to this test?

Thanks,
JP.
09-10-2011, 03:14 AM
xsecret

Quote:

Originally Posted by informal

@xsecret
Well, VR-Zone just ran the same chip on 2 different motherboards, xsecret. Guess what: on different boards, the same chip performed with an 80% delta... So it seems to be a firmware problem.

Well, all those "performance is not final! not shipping performance! don't trust leaks!" claims now seem to be bullshat and a way for AMD to prevent leaks. I have had the same performance with the B2 chip for weeks and, more importantly, the benchmarks published under NDA by AMD itself are really close to that.
09-10-2011, 03:19 AM
Olivon

That smells really really bad ... :shakes:
09-10-2011, 03:27 AM
2good4you

Quote:

Originally Posted by informal

Jack, it's rather simple. If you run a single thread that is SIMD heavy on a module, all FPU resources (2x FMAC) will be dedicated to one core. Then, if you run the same test MT across all 4 modules, all 8 cores will share 4 FlexFPs (4 256-bit units in total). You will have scaling from ST to MT akin to scaling from a single core to a QC, only this time your single-thread results SHOULD be very high, as you are using one double-sized, very powerful FPU (the whole FlexFP within a module). Scaling is better than pure 4x, since now SMT works within the FlexFP, as the 2 cores per module share 2 128-bit FMACs, and this improves performance additionally.
The problem, however, is that one FlexFP is somehow slower than one Thuban core.

@xsecret
Well, VR-Zone just ran the same chip on 2 different motherboards, xsecret. Guess what: on different boards, the same chip performed with an 80% delta... So it seems to be a firmware problem.

Ignore all those "tests", it's just pure bull:banana::banana::banana::banana:. Compared to Thuban, both single-thread and multithread performance are higher. Of course it's not slower.

The FPU is doubled up from the K10 generation. Intel also doubled the FPU with Sandy Bridge, but with the same core count. Also, I would call Bulldozer a 4-core design with 8-thread execution capability, so the 4 256-bit FPUs match the CPU well.

AVX will only use 256-bit mode, so in all other situations, the flex-fp should shine, and of course with a lot of optimization.

Bulldozer will rock; wait for the final release. Those tests are all fake or done with an earlier ES CPU and are not representative.
09-10-2011, 03:48 AM
informal

Quote:

Originally Posted by xsecret

Well, all those "performance is not final! not shipping performance! don't trust leaks!" claims now seem to be bullshat and a way for AMD to prevent leaks. I have had the same performance with the B2 chip for weeks and, more importantly, the benchmarks published under NDA by AMD itself are really close to that.

Well, I don't know what AMD is showing in their NDA docs, but if this is the final performance then it's pretty bad indeed. They barely match their previous six-core product.
All I know is that the chip VR-Zone ran on 2 different boards performed very differently. This could be due to different power delivery to the CPU, which in turn limits it clock-wise and forces it to throttle down. The "better" of the two results posted (the one in their forums and not the one on the homepage) shows the 3.1 GHz 8C to be roughly 25% slower than Thuban in wPrime. That still makes no sense if you ask me. The latest SiSoft leak shows Interlagos having at least 30% better results per clock and per core than MC in SIMD-heavy workloads. I don't know whether they optimized for FMA, but if they did just AVX then the results don't change much where Bulldozer is concerned (roughly 10% higher versus SSE2/3).

@JPQY
Link. The homepage results, which were even worse with the same CPU (on a different motherboard), are now gone.
09-10-2011, 03:50 AM
xsecret

Quote:

Originally Posted by 2good4you

so in all other situations, the flex-fp should shine, and of course with a lot of optimization.

One core is not able to do 2x128-bit FP operation...
09-10-2011, 03:54 AM
drfedja

Quote:

Originally Posted by 2good4you

The FPU is doubled up from the K10 generation. Intel also doubled the FPU with Sandy Bridge, but with the same core count. Also, I would call Bulldozer a 4-core design with 8-thread execution capability, so the 4 256-bit FPUs match the CPU well.

FlexFP can do 2x 128-bit FMUL, or 2x 128-bit FADD, or 1x 256-bit FADD or FMUL per cycle. Simultaneously, it can execute up to 2x 128-bit FP SIMD + 2 int/ALU SIMD instructions per cycle.
Throughput of 128-bit SSE or AVX FADD or FMUL instructions is 2x higher for a module than for an SB core. Combined 128-bit FADD + FMUL has the same throughput as an SB core, and combined 256-bit AVX FADD and FMUL has half the throughput of an SB core.
With FMA4 and XOP, using one 256-bit instruction, the two 128-bit FlexFP units can execute a 256-bit FADD and a 256-bit FMUL per cycle, the same as an SB core with AVX.
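The module-versus-SB-core ratios above can be reproduced with a very simplified issue model. This is only a sketch that encodes the figures stated in this post (two 128-bit FlexFP pipes that each take FADD or FMUL, against one FADD port plus one FMUL port per SB core); it ignores scheduling, latency and memory, and the function names are invented for illustration.

```python
from fractions import Fraction


def bd_module_ops_per_cycle(width_bits):
    """Bulldozer module as described in the post: two 128-bit pipes, each taking
    an FADD or an FMUL; a 256-bit op occupies both pipes."""
    if width_bits not in (128, 256):
        raise ValueError("width must be 128 or 256")
    return Fraction(2 * 128, width_bits)


def sb_core_ops_per_cycle(mix):
    """Sandy Bridge core in this simplified model: one FADD port and one FMUL port,
    each issuing one 128- or 256-bit op per cycle."""
    ports_used = {"fadd": 1, "fmul": 1, "fadd+fmul": 2}
    return Fraction(ports_used[mix])


def bd_vs_sb(width_bits, mix):
    """Module throughput relative to one SB core for a given width and instruction mix."""
    return bd_module_ops_per_cycle(width_bits) / sb_core_ops_per_cycle(mix)


def bd_module_fma_lanes_per_cycle():
    """128-bit arithmetic lanes per cycle with FMA: 2 pipes x (1 mul + 1 add)."""
    return 2 * 2


def sb_core_avx_lanes_per_cycle():
    """128-bit arithmetic lanes per cycle with 256-bit AVX: 2 ports x 2 lanes."""
    return 2 * 2


print(bd_vs_sb(128, "fadd"))       # 2
print(bd_vs_sb(128, "fadd+fmul"))  # 1
print(bd_vs_sb(256, "fadd+fmul"))  # 1/2
print(bd_module_fma_lanes_per_cycle() == sb_core_avx_lanes_per_cycle())  # True
```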

Per core, with FMA4, one 128-bit FlexFP unit has the same throughput as a K10 core's FPU. Throughput of FADD OR FMUL type instructions is also the same, but in combination a K10 core can execute up to two 128-bit SSEx instructions. In various combinations the K10 FPU is slower than one 128-bit FlexFP unit, but in extreme conditions it is faster.

The same code could run faster with FMA4 than with AVX, because one instruction is issued for two different arithmetic operations and is executed on a single unit. Rounding is also more accurate with FMA4, and there is less register pressure because FMA4 is a four-operand instruction set. On BD, FMA4-optimised code could be much faster than AVX.

Quote:

Originally Posted by xsecret

One core is not able to do 2x128-bit FP operation...

It can if the other unit is not busy, because FlexFP is shared. It can also do two additional bitwise ALU FP operations with the two other units (pipe 2 and pipe 3). But if the units are fully utilised, one core can use only one FMAC and one MMX unit.

Quote:

AVX will only use 256-bit mode, so in all other situations, the flex-fp should shine, and of course with a lot of optimization.

FlexFP should shine in 256-bit mode with XOP and FMA instructions.

Quote:

Bulldozer will rock; wait for the final release. Those tests are all fake or done with an earlier ES CPU and are not representative.

:D
09-10-2011, 04:26 AM
xsecret

AMD is trapped by its own bullshat. If you compare an FX-6100 (6-core @ 3.3 GHz according to AMD marketing terminology) with a 1100T (6-core @ 3.3 GHz), the 1100T will, of course, be faster. Why? Because the FX-6100 is a 3-core CPU w/ CMT and µarch tweaking. It just can't compete with a real 6-core CPU, even one with an old arch. You can translate the case to Intel to understand how stupid it is. If you compare a Core i5 750 (4 cores, 4 threads, 2.66 GHz, 1st gen arch) with a Core i3 2120T (2 cores with SMT, 4 threads, so "4-cores" by AMD terminology, 2.60 GHz, 2nd gen arch), the i5 750 will, of course, be much faster in an SMT environment, despite the presence of 4 threads in both cases.
09-10-2011, 04:42 AM
informal

The only problem with the SMT analogy is that in Intel's case you have the same number of execution units being shared by 2x the thread count. In AMD's case you have a dedicated hardware part in the silicon that takes care of each thread. So your "3C 6T CMT" chip in reality does have 6 integer cores and 6 FMAC units. None of those are shared between 2 threads (yes, I said 6 FMAC units, not 3 FlexFP units; there is a difference). So to sum it up: in AMD's case each thread has dedicated hardware (int and fp) in the chip, which is equal to a core (more or less; who cares if the scaling is 1.8x), while in Intel's case each core is shared between the 2 (weak) threads.
Now that the difference is clearly defined, it really doesn't matter much for AMD if they have dedicated HW that runs the thread IF it does so at subpar performance compared to their previous design (Thuban). It will be regarded as subpar even to K10, let alone SB or Nehalem.
09-10-2011, 04:48 AM
freeloader

Quote:

Originally Posted by informal

The only problem with the SMT analogy is that in Intel's case you have the same number of execution units being shared by 2x the thread count. In AMD's case you have a dedicated hardware part in the silicon that takes care of each thread. So your "3C 6T CMT" chip in reality does have 6 integer cores and 6 FMAC units. None of those are shared between 2 threads (yes, I said 6 FMAC units, not 3 FlexFP units; there is a difference). So to sum it up: in AMD's case each thread has dedicated hardware (int and fp) in the chip, which is equal to a core (more or less; who cares if the scaling is 1.8x), while in Intel's case each core is shared between the 2 (weak) threads.
Now that the difference is clearly defined, it really doesn't matter much for AMD if they have dedicated HW that runs the thread IF it does so at subpar performance compared to their previous design (Thuban). It will be regarded as subpar even to K10, let alone SB or Nehalem.

That's all that really matters: the performance. If BD is not even faster than Thuban, then AMD is going to lose many lifelong customers to Intel, myself included. At this point in time, it's not looking too rosy for BD.
09-10-2011, 05:08 AM
TESKATLIPOKA

informal
First of all, the integer cluster and FPU are not the only things that make a core what it is; without the rest you can't do anything.
And as for your dedicated hardware parts:
HT uses the integer units because their ALUs aren't fully utilized.
BD reduced the number of ALUs but made a dedicated integer cluster per core because it doesn't use HT.
If you want to use AVX, then you can't say it has a dedicated FMAC per core, only per module, so your point about dedicated hardware is flawed.

If you want a real 8-core and not just a 4-core + CMT (cluster-based multithreading), get an 8-module Interlagos and deactivate the second integer cluster in every module. That would be a true 8-core and not this hybrid.
09-10-2011, 05:23 AM
TESKATLIPOKA

freeloader, basically you are right. If it has the performance, then I wouldn't care even if they name it based on the number of ALUs.
09-10-2011, 05:24 AM
xsecret

Quote:

Originally Posted by informal

The only problem with the SMT analogy is that in Intel's case you have the same number of execution units being shared by 2x the thread count. In AMD's case you have a dedicated hardware part in the silicon that takes care of each thread. So your "3C 6T CMT" chip in reality does have 6 integer cores and 6 FMAC units. None of those are shared between 2 threads (yes, I said 6 FMAC units, not 3 FlexFP units; there is a difference). So to sum it up: in AMD's case each thread has dedicated hardware (int and fp) in the chip, which is equal to a core (more or less; who cares if the scaling is 1.8x), while in Intel's case each core is shared between the 2 (weak) threads.
Now that the difference is clearly defined, it really doesn't matter much for AMD if they have dedicated HW that runs the thread IF it does so at subpar performance compared to their previous design (Thuban). It will be regarded as subpar even to K10, let alone SB or Nehalem.

You just can't compare the hypothetical maximum bandwidth of a compute unit with a real, complete core. The term "core" as used by AMD is nonsense. What's a core? A core includes dispatch, prefetch, decoding and execution units. A core is not a cluster of integer ALUs/AGUs or an ALU alone. You have only 4 FP schedulers for an "8-core" Bulldozer, not 8. You have 4 I-caches, not 8. 4 blocks of L2 cache, not 8. The hypothetical maximum throughput of a unit is great information, but it is not representative of actual, real performance. If you don't feed the engine with a good decoder/dispatcher, the FP units will suck. They tried to copy the CMT architecture built by Alpha (Alpha never called that a "core") but they failed to implement it correctly due to .... well, I should shut up now :) That said, a big problem in the BD µarch will come from the integer unit.
09-10-2011, 05:27 AM
informal

Quote:

Originally Posted by TESKATLIPOKA

informal
First of all, the integer cluster and FPU are not the only things that make a core what it is; without the rest you can't do anything.
And as for your dedicated hardware parts:
HT uses the integer units because their ALUs aren't fully utilized.
BD reduced the number of ALUs but made a dedicated integer cluster per core because it doesn't use HT.
If you want to use AVX, then you can't say it has a dedicated FMAC per core, only per module, so your point about dedicated hardware is flawed.

If you want a real 8-core and not just a 4-core + CMT (cluster-based multithreading), get an 8-module Interlagos and deactivate the second integer cluster in every module. That would be a true 8-core and not this hybrid.

Even in AVX you have dedicated HW per thread. In 128-bit AVX mode you have one FMAC doing one 128-bit AVX instruction. In 256-bit mode, when both cores have scheduled 256-bit FP instructions on the FP co-processor (the so-called FlexFP), you have one 256-bit instruction being done in 2 cycles for each core, that's all. So my logic is not flawed. Threads in Bulldozer have dedicated HW doing the work. But, like I said before, it won't matter much if the performance is not there. It will be regarded as a slow 8-core chip, something like 8 K7-level cores.
09-10-2011, 05:33 AM
drfedja

Quote:

Originally Posted by xsecret

AMD is trapped by its own bullshat. If you compare an FX-6100 (6-core @ 3.3 GHz according to AMD marketing terminology) with a 1100T (6-core @ 3.3 GHz), the 1100T will, of course, be faster. Why? Because the FX-6100 is a 3-core CPU w/ CMT and µarch tweaking. It just can't compete with a real 6-core CPU, even one with an old arch. You can translate the case to Intel to understand how stupid it is. If you compare a Core i5 750 (4 cores, 4 threads, 2.66 GHz, 1st gen arch) with a Core i3 2120T (2 cores with SMT, 4 threads, so "4-cores" by AMD terminology, 2.60 GHz, 2nd gen arch), the i5 750 will, of course, be much faster in an SMT environment, despite the presence of 4 threads in both cases.

Is the X6 1100T faster than the i7 2600K? The 2600K has only 4 real cores, but the X6 with 6 real cores isn't faster than the 2600K. Your comparison with SMT isn't correct.
On average the FX6100 should be significantly faster than the 1100T, but it may be slower in Prime95 or SSE128-optimised Linpack. However, with proper FMA optimisation the FX6100 can do 24 DP FLOPS per cycle, the same as Thuban with SSE128, but more flexibly.
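The 24 DP FLOPS per cycle figure can be checked by counting lanes. A minimal sketch, assuming 3 modules with two 128-bit FMACs each for the FX6100 and one 128-bit FADD pipe plus one 128-bit FMUL pipe per Thuban core; the function names are made up for illustration and these are theoretical peaks, not measured results.

```python
def dp_lanes(width_bits):
    """Number of 64-bit double-precision values in a vector of the given width."""
    return width_bits // 64


def bulldozer_peak_dp_flops(modules, fmacs_per_module=2, fmac_width=128, fma=True):
    """Peak DP FLOPs per cycle: an FMA counts as 2 FLOPs per lane, a plain
    add or multiply as 1."""
    flops_per_lane = 2 if fma else 1
    return modules * fmacs_per_module * dp_lanes(fmac_width) * flops_per_lane


def k10_peak_dp_flops(cores, pipe_width=128):
    """Peak DP FLOPs per cycle: one FADD pipe plus one FMUL pipe per core."""
    return cores * 2 * dp_lanes(pipe_width)


print(bulldozer_peak_dp_flops(modules=3))             # 24 (FX6100 with FMA)
print(k10_peak_dp_flops(cores=6))                     # 24 (Thuban with SSE128)
print(bulldozer_peak_dp_flops(modules=3, fma=False))  # 12 (FX6100 without FMA)
```

In this model, without FMA the same three modules peak at half the Thuban figure, which fits the remark that SSE128-optimised code may run slower on the FX6100.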
09-10-2011, 05:56 AM
informal

I can tell you one thing. Even if we consider that pure integer speed maybe goes down a bit versus K10, those FP/SIMD numbers from the FX8120 really do seem odd and out of place. That's why I think something is going on with the platform itself. According to Interlagos' SiSoft Multimedia numbers, we are left with around 32% better FP performance than MC, per core and per clock. This has not been seen in other Zambezi leaks. On the contrary, we have seen 30%+ lower performance per core than Deneb/Thuban. I think that those Interlagos numbers are with AVX in mind, but again, AVX brings very little to Bulldozer performance, since you have a fixed number of FP resources which is practically the same in legacy and AVX workloads (peak flops are the same). I also doubt that SiSoft uses FMA for the Multimedia benchmark where Interlagos is concerned (just "plain" AVX).
