AMD Zambezi news, info, fans !

**muziqaz** · 08-07-2011, 03:31 PM

Originally Posted by Lightman

That's the famous person banned on XS; we shouldn't even link to his blog here. [OBR, if you're confused]
There are more pics but don't bother looking, because they might be fake.

Man, why do people even bother posting anything from him? He admitted faking BD tests a while ago.
And to answer freeloader's question: FPS are so low because they are from OBR's imagination.
It is hard enough to imagine that a Phenom II X6 would be twice as slow as a 6-core i7.

**2good4you** · 08-07-2011, 09:13 PM

OBR is a piece of [...]. 100% fake, and he has too much spare time for making all those bs images... He does NOT have a Bulldozer sample, and never has had one. Wait for the CPU and don't waste so much energy on speculating.

Best regards from Sweden!

**Smartidiot89** · 08-07-2011, 11:03 PM

Giving OBR attention is like feeding a troll; they will just continue. Every time someone talks about OBR, Bulldozer IPC goes down, so please stop it.

**undone** · 08-07-2011, 11:54 PM

Man, every post that mentioned THAT IDIOT has been deleted by the admins in this forum. Don't waste your mind and resources discussing anything about him.

EDIT:

Seems some guys emulated an FX-6110 using an ES FX-8130P and compared it to the 2600K and 1100T. I don't have time to go through every result, and don't know whether it reveals any problem.

http://www.f-paper.com/?i708835-Phot...lation-testing

**PatRaceTin** · 08-08-2011, 04:01 AM

september / october

**freeloader** · 08-08-2011, 05:41 AM

I've read the article. It doesn't look so good for Bulldozer single-thread performance. I hope for AMD's sake that they've concentrated more on IPC than on simply adding more cores. I guess we will know in about five to six weeks.

**muziqaz** · 08-08-2011, 06:07 AM

Originally Posted by undone

Seems some guys emulated an FX-6110 using an ES FX-8130P and compared it to the 2600K and 1100T. I don't have time to go through every result, and don't know whether it reveals any problem.

http://www.f-paper.com/?i708835-Phot...lation-testing

Did I understand it correctly? They took a Phenom II X6, overclocked it, gave it 1866MHz RAM and called it an FX-6110? Man, that's even worse than OBR.

It's like simulating a Core 2 Duo using a Pentium 4 :/

**drfedja** · 08-08-2011, 06:09 AM

A BD module may have 20% better IPC than a K10.5 core, because of better memory reordering, a faster cache hierarchy, a wider front end, beefier branch prediction, a bigger L2 cache, and a faster memory controller and L3 cache. Also, a BD module can execute 2 ALU, 2 AGU, 2 intSSE and 2 fpSSE operations per cycle per thread. The BD module is a 4-issue design versus 3-issue K10.5. The L1D cache is write-through and smaller than K10's, but the WT performance penalty is compensated by the WCC (Write Coalescing Cache). L1D is also 4 times smaller than on K10.5, but it is 4-way associative, and a 16K 4-way WT cache may have a 92% hit rate; the 64K 2-way L1D has a slightly better hit rate, about 94%. But that isn't a problem, because the branch predictor is better, and L2 is bigger and has a better hit rate. The 4-cycle load-to-use latency is hidden by the 2-4 stage longer pipeline.
The BD pipeline is optimised for 15-20% higher frequency than the K10 pipeline at the same process node. Because of that, with Turbo Core 2, BD single-thread performance may be much better than K10.5, and roughly on par with Sandy Bridge.
Multithread performance isn't that much better. It may be 35-45% better than the six-core Thuban, and maybe a little lower than 6-core SB-E and Westmere.
FX8 has 8 small cores but SB-E has six fat, hyperthreaded cores with at least 30-35% better IPC. However, FX8 with 4 modules may have low TDP and high frequency, much higher than the six-core i7.
My assumption is that the 4-module BD has 10% better multithread performance than 4-core SB at the same thermals, if we account for Amdahl's law, Turbo Core 2, memory bandwidth and latency, and shared module resources. Single-thread performance may be on par with SB in the same thermal envelope.
In comparison with Westmere, the 4-module BD has 10-15% lower multithread performance. Maybe the 10-core Komodo will outpace the six-core Westmere and be on par with SB-E.
Here is my little study on predicting BD performance.

Originally Posted by muziqaz

Did I understand it correctly? They took a Phenom II X6, overclocked it, gave it 1866MHz RAM and called it an FX-6110? Man, that's even worse than OBR.

It's like simulating a Core 2 Duo using a Pentium 4 :/

LOL!

My predictions are based on math. Lightly multithreaded software has lower parallelisation, around 0.7, and heavily multithreaded code has 0.95 parallelisation. FP-intensive code also contains a lot of integer code. Cinebench, for example, has 0.6 IPC for integer and 0.7 IPC for FP on a K10 core. The FPU is underutilised, because the max for FP on a K10 core is 2 packed FP-SSE operations or 2x integer SSE and shuffle. With a BD module the max is 2 packed FP SSE or FMA + 2 integer SSE or FMA or 1 int SSE + 1 shuffle.
As is easily seen, the FP core is underutilised at IPC=0.7. With 2 int threads, FP IPC for both threads can go up to 2, but hardly more.
Per-thread FP IPC can be 0.8, and int IPC can be 0.7. That is 15-20% better per-core IPC than K10. With 33% more cores, 15% better average IPC, and 15% higher frequency, overall multithread performance in FP-intensive applications can be up to 55-60% better than Thuban. Also, if IPC per core is the same or a little lower because of shared resources, then with higher frequency and more cores performance in such FP-intensive applications can rise by up to 40%, which is good.

**Manicdan** · 08-08-2011, 06:59 AM

10% better in multithreading than SB 4c/8t sounds too low.
You state that single-thread perf will be much higher than Thuban, but keep in mind that if it's 10% faster clocks, 10% higher IPC (after the loss due to 2 cores in one module), and 33% more cores than Thuban, it should be 60% faster than Thuban in multithreading (think CB11.5).

if you believe that 1 core vs 1 core of BD vs SB will be pretty close to each other, then why do you think 8 cores will struggle against 8 threads?
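
For reference, the 60% figure is just the three gains multiplied together, assuming perfect scaling with core count (no Amdahl penalty). A quick Python sketch of that arithmetic; the function name is made up for illustration and the inputs are the guesses from this post, not measurements:

```python
def combined_speedup(clock_gain, ipc_gain, core_gain):
    """Multiply independent fractional gains into one overall speedup factor.

    Assumes the workload scales perfectly with core count, which is the
    optimistic case (no serial portion).
    """
    return (1 + clock_gain) * (1 + ipc_gain) * (1 + core_gain)


# 10% faster clocks, 10% higher IPC, 8 cores vs 6 (33% more cores)
speedup = combined_speedup(0.10, 0.10, 8 / 6 - 1)
print(f"{speedup:.2f}x Thuban")  # about 1.61x, i.e. roughly 60% faster
```

Any serial portion in the benchmark would pull the core-count term below 1.33, so this is an upper bound under those assumptions.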

**charged3800z24** · 08-08-2011, 07:12 AM

They did more to BD than just increase clock speed. And I didn't see what they did with the turbo for the X6. Did they have it off?

**drfedja** · 08-08-2011, 08:57 AM

Originally Posted by Manicdan

10% better in multithreading than SB 4c/8t sounds too low.
You state that single-thread perf will be much higher than Thuban, but keep in mind that if it's 10% faster clocks, 10% higher IPC (after the loss due to 2 cores in one module), and 33% more cores than Thuban, it should be 60% faster than Thuban in multithreading (think CB11.5).

if you believe that 1 core vs 1 core of BD vs SB will be pretty close to each other, then why do you think 8 cores will struggle against 8 threads?

Depends on the type of workload. If module utilisation is high, the difference will be lower. 8-core BD has only 4 FPU units.
Compared to Thuban, IPC per module could be 20% higher, but IPC per core may be equal or a little higher, something around 10%. Performance doesn't scale linearly with core count, because of Amdahl's law.
For 33% more cores, with Amdahl's law and CB parallelisation of 95%, with same-IPC cores you can squeeze out only 23.5% more performance. As core and thread counts grow, performance converges to a constant value.
For a 20% IPC improvement for a BD module vs a K10 core, there could be an 8-9% improvement for BD core IPC, if we take BD core IPC = 0.9 x BD module IPC per thread.
The calculation for CB is 1.09 (IPC) x 1.1 (frequency) x 1.23 (core scaling) = 1.47x Thuban 1100T. With lower parallelisation or higher module utilisation, the difference could be lower, close to 35%.
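
The core-scaling number above can be reproduced from Amdahl's law directly. A minimal Python sketch, assuming the 95% parallel fraction for CB and the 1.09 / 1.1 factors stated in this post (all of them estimates, and the function name is only illustrative):

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup over one core for a given parallel fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


p = 0.95  # assumed CB parallelisation

# Gain from going 6 -> 8 cores at equal per-core performance
core_scaling = amdahl_speedup(p, 8) / amdahl_speedup(p, 6)

# Per-core IPC gain x frequency gain x core scaling, relative to 1100T
total = 1.09 * 1.10 * core_scaling

print(f"core scaling 6 -> 8 cores: {core_scaling:.3f}")
print(f"estimate vs 1100T:        {total:.2f}x")
```

The unrounded core scaling comes out slightly above 1.23, so the product lands a hair above the 1.47x quoted here, which used the rounded 1.23.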

**BeepBeep2** · 08-08-2011, 09:07 AM

Originally Posted by muziqaz

Did I understand it correctly? They took a Phenom II X6, overclocked it, gave it 1866MHz RAM and called it an FX-6110? Man, that's even worse than OBR.

It's like simulating a Core 2 Duo using a Pentium 4 :/

X6 at 3.8 and FX-8130P at 6 cores 3.8.

**Manicdan** · 08-08-2011, 09:16 AM

BD has 8x 128bit FPUs, OR 4x 256bit FPUs

funny how you come to 1.47x when you already factored in the inefficiencies, then drop it down to 35% just because?

**muziqaz** · 08-08-2011, 09:26 AM

Originally Posted by BeepBeep2

X6 at 3.8 and FX-8130P at 6 cores 3.8.

So why not include the 8130's performance at 3.8GHz?

**drfedja** · 08-08-2011, 09:50 AM

The BD FPU can issue up to four instructions, but the front end can issue a maximum of 4 instructions + branch fusion. With two threads and high ILP, a BD core can retire up to 2 instructions per cycle, which in reality can't go over 1.6-1.8. For example, a Phenom II core can reach 2.4 IPC in Linpack out of a max of 3 IPC. A BD module can probably reach up to 3.5-3.6 IPC with two threads, which is 1.7-1.8 IPC per thread. With such a heavy workload, a BD core can issue fewer instructions than a K10.5 core. But that is the exception rather than the rule. In that case, 8-core BD (4 modules) can retire up to 14.4 instructions per cycle. Phenom II X6 can retire the same 14.4 IPC with six cores.
For example, CB10 has 1.3 IPC on a K10 core, but CB10 can reach 1.5 IPC on an SB core. This is 50% faster than Phenom II X6 in such a workload with 33% more cores.
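
The two 14.4 totals for the high-ILP case are simple per-unit IPC times unit count. A small Python check, using the upper-end 3.6 IPC per module estimate and the 2.4 IPC Linpack figure from above (both are this post's estimates, not measured values):

```python
bd_modules = 4
bd_ipc_per_module = 3.6   # upper end of the 3.5-3.6 estimate, two threads
k10_cores = 6
k10_ipc_per_core = 2.4    # Linpack figure quoted for a Phenom II core

bd_total = bd_modules * bd_ipc_per_module
k10_total = k10_cores * k10_ipc_per_core

print(f"4-module BD:  {bd_total:.1f} instructions/cycle")
print(f"Phenom II X6: {k10_total:.1f} instructions/cycle")
```

So at equal clocks the chip-level retire rate is a tie in that worst case, and any BD advantage there would have to come from frequency.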

Originally Posted by Manicdan

BD has 8x 128bit FPUs, OR 4x 256bit FPUs

funny how you come to 1.47x when you already factored in the inefficiencies, then drop it down to 35% just because?

I have corrected my calculations. There was an error.

~30% is for high-ILP code, with IPC up to 2 per thread, or ~50% difference with lower ILP.
Attachment 118752

I've done a simulation sheet for a six-core 3.2 GHz 6120P 95W BD. It is 0-35% faster than Thuban 1100T. IPC is a little higher because of the lower frequency; IPC doesn't scale linearly with frequency.
Attachment 118753

**SEA** · 08-08-2011, 01:14 PM

Originally Posted by undone

Man, every post that mentioned THAT IDIOT has been deleted by the admins in this forum. Don't waste your mind and resources discussing anything about him.

EDIT:
Seems some guys emulated an FX-6110 using an ES FX-8130P and compared it to the 2600K and 1100T. I don't have time to go through every result, and don't know whether it reveals any problem.

http://www.f-paper.com/?i708835-Phot...lation-testing

Agreed on mentioning the "unmentionable" )
but that link is no better:

Attachment 118755

**TESKATLIPOKA** · 08-09-2011, 04:20 AM

I made a small comparison between Core duo, SB, K8, K10 and BD.
https://public.sheet.zoho.com/publis...v-architekture
If something is wrong just comment and I will correct the mistakes.

**freeloader** · 08-09-2011, 04:39 AM

Originally Posted by TESKATLIPOKA

I made a small comparison between Core duo, SB, K8, K10 and BD.
https://public.sheet.zoho.com/publis...v-architekture
If something is wrong just comment and I will correct the mistakes.

You may want to add L1, L2 and L3 cache speed to that chart. Nice graph.

**TESKATLIPOKA** · 08-09-2011, 04:45 AM

I don't know their cache speeds, just the latency.

**drfedja** · 08-09-2011, 05:06 AM

L1 load-to-use latency is 4 cycles. Branch mispredict latency is 16 cycles, which implies that the integer pipeline is 16 stages long. Also, some simple ALU operations have 2 cycles longer latency. The BD pipeline must be 14-16 stages for integer and 19-21 cycles for FP.

BD module has 2x 2ALU + 2AGLU.

BD L1D can do 1x256-bit load (AVX) and 1x128-bit store at the same time, or 2x128-bit load and no store, because a BD core is limited to two memory operations by its 2 AGUs.
Other combinations are 2x128-bit load, 1x128-bit load and 1x128-bit store, or 2x64-bit store.
Like SB, BD L1D has data cache bandwidth of 384 bits/cycle in both directions.

K10 L1D can do 2x128-bit load, 1x128-bit load + 1x64-bit store, 2x64-bit store effectively 256-bits/cycle.
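
To make the bandwidth figures easier to compare, here is a small Python sketch that tabulates the load/store combinations listed above and picks the peak for each core. The dictionaries and the helper name are just illustrative; the bit widths are the ones stated in this post, not taken from a datasheet:

```python
# (load bits, store bits) per cycle for each combination listed in the post
bd_combos = {
    "1x256 load + 1x128 store": (256, 128),
    "2x128 load": (256, 0),
    "1x128 load + 1x128 store": (128, 128),
    "2x64 store": (0, 128),
}
k10_combos = {
    "2x128 load": (256, 0),
    "1x128 load + 1x64 store": (128, 64),
    "2x64 store": (0, 128),
}


def peak_bits_per_cycle(combos):
    """Largest combined load + store width among the listed combinations."""
    return max(load + store for load, store in combos.values())


print("BD L1D peak: ", peak_bits_per_cycle(bd_combos), "bits/cycle")
print("K10 L1D peak:", peak_bits_per_cycle(k10_combos), "bits/cycle")
```

On these numbers the BD peak comes from the 256-bit load plus 128-bit store case, while the K10 peak is the load-only case.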

**TESKATLIPOKA** · 08-09-2011, 05:56 AM

I will add the value 14-16 for the pipeline, at least until we know the real number.

2x64-bit store effectively 256-bits/cycle.
you meant 128-bits/cycle, right?

You are right, I wrote double the number of LS units. I will fix it right away.

If anything else is wrong just say it.

**drfedja** · 08-09-2011, 06:49 AM

Originally Posted by TESKATLIPOKA

I will add the value 14-16 for the pipeline, at least until we know the real number.

2x64-bit store effectively 256-bits/cycle.
you meant 128-bits/cycle, right?

I mean 2x64-bit store for 10h, or 2x128-bit load. Because 10h can't execute AVX 256 instructions, it loads data in 128-bit chunks.
This is 128 bits/cycle for stores, or 256 bits/cycle for loads.
In the Bulldozer core (not module), there is a 256-bit load + 128-bit store at the same time. With a Bulldozer module, double that.
A Bulldozer core can calculate 2 addresses at the same time because it has 2 AGUs (address generation units).

A Sandy core can also do 2 address operations at once, because it has 2 L/S AGUs. It has a slightly different approach for stores: the SB store unit is attached to the scheduler.

You are right, I wrote double the number of LS units. I will fix it right away.

If anything else is wrong just say it.

Yes, per core it has 2 ALUs and 2 AGUs. I've made a detailed diagram of the Bulldozer module, K10, Nehalem and of course the Sandy Bridge HT core architecture.
Attachment 118765

**SkullCracka** · 08-09-2011, 07:35 AM

BLAH BLAH BLAH ..................

**TESKATLIPOKA** · 08-09-2011, 09:18 AM

drfedja, great work, I love these diagrams. If you don't mind, I will link them on another forum I frequently visit.

**Oliverda** · 08-09-2011, 10:48 AM

Originally Posted by drfedja

I mean 2x64-bit store for 10h, or 2x128-bit load. Because 10h can't execute AVX 256 instructions, it loads data in 128-bit chunks.
This is 128 bits/cycle for stores, or 256 bits/cycle for loads.
In the Bulldozer core (not module), there is a 256-bit load + 128-bit store at the same time. With a Bulldozer module, double that.
A Bulldozer core can calculate 2 addresses at the same time because it has 2 AGUs (address generation units).

A Sandy core can also do 2 address operations at once, because it has 2 L/S AGUs. It has a slightly different approach for stores: the SB store unit is attached to the scheduler.

Yes, per core it has 2 ALUs and 2 AGUs. I've made a detailed diagram of the Bulldozer module, K10, Nehalem and of course the Sandy Bridge HT core architecture.

BTW, you interpret the Address Generation Units as units for calculating linear addresses as well as INC/LEA values. The Optimization Guide refers to them as simple integer execution units, too (AGLU).

Would you briefly explain what kind of operations these units can execute?

Thanks
