AMD FX-8150 Bulldozer finally tested

**Opteron146** · 10-13-2011, 02:12 PM

Originally Posted by Drwho?

Well, BD is very similar to Prescott in many ways. It is very, very sensitive to code quality: because each thread of a core only has 2-wide decode, the front end is very limited, and you have to wait through more decode steps before the out-of-order engine gets enough parallelism opportunities.

No, the front end is vertically multithreaded, i.e. in one clock all 4 decoders go to one thread, in the next clock to the other. If the 2nd thread does not decode anything, obviously the other thread can keep the front end for longer than 1 clock cycle ;-)
bulldozer4b.jpg
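
To make the alternation concrete, here is a toy model of that scheme. It is only a sketch under stated assumptions: plain round-robin between threads that have work, a fixed width of 4, and a made-up function name `decode_slots`. It is not a description of the real arbitration logic.

```python
def decode_slots(cycles, width, active):
    """Toy model of a vertically multithreaded decoder.

    Each cycle the full `width` goes to exactly one thread. Threads that
    have work take turns; a thread with nothing to decode is skipped.
    `active` maps thread id -> True if that thread has instructions waiting.
    Returns the total decode slots granted to each thread.
    """
    slots = {t: 0 for t in active}
    runnable = [t for t in active if active[t]]
    if not runnable:
        return slots
    for cycle in range(cycles):
        owner = runnable[cycle % len(runnable)]  # assumed round-robin policy
        slots[owner] += width
    return slots


both = decode_slots(cycles=100, width=4, active={0: True, 1: True})
alone = decode_slots(cycles=100, width=4, active={0: True, 1: False})
print(both)   # each thread averages 2 slots per cycle, but gets 4 at a time
print(alone)  # thread 0 keeps all 4 decoders every cycle
```

In this model a thread running alone is never limited to 2 decode slots per cycle; the average of 2 only appears when both threads are busy.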

This is the major difference from Hyper-Threading, where each thread can get up to 5-wide decoding.

How do you count to 5 now? Do you include macro-op fusion, too? There are only 3 fast-path decoders plus 1 complex decoder in Intel's design. Officially they count 4:
32_m.png
Anyway, macro-op fusion is used in Bulldozer now, too, so you have to count 5 for AMD as well (though in fewer cases: AMD's fusion is on Conroe's level, Nehalem got more fusion capabilities, not sure about Sandy now).

BD will stay very sensitive to code quality as long as the front end is not 4 wide for each thread, this is the bottom line.

As said above, each thread has 4 decoders, or 5 if you count fusion. How is Intel running Hyper-Threading on the 3+1 decoders? Each thread gets 4 decoders, so 8 total? That would be new to me and to Intel. If Intel does it differently from AMD, then they have to run both threads simultaneously. However, that would mean "only" 2 decoders for each thread, and that's exactly the baaaad case you wrote about above in your incorrect statement about AMD's decoder in the beginning.

Hope it is ok to share my point of view and personal analysis of the performance issues.
It is ok to disagree ;-)

Discussion is always fine, however in the above case, I assume you are rather wrong.

cheers

Opteron

**Drwho?** · 10-13-2011, 04:01 PM

Originally Posted by Opteron146

No, the front end is vertically multithreaded, i.e. in one clock all 4 decoders go to one thread, in the next clock to the other. If the 2nd thread does not decode anything, obviously the other thread can keep the front end for longer than 1 clock cycle ;-)
bulldozer4b.jpg

How do you count to 5 now? Do you include macro-op fusion, too? There are only 3 fast-path decoders plus 1 complex decoder in Intel's design. Officially they count 4:
32_m.png
Anyway, macro-op fusion is used in Bulldozer now, too, so you have to count 5 for AMD as well (though in fewer cases: AMD's fusion is on Conroe's level, Nehalem got more fusion capabilities, not sure about Sandy now).

As said above, each thread has 4 decoders, or 5 if you count fusion. How is Intel running Hyper-Threading on the 3+1 decoders? Each thread gets 4 decoders, so 8 total? That would be new to me and to Intel. If Intel does it differently from AMD, then they have to run both threads simultaneously. However, that would mean "only" 2 decoders for each thread, and that's exactly the baaaad case you wrote about above in your incorrect statement about AMD's decoder in the beginning.

Discussion is always fine, however in the above case, I assume you are rather wrong.

cheers

Opteron

I propose you take a simple linear algorithm and run it on one thread, then count the number of instructions retired and divide by the number of clock ticks ... You'll be surprised ;-)
(Make sure your code is pure compute, with 1 to 2 instructions of dependency ...)
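
The arithmetic itself is just one division. A minimal sketch, assuming you already have the two counter readings from whatever profiling tool you use; the function name and the sample numbers are made up for illustration and are not measurements from any real chip:

```python
def ipc(instructions_retired, core_cycles):
    """Instructions per cycle from two hardware counter readings."""
    if core_cycles <= 0:
        raise ValueError("core_cycles must be positive")
    return instructions_retired / core_cycles


# Hypothetical counter readings, NOT measurements from a real part.
sample_instructions = 8_000_000_000
sample_cycles = 4_200_000_000  # one second of cycles at 4.2 GHz

print(round(ipc(sample_instructions, sample_cycles), 2))
```

Count cycles on the same core and over the same interval as the retired instructions, otherwise the ratio means nothing.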

PowerPoint slides are one thing, but measuring and checking yourself is much better ... Otherwise, at 4.2 GHz, how could you explain the poor performance of BD on SuperPI? Low IPC ... Then ask yourself, if you measure the IPC for each thread, why it never goes above 2 on a single thread ... Please experiment before trying to correct me. I did my homework ;-)

Then, for your Intel diagram, you forgot to count code fusion ... SandyB is 4 wide + fusion ... That gives you up to 5!

We saw a lot of PowerPoint slides, but the measurements don't match what is shown in them. Sorry, you assume the marketing slides are correct, and this is where the gap is. I looked everywhere and could not find anything that clearly says it will decode more than 2 per thread, matched with ASM code doing more than 2 IPC. Did you try?

Hehe ...

Francois

**Gener_AL (UK)** · 10-13-2011, 04:39 PM

Oh, while we have an Intel chap here in an AMD thread discussing the benefits of SB, can you please expand on the DRM architecture within SB and future generations?
Might as well ask eh.
http://blogs.intel.com/technology/20...t_is_it_no.php

No, please don't let me stop you. If you can't answer, then please bang your head on the door on the way out.

**Opteron146** · 10-13-2011, 04:49 PM

Originally Posted by Drwho?

I propose you take a simple linear algorithm and run it on one thread, then count the number of instructions retired and divide by the number of clock ticks ... You'll be surprised ;-)
(Make sure your code is pure compute, with 1 to 2 instructions of dependency ...)

I guess your point is, that it is next to 1?

PowerPoint slides are one thing, but measuring and checking yourself is much better ... Otherwise, at 4.2 GHz, how could you explain the poor performance of BD on SuperPI? Low IPC ... Then ask yourself, if you measure the IPC for each thread, why it never goes above 2 on a single thread ... Please experiment before trying to correct me. I did my homework ;-)

Why should I be interested in code that is more than 10 years old, from when the Pentium 1 was state of the art and MMX the most modern instruction set extension? If you want, we can discuss Cinebench, which is much more interesting, and BD's performance is very bad there, too. Any guesses what is happening there?

Then, for your Intel diagram, you forgot to count code fusion ... SandyB is 4 wide + fusion ... That gives you up to 5!

Well yes, for 2 threads.
Furthermore, that's not my diagram, it is Intel's diagram ;-) There are only 4 decoders, but these can decode 4+1 x86 instructions. If you say that there are "5 decoders" because of that, then I would assume you work in marketing ;-)

We saw a lot of PowerPoint slides, but the measurements don't match what is shown in them. Sorry, you assume the marketing slides are correct, and this is where the gap is. I looked everywhere and could not find anything that clearly says it will decode more than 2 per thread, matched with ASM code doing more than 2 IPC. Did you try?

If you only measure retired instructions then well ... there are only 2 INT pipes, obviously the max IPC of these is 2. Question is now do you want to say that it is only 1 in most cases?
Unfortunately, I cant measure myself, did not buy a BD yet, and I am not sure if I will in the future ;-)
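
Taking the widths quoted in this thread at face value (they are assumptions here, not verified specs), the argument can be sketched as a simple bottleneck model: sustained IPC cannot exceed the narrowest stage. The function and dictionary names are made up for illustration.

```python
def ipc_ceiling(stage_widths):
    """Return (stage, width) for the narrowest stage, which caps sustained IPC."""
    limiter = min(stage_widths, key=stage_widths.get)
    return limiter, stage_widths[limiter]


# Widths as claimed in this thread; assumptions, not verified specs.
wide_decode = {"decode": 4, "int_pipes": 2}
narrow_decode = {"decode": 2, "int_pipes": 2}

print(ipc_ceiling(wide_decode))    # the INT pipes are the limiter
print(ipc_ceiling(narrow_decode))  # same ceiling of 2 either way
```

In this model a measured ceiling of 2 looks the same for both configurations, so on its own it does not say which stage is the limiter.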

Well, guessing from the bad performance numbers, I would believe an IPC of ~1. IPC has to be low, obviously. The big question is why. IMO the decoders are the least of the problems. For example, how about AMD's version of fusion, does that not work?

**Drwho?** · 10-13-2011, 05:05 PM

Just do your homework, thanks.
Figure out your max IPC (it is not 1) ... And well, then ... conclude whatever you want. I know my answer and am very happy with it ... But I am sure you know better than I do ;-)

**Opteron146** · 10-13-2011, 06:07 PM

Originally Posted by Drwho?

Just do your homework, thanks.

I already told you that I can't:

Originally Posted by Opteron146

I cant measure myself, did not buy a BD yet, and I am not sure if I will in the future ;-)

"Cant" like in "i can not".
You just make some assumptions, add some incorrect decoder information and then you expect I believe you?
Either put your stuff on the table, then we can discuss it, or leave the thread. AFAIK, this is still a discussion board here, not a quiz show.

Thank you

**Drwho?** · 10-13-2011, 06:27 PM

Originally Posted by Opteron146

I already told you that I can't:

"Cant" like in "i can not".
You just make some assumptions, add some incorrect decoder information and then you expect I believe you?
Either put your stuff on the table, then we can discuss it, or leave the thread. AFAIK, this is still a discussion board here, not a quiz show.

Thank you

Well, you assume that I am incorrect, I know you are

... I did put the stuff on the table. I told you, I tried a permutation of codes to figure out how wide the decoding capability is, and I ended up with 2 IPC per thread ...

Sorry ... If you had a part, you would end up with the same number. You can throw as many PowerPoint presentations at me as you like, you are not going to change my mind.

You are free to trust your PowerPoint presentation, I don't mind. You are just a little bit too quick to say that somebody is wrong without even having a part in your hands ... Don't you think that is a little too much?

PS: Yep, you can design code to figure out the specs of a CPU. It is a nice practice, you should try.

With Respect,
Francois

**BeepBeep2** · 10-13-2011, 06:43 PM

Originally Posted by Drwho?

Well, you assume that I am incorrect, I know you are

... I did put the stuff on the table. I told you, I tried a permutation of codes to figure out how wide the decoding capability is, and I ended up with 2 IPC per thread ...

Sorry ... If you had a part, you would end up with the same number. You can throw as many PowerPoint presentations at me as you like, you are not going to change my mind.

You are free to trust your PowerPoint presentation, I don't mind. You are just a little bit too quick to say that somebody is wrong without even having a part in your hands ... Don't you think that is a little too much?

PS: Yep, you can design code to figure out the specs of a CPU. It is a nice practice, you should try.

With Respect,
Francois

I think you are both wrong.

Him for accusing you of being incorrect and you for sitting on your high horse.

**Opteron146** · 10-13-2011, 07:56 PM

Originally Posted by Drwho?

Well, you assume that I am incorrect, I know you are

... I did put the stuff on the table. I told you, I tried a permutation of codes to figure out how wide the decoding capability is, and I ended up with 2 IPC per thread ...

Sorry ... If you had a part, you would end up with the same number. You can throw as many PowerPoint presentations at me as you like, you are not going to change my mind.

Fine, if you say so, I believe you. However, your "5 decoders" information was wrong. Furthermore, you ignored my question about why Intel should be better: there are also only 4 decoders for 2 threads when SMT is active.
Wrong information + ignoring -> lower trust -> less credibility.

You are free to trust your PowerPoint presentation, I don't mind. You are just a little bit too quick to say that somebody is wrong without even having a part in your hands ... Don't you think that is a little too much?

I didn't say that you are wrong in the IPC case, I just said that you are wrong with your decoder counting. Furthermore, the little picture that you nicely promoted to a whole "PowerPoint presentation" was published in IEEE Micro, March/April 2011. Of course it also came with accompanying text:

Each block indicates a local pipeline that forms a different thread switch domain. Once a thread switch decision has been made at the start of the local pipeline, the decision propagates the pipeline’s length. A thread switch can occur as often as every cycle, so multiple threads can be in flight in the pipelines, but never in the same pipestage. Decoupling queues, which exist for the normal pipelining of the front-end, serve as the different domains’ boundaries. Conceptually, the FPU coprocessor’s front end is an extension of the dispatch thread domain.

Now I believe you, but I also believe the authors of that paper (Michael Butler, Leslie Barnes, Debjit Das Sarma, Bob Gelinas). These people should at least know what they built. Thus my conclusion is:
There is a bug or, less likely, they lied or your measurements were insufficient.

PS: Yep, you can design code to figure out the specs of a CPU. It is a nice practice, you should try.

Yes, nothing new. But thanks for stating it for the other readers ;-)

regards

Opteron

**mAJORD** · 10-15-2011, 05:28 PM

Originally Posted by Drwho?

I propose you take a simple linear algorithm and run it on one thread, then count the number of instructions retired and divide by the number of clock ticks ... You'll be surprised ;-)
(Make sure your code is pure compute, with 1 to 2 instructions of dependency ...)

PowerPoint slides are one thing, but measuring and checking yourself is much better ... Otherwise, at 4.2 GHz, how could you explain the poor performance of BD on SuperPI? Low IPC ... Then ask yourself, if you measure the IPC for each thread, why it never goes above 2 on a single thread ... Please experiment before trying to correct me. I did my homework ;-)

Then, for your Intel diagram, you forgot to count code fusion ... SandyB is 4 wide + fusion ... That gives you up to 5!

We saw a lot of PowerPoint slides, but the measurements don't match what is shown in them. Sorry, you assume the marketing slides are correct, and this is where the gap is. I looked everywhere and could not find anything that clearly says it will decode more than 2 per thread, matched with ASM code doing more than 2 IPC. Did you try?

Hehe ...

Francois

Sorry, I'm a bit lost here. Why are you focusing on one thread when the front end of Bulldozer is responsible for two threads, just like Sandy Bridge?

I know it falls behind clock for clock, but don't you think that has more to do with other bottlenecks? Including, for integer code, the much-debated ALU resources on a single thread? What about the longer pipeline? Are you taking into account that there may still be a deficiency in branch prediction next to Intel? Pipeline bubbles (floating point) that get filled by a 2nd thread?

What would be more interesting, I think, is comparing code that's exclusively floating point with a single thread, then with two threads, both on the one module. This would remove the integer clusters from the equation completely. (Don't know if this is practical... programming knowledge is my deficiency, so help me out here!)
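
As a rough sketch of how the result of such a comparison could be read, assuming you time one copy of an FP-only job alone and then two copies side by side on the same module: the function name is made up, and pinning the threads to one module and doing the timing itself are left to whatever tools are at hand.

```python
def module_scaling(time_one_thread, time_two_threads):
    """Throughput gain from running a second copy of the same job on one module.

    time_one_thread:  seconds for one copy running alone
    time_two_threads: seconds for two copies running side by side
    Returns 2.0 for perfect scaling and 1.0 when the second thread adds nothing.
    """
    if time_one_thread <= 0 or time_two_threads <= 0:
        raise ValueError("times must be positive")
    return 2 * time_one_thread / time_two_threads


# Hypothetical timings, only to show how the number reads.
print(module_scaling(10.0, 10.0))  # second copy was free: perfect scaling
print(module_scaling(10.0, 20.0))  # second copy doubled the time: no gain
```

A value close to 1.0 would point at a shared resource being the limit; a value close to 2.0 would say the two threads barely get in each other's way.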

**haylui** · 10-15-2011, 06:46 PM

Originally Posted by mAJORD

Sorry, I'm a bit lost here. Why are you focusing on one thread when the front end of Bulldozer is responsible for two threads, just like Sandy Bridge?

I know it falls behind clock for clock, but don't you think that has more to do with other bottlenecks? Including, for integer code, the much-debated ALU resources on a single thread? What about the longer pipeline? Are you taking into account that there may still be a deficiency in branch prediction next to Intel? Pipeline bubbles (floating point) that get filled by a 2nd thread?

What would be more interesting, I think, is comparing code that's exclusively floating point with a single thread, then with two threads, both on the one module. This would remove the integer clusters from the equation completely. (Don't know if this is practical... programming knowledge is my deficiency, so help me out here!)

IMO, the BD front end is two separate computational units or integer cores. If there is only 1 thread going in, I think the two cores couldn't break the single thread into two components and then later fuse it back together at the back end.

**mAJORD** · 10-16-2011, 12:10 AM

Originally Posted by haylui

IMO, the BD front end is two separate computational units or integer cores. If there is only 1 thread going in, I think the two cores couldn't break the single thread into two components and then later fuse it back together at the back end.

The separate integer cores are the execution units / schedulers. The front end is unified, 1 per module.
