AMD's Bobcat and Bulldozer

**informal** · 08-30-2010, 05:03 PM

Originally Posted by Motiv

Being an absolute noob, could someone explain this to me.

How many pipelines are on the P2 (x4 for arguments sake), how do they feed the ALU & AGU normally.

To me it looks like bulldozer has cut down by 1 ALU&AGU per 'core'.

Agner Fog's microarchitecture.pdf is a good place to start. It has a section that tries to identify the bottlenecks in every major x86 design today, so 10h (often wrongly called K10) is covered. Essentially, 10h can in theory issue a maximum of 9 (nine) "micro ops"* per cycle but retire only 3 "macro ops"**. So there is a bottleneck in the retirement part of the design. As the document says, though, the utilization of the 9 units can't be effectively measured in the real world; it is clear that some of the time the execution units are underutilized, especially the 3rd AGU, which is redundant because the L1D cache has only 2 ports.

*A macro op is split into these micro ops, which are then sent to the execution units.
**A macro op is the instruction the decoder deals with; 1 x86 instruction typically = 1 or 2 macro ops.
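
A rough sketch of the issue-versus-retire bottleneck described above, in C++. The function name, its parameters, and the `micro_ops_per_macro_op` ratio are assumptions made purely for illustration and are not taken from Agner Fog's document. It only shows that with 9 issue slots and 3 retire slots, retirement stays the cap as long as a macro op needs 3 micro ops or fewer.

```cpp
#include <algorithm>
#include <cassert>

// Simplified model: sustained macro ops per cycle are limited by whichever
// stage is narrower, issue (counted in micro ops) or retirement (macro ops).
// micro_ops_per_macro_op is an assumed average, not a measured value.
int sustained_macro_ops_per_cycle(int issue_micro_ops,
                                  int retire_macro_ops,
                                  int micro_ops_per_macro_op) {
    assert(issue_micro_ops > 0);
    assert(retire_macro_ops > 0);
    assert(micro_ops_per_macro_op > 0);
    // Convert issue capacity to macro ops (integer division rounds down).
    const int issue_macro_ops = issue_micro_ops / micro_ops_per_macro_op;
    return std::min(issue_macro_ops, retire_macro_ops);
}
```

For example, `sustained_macro_ops_per_cycle(9, 3, 1)` and `sustained_macro_ops_per_cycle(9, 3, 2)` are both limited by the retire width, which is the point made above.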

edit:
continued on to Bulldozer
The front end can take up to 4 x86 instructions per cycle (I can't tell how that relates to the RISC-like macro ops in the 10h decoder stage) and dispatch them in 2 groups of 4 (macro ops?). Each integer core can do 4 operations per cycle (2 arithmetic and 2 address, though the AGen units may be able to do some math work too). A lot is still unknown, so we can't say what else is in there or how AMD organized it, at least not until launch.

**god_43** · 08-30-2010, 05:32 PM

Originally Posted by nn_step

Correction

while (!interrupted)
{
    cout << "terrace215 post: " << random(bull_sh1t_reason) << endl;

    cin >> wait_responses;

    // Test the flag; the original '=' assigned true and always took the branch.
    if (wait_responses) {
        cout << "terrace215 post: " << random(spout_more_sh1t) << endl;
    }
}

fixed...although might not work well if it was a real program ;p.

**nn_step** · 08-30-2010, 05:43 PM

Originally Posted by god_43

fixed...although might not work well if it was a real program ;p.

wait for response isn't required, since it is apparent that such activity doesn't actually exist. [At least in most of the posts made]

**qcmadness** · 08-30-2010, 05:47 PM

Originally Posted by Motiv

Being an absolute noob, could someone explain this to me.

How many pipelines are on the P2 (x4 for arguments sake), how do they feed the ALU & AGU normally.

To me it looks like bulldozer has cut down by 1 ALU&AGU per 'core'.

http://www.xbitlabs.com/articles/cpu...0_6.html#sect0

Upon the availability of data, the scheduler may issue one integer operation to the ALU and one address operation to the AGU from each queue. There can be a maximum of two simultaneous memory requests. So, up to 3 integer operations and 2 memory operations (64-bit read/write in any combination) may be issued for execution per clock. Micro-operations from various arithmetic MOPs are issued for execution from their queues in an out-of-order manner, depending on the readiness of the data.
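
The per-clock limits in that passage can be sketched as a tiny C++ model. This is only an illustration: the three-queue count is inferred from the "up to 3 integer operations" figure, and every name here (`kQueues`, `kMaxMemoryRequests`, `IssueCount`, `issue_one_cycle`) is hypothetical rather than anything from the article or from AMD.

```cpp
#include <array>
#include <cassert>

constexpr int kQueues = 3;             // assumption: one queue per integer lane
constexpr int kMaxMemoryRequests = 2;  // "maximum of two simultaneous memory requests"

struct IssueCount {
    int integer_ops;  // operations sent to ALUs this clock
    int memory_ops;   // address operations sent to AGUs this clock
};

// ready_alu[q] / ready_agu[q]: how many ops in queue q have their data ready.
IssueCount issue_one_cycle(const std::array<int, kQueues>& ready_alu,
                           const std::array<int, kQueues>& ready_agu) {
    IssueCount issued{0, 0};
    for (int q = 0; q < kQueues; ++q) {
        assert(ready_alu[q] >= 0 && ready_agu[q] >= 0);
        // At most one integer operation per queue per clock.
        if (ready_alu[q] > 0) {
            ++issued.integer_ops;
        }
        // At most one address operation per queue, capped by the memory limit.
        if (ready_agu[q] > 0 && issued.memory_ops < kMaxMemoryRequests) {
            ++issued.memory_ops;
        }
    }
    return issued;
}
```

With every queue full, this model issues 3 integer operations and 2 memory operations, matching the figures quoted above; it ignores out-of-order selection inside a queue.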

**jtdigital** · 08-30-2010, 06:44 PM

Isn't Bulldozer going to be released in Q2 2011, when the 8-core Sandy Bridge arrives to do battle?

**Hans de Vries** · 08-30-2010, 07:02 PM

Originally Posted by qcmadness

http://www.xbitlabs.com/articles/cpu...0_6.html#sect0

page 251 of: http://support.amd.com/us/Processor_TechDocs/25112.PDF

A.3 Superscalar Processor

The AMD Athlon 64 and AMD Opteron processors are aggressive, out-of-order, three-way
superscalar AMD64 processors. They can fetch, decode, and issue up to three AMD64 instructions
per cycle with a centralized instruction control unit (ICU) and two independent instruction
schedulers—an integer scheduler and a floating-point scheduler. These two schedulers can
simultaneously issue up to nine micro-ops to the three general-purpose integer execution units
(ALUs), three address-generation units (AGUs), and three floating-point execution units. The
processors move integer instructions down the integer execution pipeline, which consists of the
integer scheduler and the ALUs, as shown in Figure 6 on page 252. Floating-point instructions are
handled by the floating-point execution pipeline, which consists of the floating-point scheduler and
the floating-point execution units.

or alternatively:

http://www.chip-architect.com/news/2...Core.html#1.20

But don't forget that the average number of ALU instructions is something like 0.4 per cycle,
which is 4 to 5 times less than two ALUs can provide.
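
The arithmetic behind that remark, as a minimal C++ sketch. The function name is made up for illustration, and the 0.4 figure is just the rough average quoted above, not a measurement.

```cpp
#include <cassert>

// Ratio of peak ALU throughput (one op per ALU per cycle, an idealised
// assumption) to the average ALU demand actually observed per cycle.
double alu_headroom(int alu_count, double avg_alu_ops_per_cycle) {
    assert(alu_count > 0);
    assert(avg_alu_ops_per_cycle > 0.0);
    return alu_count / avg_alu_ops_per_cycle;
}
```

`alu_headroom(2, 0.4)` comes out at about 5, which is where the "4 to 5 times" comes from.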

Regards, Hans

**blindbox** · 08-30-2010, 07:20 PM

Originally Posted by Hans de Vries

forever{
terrace215 post: IPC decreases, because .....
terrace215 post: IPC decreases, says .... of AMD
terrace215 post: IPC decreases, according to AMD's presentation.
terrace215 post: IPC decreases, don't trust marketing guys.
terrace215 post: IPC decreases, because of the 2 ALUs..
terrace215 post: IPC decreases, the marketing guy isn't talking about IPC
terrace215 post: IPC decreases, because of the 16KB caches
terrace215 post: IPC decreases, AMD has given up improving IPC.
terrace215 post: IPC decreases, The AMD architect says it decreases by 5%
terrace215 post: IPC decreases, Bulldozer is only optimized for server workloads.
terrace215 post: IPC decreases, AMD presentation sheet no.X tells us so.
terrace215 post: IPC decreases, The more I post the more it decreases.
terrace215 post: IPC decreases, The more I post the more it decreases.
terrace215 post: IPC decreases, The more I post the more it decreases.
.....}
until (interrupt by Movieman)

Regards, Hans

You've summed it up better than we all could.

**god_43** · 08-30-2010, 07:23 PM

Originally Posted by nn_step

wait for response isn't required, since it is apparent that such activity doesn't actually exist. [At least in most of the posts made]

lool thats true..i have failed. cast out of troll programming school : (.

on topic. yeah its supposed to be 2011 q2, should be fun!

**JF-AMD** · 08-30-2010, 07:26 PM

Originally Posted by god_43

lool thats true..i have failed. cast out of troll programming school : (.

on topic. yeah its supposed to be 2011 q2, should be fun!

on topic, it is 2011. that is all that has ever been said.

**JumpingJack** · 08-30-2010, 09:45 PM

Guys ... David has posted a terrific summary of Bulldozer ... http://www.realworldtech.com/page.cf...2610181333&p=1

**tifosi** · 08-30-2010, 10:12 PM

Originally Posted by Movieman

Hello Hans.
.... Oh, forgot,no one drives me nuts. I just smile, grab my hammer and hit them upside the head so hard their grandchildren will walk with a 15degree list.

I honestly am waiting, like thousands of others, to get a sneak peek... :P I don't live in the city I mentioned earlier, the one with the cave running the (early sample) hardware in question... or I'd have done anything, Indiana Jones style (which is lame, methinks), to get to the 16-core unobtanium-optronium! :P

**-Boris-** · 08-30-2010, 11:01 PM

Originally Posted by Hornet331

Yes, one module is capable of 4 AGU/ALU ops, but that requires 2 threads. For a single thread you're down to 2/2, but with a front end that's much more capable than that of K8.

You're making a mistake: you're talking about relative performance, which is not related to IPC at all, or is only one part of the equation.

No, a BD Core has 2 ALUs AND 2 AGUs available. 2+2=4. A Phenom II has 3 ALUs OR 3 AGUs. 6/2 = 3.
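
The two counting conventions in that line can be written out as a small C++ sketch. It only restates the post's arithmetic; the `PipeModel` enum and the function name are hypothetical, and the "Exclusive" assumption for Phenom II is disputed in this thread (the AMD guide quoted earlier describes up to nine micro-ops issued simultaneously across ALUs, AGUs, and FP units).

```cpp
#include <cassert>

// How the ALU and AGU sharing one integer pipe are assumed to behave.
enum class PipeModel {
    Exclusive,   // ALU OR AGU per pipe per cycle (the post's Phenom II counting)
    Independent  // ALU AND AGU may both issue (the post's Bulldozer counting)
};

// Peak integer-side operations per cycle under the chosen counting model.
int peak_int_ops_per_cycle(int alus, int agus, PipeModel model) {
    assert(alus >= 0 && agus >= 0);
    if (model == PipeModel::Independent) {
        return alus + agus;        // e.g. 2 + 2 = 4
    }
    return (alus + agus) / 2;      // e.g. 6 / 2 = 3
}
```

Note that counting Phenom II with `PipeModel::Independent` instead gives 6, which is essentially the other side of the argument.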

EDIT:
And PLEASE, can we dedicate this thread to Bulldozer and not forum moderation? I too welcome the ban, but I'm sure we got enough criticism and back-patting here. There are other places we can continue doing that.

**[XC] Oj101** · 08-31-2010, 12:06 AM

I heard via the grapevine that it'll be 1H2011 for server, 4Q2011 for desktop

**ajaidev** · 08-31-2010, 12:44 AM

Originally Posted by JumpingJack

Guys ... David has posted a terrific summary of Bulldozer ... http://www.realworldtech.com/page.cf...2610181333&p=1

nice article by David and thanks for the heads up...

**geo** · 08-31-2010, 12:51 AM

Originally Posted by Oj101

I heard via the grapevine that it'll be 1H2011 for server, 4Q2011 for desktop

come on AMD make it the other way round :P

**-Boris-** · 08-31-2010, 12:54 AM

Originally Posted by geo

come on AMD make it the other way round :P

Might see a Bulldozer FX at the same time as the server chips.

**informal** · 08-31-2010, 02:22 AM

Originally Posted by JumpingJack

Guys ... David has posted a terrific summary of Bulldozer ... http://www.realworldtech.com/page.cf...2610181333&p=1

Thanks, that is a great article.

**savantu** · 08-31-2010, 02:24 AM

Originally Posted by -Boris-

No, a BD Core has 2 ALUs AND 2 AGUs available. 2+2=4. A Phenom II has 3 ALUs OR 3 AGUs. 6/2 = 3.

..

K10 has 3 ALUs and 3 AGUs. No matter how hard you and others try to downplay K10 execution resources, fact is, a K10 integer core has more resources than a BD integer core.

The docs linked by Hans are pretty clear.

http://www.xtremesystems.org/forums/...&postcount=681

**informal** · 08-31-2010, 02:33 AM

Bobcat is a 2-way (2 ALU + 2 AGU) design, has 90% of Propus performance, and is a low-power design with solid performance. One can expect a Bulldozer core to stomp all over a Bobcat core, but both have fewer ALUs/AGUs than 10h. The number of units means nothing if you can't use them effectively, and you know that. The number of core-level changes is pretty big: L/S improvements, prefetch, BP, shared L2, etc. As Anand wrote (info from AMD), per-core performance will be better than 10h.

**JF-AMD** · 08-31-2010, 03:07 AM

Originally Posted by savantu

K10 has 3 ALUs and 3 AGUs. No matter how hard you and others try to downplay K10 execution resources, fact is, a K10 integer core has more resources than a BD integer core.

The docs linked by Hans are pretty clear.

http://www.xtremesystems.org/forums/...&postcount=681

No, you are wrong. Old architecture has shared resources, new architecture has dedicated resources.

A BD integer core will do more IPC and perform single threads faster than an old core.

Why do you keep saying these things even though I have posted the information in multiple places?

**freeloader** · 08-31-2010, 03:19 AM

Originally Posted by JF-AMD

No, you are wrong. Old architecture has shared resources, new architecture has dedicated resources.

A BD integer core will do more IPC and perform single threads faster than an old core.

Why do you keep saying these things even though I have posted the information in multiple places?

JF...have you personally seen running BD chips yet or whatever the server variant is called? Just wondering how you're so sure if you haven't bench tested one yet.

Does anyone here know when BD compatible socket motherboards will go on sale?

**STaRGaZeR** · 08-31-2010, 03:24 AM

Originally Posted by JF-AMD

No, you are wrong. Old architecture has shared resources, new architecture has dedicated resources.

He's right. K10 has more resources, shared or not.

**-Boris-** · 08-31-2010, 03:29 AM

Originally Posted by freeloader

JF...have you personally seen running BD chips yet or whatever the server variant is called? Just wondering how you're so sure if you haven't bench tested one yet.

Does anyone here know when BD compatible socket motherboards will go on sale?

I'm pretty sure that when you have the position JF has in a company you get pretty accurate numbers from engineering and so on. There is no need for him to sit down and bench engineering samples personally. Would be quite stupid if engineering lied about the performance in internal reviews and documents.
You know this isn't Dilbertland right?

**-Boris-** · 08-31-2010, 03:31 AM

Originally Posted by STaRGaZeR

He's right. K10 has more resources, shared or not.

If you can't use it it isn't a resource. Phenom has only three integer pipes. In one of those pipes the AGU and ALU have to take turns being part of the resource pool.

**informal** · 08-31-2010, 03:34 AM

10h can retire 3 macro ops per cycle. A BD integer core/FP core should be able to do 4.
