AMD's Bobcat and Bulldozer

**Hans de Vries** · 08-24-2010, 05:44 PM

Originally Posted by Mats

What's the reason behind the odd die shape, do you know?

The synthesis starts with rectangular shapes but the logic migrates
during the optimization process. Some pieces of one unit even end up
in the middle of other units (typically interface logic between the two
units). For some reason the hardware synthesizer concludes that it's
electrically or timing-wise better to move it there.

Regards, Hans

**Mats** · 08-24-2010, 05:47 PM

Ok, thanks for the explanation, both of you!

**Mats** · 08-24-2010, 05:51 PM

So the Hot Chips conference ended five minutes ago. Where will we find the first reports from it?

**JF-AMD** · 08-24-2010, 06:39 PM

Originally Posted by xlink

fewer execution units...

those three factors coupled with other architectural improvements would need to be able to be at least 50% faster in some instances.

Today's processors have 3 execution units that are shared between ALU/AGU. That is essentially 1.5 ALU and 1.5 AGU. With BD we get 2 AGU and 2 ALU. Much better.

**Jowy Atreides** · 08-24-2010, 07:06 PM

Originally Posted by god_43

honestly wtf cares about single thread performance? geez if ppl cared about it, they would not be buying dual-cores even. the whole nature of multiple cores is for multi-threading, anything else is fail!

Single thread performance does not equal single core purpose.

Example:

I run the Dolphin emulator with 4 cores, each set to its own thread.
If the single-thread performance is low, the game is not playable unless you like choppy slow motion.

If I have a 4 GHz i3, the game is fast and smooth, despite having 2 fewer cores.

Individual core performance is FAR FAR FAR FAR FAR more important for home users and gamers than core count.

I'm not running a server. I'm running a performance machine that needs fast processing of audio threads to avoid lag or stuttering. The same goes for games, and for time-dependent applications like my sql real-time stats heads-up display for stock analysis.

Screw having 999 cores, just give me 4 FAST ones.

**vietthanhpro** · 08-24-2010, 07:47 PM

Originally Posted by JF-AMD

Today's processors have 3 execution units that are shared between ALU/AGU. That is essentially 1.5 ALU and 1.5 AGU. With BD we get 2 AGU and 2 ALU. Much better.

2 mem ops per core per cycle
not 3 mem ops(2 load and 1 store) per core per cycle
but .............

-------------------------------------
FPU with
2x128 bit MMX
2x128 bit FMAC
--> non SSE: 4 DP per cycle ?
--> SSE 128 bit: 8 DP per cycle ?
--> SSE 256 bit: 8 DP per cycle ?
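
A back-of-the-envelope check of those three guesses, as a minimal sketch. The assumptions here are not from the slides: a DP value is 64 bits wide, an FMAC counts as 2 flops per lane while a plain add or multiply counts as 1, and a 256-bit op gangs both 128-bit pipes into a single wide pipe. `peak_dp_per_cycle` is a hypothetical helper, not anything AMD published.

```python
def peak_dp_per_cycle(pipes, pipe_bits, flops_per_lane):
    """Peak double-precision flops per cycle for a set of identical pipes.

    pipes          -- number of pipes issuing one op each per cycle
    pipe_bits      -- datapath width of each pipe in bits
    flops_per_lane -- 1 for a plain add or multiply, 2 for a fused multiply-add
    """
    lanes_per_pipe = pipe_bits // 64  # one DP value is 64 bits (assumption)
    return pipes * lanes_per_pipe * flops_per_lane


# Two 128-bit pipes doing plain adds/multiplies, no FMA.
print(peak_dp_per_cycle(pipes=2, pipe_bits=128, flops_per_lane=1))

# Two 128-bit pipes, each issuing a 128-bit FMA.
print(peak_dp_per_cycle(pipes=2, pipe_bits=128, flops_per_lane=2))

# One 256-bit FMA occupying both pipes at once (assumption).
print(peak_dp_per_cycle(pipes=1, pipe_bits=256, flops_per_lane=2))
```

Under those assumptions the three cases come out to 4, 8 and 8, which lines up with the guesses above. These are peak issue rates only and say nothing about sustained throughput.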

**informal** · 08-24-2010, 07:56 PM

Originally Posted by vietthanhpro

2 mem ops per core per cycle
not 3 mem ops(2 load and 1 store) per core per cycle
but .............

Since 10h could do 1x128-bit load and 1x64-bit store, basically 2x the load capability of K8, without even using the 3rd AGU (look at the AT article: the 3rd AGU was redundant/unused but kept for other reasons, namely symmetry of the units), I don't see how AMD couldn't double this with BD. BD will have full OoO load/store capability, unlike what K8->10h brought (limited to loads only). There are other clues about 2 loads/1 store per core, namely the GCC source code that describes BD scheduling.

**Raqia** · 08-24-2010, 08:35 PM

Looks like full slides are up at anand:

http://www.anandtech.com/show/3865/a...tations-online

No transcripts yet. Some interesting stuff about pointers being used to prevent unnecessary data transfers in Bobcat.

**64NOMIS** · 08-24-2010, 09:23 PM

Originally Posted by Hans de Vries

1) JF told you at the other thread that IPC is higher.

2) If that's true then the higher frequency design comes on top of that.

3) And last but not least: Power gating Turbo now allows much higher single core frequencies.

Looks like a 1-2-3 speed bump for single thread performance to me....

Regards, Hans

I've been spending some time thinking about client loads on BD. Improving per core power consumption & max frequency required significant rethinking of the x86 pipeline and we will see these differences play out as changes in IPC performance across workloads. Outside of desktop these are necessary to improving mobile performance and battery life which in my view will remain a consumer-felt issue for decades to come. I think of the smartphone and tablet like the original IBM PC. We've just begun.

Bobcat will bring x86 and the performance of the PC experience - multi-tasking, web experience, media processing, standards-based connectivity - to ultra mobility and appliance computing. My biggest concern for Bobcat is the application environments for some of the potential form factors. I want Bobcat and its descendants in an alarm clock, but I want an application environment well suited to an alarm clock. Voice, remote gesture, intelligent agency, handheld graphics need an application environment built around them, not extending to them. Sitting down and using this platform was an interesting experience because I found myself pushing it around in all the wrong ways - it wants to live in a small ultra-mobile device or a specialty client and to do things suited to those form factors. With novel IPC/TDP combinations I am hopeful for form factor and usage innovation.

**madcho** · 08-24-2010, 10:33 PM

This one is the most important slide, I think:

http://www.anandtech.com/Gallery/Album/754#16

**-Boris-** · 08-24-2010, 10:44 PM

Originally Posted by Chumbucket843

lol, it is the exact opposite. hand layout is much better. humans are better at finding Eulerian paths and coming up with clever layouts. computers can't really do that as effectively with all of the design rules and other parameters. the difference in performance is 2.6-7x faster with custom designed circuits.

really what happens is a coder will simulate his module and make sure it reaches the targeted timing, which is usually much higher than actual delay to assure robust operation. if the logic can't reach the speed it is either rewritten or circuit designers optimize it. in certain logic families it must be entirely custom designed.

circuits that are custom designed are usually things like power gating, clock distribution, and analog circuits such as PLLs, DLLs, and memory controllers/I/O pads.

I've heard humans are better at those tasks, but that was many years ago; I thought that computers would be better at this point. Honestly I don't see any reason why computers would fail at such "logical" tasks.

**freeloader** · 08-24-2010, 10:51 PM

Given the information that AMD has released today, is it possible for anyone here to make an educated guess on how much faster BD will be clock for clock over Deneb? Those slides go beyond my basic understanding of CPU design.

**I_no** · 08-24-2010, 10:58 PM

One point to note is that L2 will run at half the speed of the core.
I think that used to be the case for past architectures; I don't know of any modern x86 core that does that. Or is there one?

**madcho** · 08-24-2010, 11:01 PM

Originally Posted by freeloader

Given the information that AMD has released today, is it possible for anyone here to make an educated guess on how much faster BD will be clock for clock over Deneb? Those slides go beyond my basic understanding of CPU design.

Phenom was bad at branch prediction, so AMD improved it far beyond Intel's best; just look at the die space it takes up even on the "little Bobcat". It's just impressive. The ALU pipes seem to have been changed from 3 ALU/AGU that can do 1.5 of each per cycle to 2 AGU + 2 ALU that can do 2 of each per cycle. So more IPC.

The L1D is not very clear; we need to know if it's inclusive or not. So it could be incredibly faster, or simply as good as the old Phenom II. Latency is said to be masked.

L2 latency is 17 cycles, so it's not bad. It seems to be 1MB and shared between the 2 ALU cores, which means less data will have to go through L3 when moving between cores.

The pipe is longer, but aimed at ramping up clocks, and prediction is far better, to hide the bad effect of the long pipe.

It's gonna be the "Core 2" effect, I think.

**madcho** · 08-24-2010, 11:03 PM

Originally Posted by I_no

One point to note is that L2 will run at half the speed of the core.
I think that used to be the case for past architectures; I don't know of any x86 core that does that. Or is there one?

L3 is not full speed in Phenom, I think.
And the half-speed L2 in Bobcat helps keep power low; that's the first goal, and the IPC doesn't suffer too badly.

For BD it's full speed.

**freeloader** · 08-24-2010, 11:08 PM

The only other thing I've read that's got me confused is the following...

http://www.rage3d.com/articles/amd_h.../index.php?p=5

"Bulldozer on the Desktop"

For the desktop, the Zambezi processor is good news and bad news. The good news is it's an 8 core product, the bad news is it needs a new socket - AM3r, or AM3+. This is an electrical upgrade of the AM3 platform, to provide the power phases and planes/states required by the power gating features of Zambezi. As you might have guessed from the name, this socket is backwards compatible with existing AM3 processors,.....

So BD is not compatible with AM3? Not a big deal as a new arch usually requires a new socket anyhow. I've read up to this point that BD would be compatible with AM3.

**xlink** · 08-24-2010, 11:11 PM

Originally Posted by -Boris-

I've heard humans are better at those tasks, but that was many years ago; I thought that computers would be better at this point. Honestly I don't see any reason why computers would fail at such "logical" tasks.

Transistor count is going up faster than performance is increasing. It's getting more and more complex overall. Computers can generate a solid generic path which is OK, but humans still need to tweak it.

**-Boris-** · 08-24-2010, 11:23 PM

Originally Posted by xlink

Transistor count is going up faster than performance is increasing. It's getting more and more complex overall. Computers can generate a solid generic path which is OK, but humans still need to tweak it.

Yeah, but one might think that computers would be better at the rough overall layout, better utilization of die space and so on. And while transistor count is going up faster than performance, the human brain is the same.

**Calmatory** · 08-25-2010, 12:05 AM

Anyone willing to take bets on cache inclusiveness/exclusiveness?

I'd bet all I've got on an inclusive cache... unless cutting the L1D down by 75% is merely for smaller die size. But in that case it would be only logical to go inclusive, because the benefit of an exclusive cache would be lost. This again would require some reworking, but would allow far better cache latencies and bandwidth. I'd also claim that the increase in performance would be higher than with Athlon II with no L3 vs. PhII with L3, which is around 5-6%, ranging from 0 to ~20%.
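
To put rough numbers on the "benefit of exclusive cache would be lost" argument, here is a minimal sketch. It assumes the simplest possible model: an exclusive L1D/L2 pair holds L1D + L2 worth of unique data, while an inclusive pair holds only L2 worth, because L1D contents are duplicated in L2. The Phenom II sizes (64KB L1D, 512KB L2 per core) are known; the Bulldozer sizes (16KB L1D from the 75% cut, 1MB L2) are the guesses made in this thread, not confirmed figures. `exclusive_gain` is a hypothetical helper.

```python
def exclusive_gain(l1d_kb, l2_kb):
    """Fractional extra unique capacity an exclusive L1D/L2 pair has
    over an inclusive one of the same sizes (simplified model)."""
    exclusive_capacity = l1d_kb + l2_kb  # no line is held twice
    inclusive_capacity = l2_kb           # every L1D line is also in L2
    return (exclusive_capacity - inclusive_capacity) / inclusive_capacity


# Phenom II style: 64KB L1D, 512KB L2.
print(f"Phenom II: {exclusive_gain(64, 512):.1%}")

# Thread's guess for Bulldozer: 16KB L1D, 1MB L2 (assumption).
print(f"BD guess:  {exclusive_gain(16, 1024):.1%}")
```

In this model the exclusive design buys 12.5% extra capacity with the Phenom II sizes but under 2% with the guessed Bulldozer sizes, which is the sense in which the benefit mostly disappears once the L1D shrinks.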

**JkS** · 08-25-2010, 01:16 AM

I'm not sure if this has been posted already, but it might be useful to some.

http://www.youtube.com/watch?v=VIs1CxuUrpc

**madcho** · 08-25-2010, 01:20 AM

Originally Posted by Calmatory

Anyone willing to take bets on cache inclusiveness/exclusiveness?

I'd bet all I've got on an inclusive cache... unless cutting the L1D down by 75% is merely for smaller die size. But in that case it would be only logical to go inclusive, because the benefit of an exclusive cache would be lost. This again would require some reworking, but would allow far better cache latencies and bandwidth. I'd also claim that the increase in performance would be higher than with Athlon II with no L3 vs. PhII with L3, which is around 5-6%, ranging from 0 to ~20%.

I bet on an exclusive L1-I and inclusive L1-D, and L2-L3 exclusive.

**Sn0wm@n** · 08-25-2010, 01:40 AM

Originally Posted by -Boris-

I've heard humans are better at those tasks, but that was many years ago; I thought that computers would be better at this point. Honestly I don't see any reason why computers would fail at such "logical" tasks.

because computers can't think for themselves to implement complicated logic, maybe?

**-Boris-** · 08-25-2010, 01:49 AM

Originally Posted by Sn0wm@n

because computers can't think for themselves to implement complicated logic, maybe?

And you think it must be capable of thought to arrange predetermined connections in an even and efficient way?

**Dresdenboy** · 08-25-2010, 02:00 AM

Way too much discussion for me to answer everything in detail. Some thoughts:

Who said that each mem op in BD actually needs an AGU op? Could there also be a fast-path (single addition) address generation somewhere else? Do addresses have to be calculated each time?

3 ALUs/3 AGUs with their respective reservation stations were used in a symmetrical configuration in K8 to create OoO opportunities for execution. µOps couldn't change their "lane". If that "add rax, rdx" was in reservation station 0, which is busy, while the other RSs were free, this instruction would still have to wait -> IPC goes down.

2 ALUs + 2 AGUs with a unified scheduler would allow these units to be used as they become available. Our "add rax, rdx" wouldn't be bound to a busy ALU, so it could execute on the free one -> IPC goes up (vs. reservation stations).
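
The lane-binding effect described above can be shown with a toy model. This is only a sketch under loud assumptions, not a model of K8 or Bulldozer: all ops are independent ALU ops, units are not pipelined (a unit is busy for the op's full latency), AGUs are ignored, and the function names and the example op mix are made up for illustration.

```python
def cycles_fixed_lanes(lane_of_op, latencies, num_lanes):
    """Each op is bound to one lane at dispatch and cannot move.
    A lane runs its ops back to back, so its busy time is the sum
    of the latencies of the ops assigned to it."""
    lane_busy = [0] * num_lanes
    for lane, latency in zip(lane_of_op, latencies):
        lane_busy[lane] += latency
    return max(lane_busy)


def cycles_unified(latencies, num_units):
    """All ops wait in one queue; each op, taken in order, goes to
    whichever unit becomes free first."""
    unit_free_at = [0] * num_units
    for latency in latencies:
        earliest = unit_free_at.index(min(unit_free_at))
        unit_free_at[earliest] += latency
    return max(unit_free_at)


# One slow op followed by three 1-cycle ops (hypothetical mix).
latencies = [3, 1, 1, 1]
# Unlucky dispatch: the second op lands in the same lane as the slow one.
lanes = [0, 0, 1, 2]

print(cycles_fixed_lanes(lanes, latencies, num_lanes=3))
print(cycles_unified(latencies, num_units=2))
```

With this contrived op mix the three fixed lanes need 4 cycles, because one short op is stuck behind the slow one, while the two unified units finish in 3. A luckier lane assignment would erase the gap, so this shows only the mechanism, not a performance estimate.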

**Sn0wm@n** · 08-25-2010, 02:07 AM

Originally Posted by -Boris-

And you think it must be capable of thought to arrange predetermined connections in an even and efficient way?

computers need to be programmed by humans in order to make stuff up... but even then we don't make perfect stuff up... so it would be easier for a human to design the best layout by hand than to let a robot do the work... that's all I'm saying... some logic can be done by computers... but some complicated logic portions of a chip would indeed benefit from human intervention
