AMD's Bobcat and Bulldozer

**Manicdan** · 08-25-2010, 07:37 AM

Originally Posted by Mats

No, it's not like that. They will run at 100 % each, either one or two cores.

It's all about the module, which will deliver 180 % of a single core's performance where two regular full cores would deliver 2 x 100 %, while being much smaller than two regular cores.

it sounds like the same thing being looked at from both directions

if you get 100 uberpoints with 1 core, you get 180 with 2 cores in the same module

if we move on to the whole "33% more cores, 50% more perf" claim for the server stuff, take scaling efficiency estimates into account, and assume similar clocks, my extremely simple answer is 25% better perf clock for clock between BD and PII. add in better turbo (200 more MHz is a safe bet), and i think a single BD core should be similar to a PII core at 5 GHz (or 40-50% better single-threaded perf, but for Thuban that's for 3 cores; for BD i'm thinking 4 threads)

so my very horrible estimate is that you will need 4 cores of PII at 5 GHz to equal the stock turbo'd setting of BD in gaming.

(please do not try and compare your information to this post, it is not in any way scientific or accurate or detailed, its just an edjumacated guess from an edjumacated guy)
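A rough sketch of the arithmetic behind that guess. Both clock speeds below are hypothetical placeholders chosen only to make the numbers line up with the post, not announced specifications, and `equivalent_clock_ghz` is an illustrative helper name.

```python
# Back-of-envelope version of the estimate above.
# All clock values are hypothetical placeholders, not announced specs.

def equivalent_clock_ghz(clock_ghz: float, per_clock_gain: float) -> float:
    """Clock an older core would need to match a newer core running at clock_ghz."""
    return clock_ghz * (1.0 + per_clock_gain)

per_clock_gain = 0.25   # the post's guess: 25% better clock for clock
bd_turbo_ghz = 4.0      # assumed Bulldozer turbo clock (hypothetical)
pii_clock_ghz = 3.4     # assumed Phenom II clock for comparison (hypothetical)

needed = equivalent_clock_ghz(bd_turbo_ghz, per_clock_gain)
speedup = needed / pii_clock_ghz - 1.0

print(f"Phenom II clock needed to match: {needed:.2f} GHz")
print(f"Single-thread gain over a {pii_clock_ghz} GHz Phenom II: {speedup:.0%}")
```

With these placeholder clocks the result lands in the 40-50% range mentioned in the post; different clock assumptions move it accordingly.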

**informal** · 08-25-2010, 07:40 AM

My estimate in integer work is that one BD core will be solidly faster than one Phenom II core, no turbo, same clocks (all around 20-25%). Add in *better* turbo, add in 33% more cores vs. Thuban, add in complete ISA support (with SSE1-4.2, AES, AVX, FMA4 and XOP) and you can imagine what happens to Thuban.

**Mats** · 08-25-2010, 07:47 AM

Originally Posted by Manicdan

if you get 100 uberpoints with 1 core, you get 180 with 2 cores in the same module

The way I see it:

By making a module out of a core you add 12 % die space, and that raises performance to 180 %.

I hope this 180 % figure is close to the truth..
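To put numbers on that trade-off, here is a minimal sketch that takes the 12 % and 180 % figures from the post at face value and compares throughput per unit of area. Areas are relative to one core, and shared caches or uncore are ignored, which is a simplifying assumption.

```python
# Figures from the post: a module costs ~12% more area than one core
# and delivers ~180% of one core's throughput. Units are relative.

def perf_per_area(perf: float, area: float) -> float:
    """Throughput delivered per unit of die area (relative units)."""
    return perf / area

module_area, module_perf = 1.12, 1.80        # one two-core module
two_cores_area, two_cores_perf = 2.00, 2.00  # two regular full cores

module_density = perf_per_area(module_perf, module_area)
two_core_density = perf_per_area(two_cores_perf, two_cores_area)

print(f"Module:         {module_density:.2f} perf per unit area")
print(f"Two full cores: {two_core_density:.2f} perf per unit area")
print(f"Throughput given up vs. two cores: {1 - module_perf / two_cores_perf:.0%}")
```

Under these assumptions the module gives up 10 % of the throughput of two full cores while using a little over half the area.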

**kl0012** · 08-25-2010, 07:49 AM

Originally Posted by Hornet331

Slides say single-thread performance is higher than the current P2, but that doesn't mean its IPC is higher.

The slides say "significant improvement in performance/watt/mm2". Not sure if that also means an improvement in absolute performance per thread. There is not a single word about higher IPC.
Also, Bulldozer's FPU subsystem disappoints a bit. It looks like it is able to start only one 256-bit instruction per cycle (SB can start two). This means that Bulldozer can match SB on FPU throughput only if FMA is used, but that is less flexible and not always possible. Also, SB can independently start a third, non-arithmetic FP instruction (permute, FP move, etc.); not sure how that is implemented in Bulldozer. And the final disappointment: Zambezi is going to have only 4x 256-bit FPUs while Jaketown will have 8x 256-bit FPUs.
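The throughput comparison in this post can be sketched as a peak-FLOPs count. The issue rates and unit counts are the ones claimed in the post, not measured values; single precision is assumed, and `peak_flops_per_cycle` is an illustrative helper.

```python
# Peak single-precision FLOPs per cycle, using the issue rates claimed above.
LANES_SP = 256 // 32  # 32-bit lanes in one 256-bit vector

def peak_flops_per_cycle(units: int, ops_per_cycle: int, flops_per_lane: int) -> int:
    """units: FPUs on the chip; ops_per_cycle: 256-bit arithmetic ops started
    per FPU per cycle; flops_per_lane: 1 for add or mul, 2 for fused multiply-add."""
    return units * ops_per_cycle * LANES_SP * flops_per_lane

# One Bulldozer module FPU: one 256-bit op per cycle.
bd_plain = peak_flops_per_cycle(1, 1, 1)
bd_fma = peak_flops_per_cycle(1, 1, 2)
# One Sandy Bridge core: two 256-bit ops per cycle (one add, one mul), no FMA.
sb_core = peak_flops_per_cycle(1, 2, 1)

# Whole chips as described in the post.
zambezi = peak_flops_per_cycle(4, 1, 2)   # 4 FPUs, FMA code
jaketown = peak_flops_per_cycle(8, 2, 1)  # 8 FPUs, add + mul pairs

print(f"BD module, no FMA: {bd_plain}  with FMA: {bd_fma}  SB core: {sb_core}")
print(f"Zambezi peak: {zambezi}  Jaketown peak: {jaketown}")
```

These are theoretical peaks only; real code rarely pairs every multiply with an add, which is the flexibility concern raised in the post.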

**xVeinx** · 08-25-2010, 07:52 AM

Originally Posted by Mats

No, it's not like that. They will run at 100 % each, either one or two cores.

It's all about the module, which will deliver 180 % of a single core's performance where two regular full cores would deliver 2 x 100 %, while being much smaller than two regular cores.

Another way to phrase it might be that there is a certain level of performance for two full cores, X. The efficiency of usage for those same two cores, for a given program, is Y. X is 100%; Y might be 75%, let's say. Bulldozer's arch achieves 180% where two full cores would achieve 200% (not Deneb cores, but full cores with the same performance characteristics as the new arch will have), bringing us down to 90% per-core performance. The efficiency for the program, even with the drop in possible "performance," can increase due to a more robust architecture removing latency (or at least hiding it) and improving throughput (IPC). This might bring the efficiency to 85%, let's say.

75% of 100% is 75%
85% of 90% is 76.5%
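The same two products, as a small sketch. The 75% and 85% efficiency values are the illustrative guesses from this post, not measurements, and `net_per_core` is a hypothetical helper name.

```python
# Net per-core performance = potential per-core performance x usage efficiency.
# Efficiency values are the post's illustrative guesses, not measurements.

def net_per_core(potential: float, efficiency: float) -> float:
    return potential * efficiency

full_core = net_per_core(1.00, 0.75)    # two full cores, 75% efficiency
module_core = net_per_core(0.90, 0.85)  # 180% module => 90% per core, 85% efficiency

print(f"Full core:   {full_core:.1%}")
print(f"Module core: {module_core:.1%}")
print("Module core ahead" if module_core > full_core else "Full core ahead")
```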

AMD doesn't really have to lose anything here, but has the potential to drive up the net performance with IPC improvements, clockspeed, and increased usage efficiency. If they can garner all three, then they could easily get the 50% multi-threaded performance improvements mentioned.

For single-threaded applications, we (or, some notable folks here anyway) see 90% and assume worse performance than Deneb. First, it's 90% of possible performance (i.e., no relation to Deneb). Second, an increase in the same three variables above allows for a boost in net single-threaded performance if they capture them well. Clockspeed here is especially in the form of Turbo Core.

By trading some performance for die space, efficiency, and power consumption, they have a better chance at appropriately balancing multi-threaded and single-threaded performance (i.e., no need to turn off CMP because of a decrease in performance in well-threaded applications, especially since you can't anyway). Moreover, you have more room in a given TDP to increase clockspeed and tweak things such that the net loss from the architecture relative to its potential performance is minimal at best.

**Hornet331** · 08-25-2010, 07:53 AM

Originally Posted by informal

My estimate in integer work is that one BD core will be solidly faster than one Phenom II core, no turbo, same clocks (all around 20-25%). Add in *better* turbo, add in 33% more cores vs. Thuban, add in complete ISA support (with SSE1-4.2, AES, AVX, FMA4 and XOP) and you can imagine what happens to Thuban.

Lol, with those numbers it will be faster than SB... P2 is only ~15% slower than Nehalem in single-threaded loads (clock for clock), sometimes even less. SB might add another 5-10%... bit overoptimistic, aren't we...

**kl0012** · 08-25-2010, 08:00 AM

Originally Posted by informal

You have a total of 4 instructions executed by each integer core. In 10h you had a total of 3 (be it mem or math ops). That's a 33% difference. Now count in the massively improved prefetch and other stuff in the front end that are supposed to keep the core(s) busy all the time and you have a potentially pretty nice boost in IPC. Remember that with 10h, the 3 ALUs were paired with AGUs and sat around just waiting for data, doing nothing. The new 2+2 scheme is built in order to address the under-utilization.

That's not totally true. K8-K10 cores can do 3 register additions, subtractions, shifts, or moves per cycle while Bulldozer can do only two. In K8-K10 the AGU was fused with the ALU (this is the reason why out-of-order loads/stores were impossible on those architectures), but K8-K10 were still able to execute more than 2 (2.7) arithmetic instructions with a memory operand per cycle.
Here is the throughput table:
http://gmplib.org/~tege/x86-timing.pdf

**informal** · 08-25-2010, 08:00 AM

Originally Posted by Hornet331

Lol, with those numbers it will be faster than SB... P2 is only ~15% slower than Nehalem in single-threaded loads (clock for clock), sometimes even less. SB might add another 5-10%... bit overoptimistic, aren't we...

Phenom II is around 15 to 20% slower than Nehalem in single-thread tasks, no turbo counted. When AMD designed Shanghai they managed to squeeze out around 5-8% in IPC compared to Agena. Here you have a complete pipeline redesign with special focus on removing the present bottlenecks. If Shanghai brought that much while being a shrink, 20% over Shanghai is really not impossible, given everything we know about Bulldozer (I didn't mention the new shared L2, larger L3, better IMC, 2x the load/store BW of Shanghai, etc.).

**informal** · 08-25-2010, 08:03 AM

Originally Posted by kl0012

That's not totally true. K8-K10 cores can do 3 register additions, subtractions, shifts, or moves per cycle while Bulldozer can do only two. In K8-K10 the AGU was fused with the ALU (this is the reason why out-of-order loads/stores were impossible on those architectures), but K8-K10 were still able to execute more than 2 (2.7) arithmetic instructions with a memory operand per cycle.
Here is the throughput table:
http://gmplib.org/~tege/x86-timing.pdf

You also had separate schedulers for math and address ops in 10h; now you have a unified scheduler for both. The problem with 10h is that the ALUs were underutilized (although in theory, as you show, it was quite capable) and BD is supposed to fix that bottleneck.

**Hornet331** · 08-25-2010, 08:05 AM

Originally Posted by kl0012

The slides say "significant improvement in performance/watt/mm2". Not sure if that also means an improvement in absolute performance per thread. There is not a single word about higher IPC.
Also, Bulldozer's FPU subsystem disappoints a bit. It looks like it is able to start only one 256-bit instruction per cycle (SB can start two). This means that Bulldozer can match SB on FPU throughput only if FMA is used, but that is less flexible and not always possible. Also, SB can independently start a third, non-arithmetic FP instruction (permute, FP move, etc.); not sure how that is implemented in Bulldozer. And the final disappointment: Zambezi is going to have only 4x 256-bit FPUs while Jaketown will have 8x 256-bit FPUs.

Well, for servers they use MCM anyway, so they offer the same amount of FPUs.

**~~LesGrossman~~** · 08-25-2010, 08:06 AM

So... no more bulldozer single thread doomsday scenario?

Bulldozer no go for AM3.

**blindbox** · 08-25-2010, 08:07 AM

Eh? The way I see it, AMD added tons of stuff to enhance single threaded performance. It's not just one or two things. Are you sure you're reading the same slides I am reading?

Sandy Bridge vs Bulldozer is going to be fun times.

Is an 8-core sandy bridge coming to the desktop?

**-Boris-** · 08-25-2010, 08:08 AM

Originally Posted by Chumbucket843

not if the delay per stage stays the same. longer pipelines also tend to lead to lower ipc.

That really depends, they can increase IPC if they are related to prefetch, which they seem to be.

Originally Posted by Solus Corvus

2 ALUs per core is somewhat disappointing. I guess instead of developing some technological trick (that would take die space and power) to increase the efficiency of underutilized units, they just took them out. It's probably a big boon for power efficiency, but it leaves IPC up in the air.

On the face of it, taking out ALUs seems like it would reduce IPC. But since IPC rarely goes past 2 on average that might not be the case. The branch prediction, prefetch, decoders, reorder, and caches sound nice. If they could utilize these units to keep the remaining ALUs + FPU fed a larger percentage of the time IPC could still go up significantly.

Considering the reduced ALUs and shared units, it sounds like BD will be a very power efficient architecture. Couple that with a long pipeline, aggressive prefetch, etc. and there may be major headroom for turbo/OC. Though, if the rumors are true, they may need a new process to realize its full potential.

As for single-threaded versus multi-threaded. Low ST performance would affect office users and gamers (slightly). But as a power user I won't be affected because in low thread situations I just run more programs. Servers and HPC sound like they would be serviced by this chip well.

Now you have 2 usable ALUs and 2 usable AGUs. Phenom II has an average of 1.5 each. 4 pipes vs 3. And capable of 4 instructions per clock instead of 3. This means that it's all for the better.

**kl0012** · 08-25-2010, 08:16 AM

Originally Posted by Hornet331

Well, for servers they use MCM anyway, so they offer the same amount of FPUs.

But it seems that Jaketown is going to the desktop as well. Also, Jaketown's die size should be about 290-300 mm2 (which is in the range of Lynnfield), so Intel can also take the multi-module route in the server segment.

**Hornet331** · 08-25-2010, 08:23 AM

Originally Posted by kl0012

But it seems that Jaketown is going to the desktop as well. Also, Jaketown's die size should be about 290-300 mm2 (which is in the range of Lynnfield), so Intel can also take the multi-module route in the server segment.

Well yes, Jaketown will also come to the desktop, but I bet it's gonna be EE. So I guess the most potent single-socket desktop solution will still be in Intel's hands.

My comment was aimed more at the server space. Sure, they could also do MCM, but then their 10-core Westmere-EX would be starved.

**kl0012** · 08-25-2010, 08:23 AM

Originally Posted by informal

You also had separate schedulers for math and address ops in 10h; now you have a unified scheduler for both. The problem with 10h is that the ALUs were underutilized (although in theory, as you show, it was quite capable) and BD is supposed to fix that bottleneck.

You're probably right in saying that the ALUs in K10 were underutilized, but I have a hard time believing that the average IPC was less than 2. Now AMD is adding a lot of stuff to improve ALU utilization but at the same time is reducing the number of ALUs. Where is the logic?

**Hornet331** · 08-25-2010, 08:34 AM

Well, x86 quite sucks at IPC... 1.5 is a good value.

**informal** · 08-25-2010, 08:44 AM

Originally Posted by kl0012

You're probably right in saying that the ALUs in K10 were underutilized, but I have a hard time believing that the average IPC was less than 2. Now AMD is adding a lot of stuff to improve ALU utilization but at the same time is reducing the number of ALUs. Where is the logic?

I know it's hard to believe but Hornet is correct. The average IPC in the SPEC2006 test suite is around ~1. This is on a Core 2 class chip, that is 4(+1) wide.

The average IPC’s for CPU2006 and CPU2000 benchmarks were measured at 1.006 and 0.85 respectively

PDF link

Now imagine what it is on an underutilized 3-wide core like Phenom. It's easily 15% lower than on Core 2/i7. That's why I think that the 4-wide (2+2) cores in BD will successfully address the execution bottleneck in 10h.
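As a quick sketch of what those figures imply: the 1.006 average is the CPU2006 number quoted above, while the 15% gap is the guess from this post, so the Phenom value is an estimate and not a measurement.

```python
# Implied average IPC and issue-slot usage from the figures in this post.
core2_ipc = 1.006       # CPU2006 average quoted above (Core 2 class chip)
core2_width = 4         # issue width stated in the post
phenom_penalty = 0.15   # the post's guess for the Phenom gap, not a measurement
phenom_width = 3

phenom_ipc = core2_ipc * (1.0 - phenom_penalty)
core2_fill = core2_ipc / core2_width
phenom_fill = phenom_ipc / phenom_width

print(f"Estimated Phenom IPC: {phenom_ipc:.3f}")
print(f"Issue slots used, Core 2: {core2_fill:.0%}")
print(f"Issue slots used, Phenom (estimate): {phenom_fill:.0%}")
```

Either way, average IPC sits well below the issue width, which is the point being argued here.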

**~~terrace215~~** · 08-25-2010, 09:01 AM

Originally Posted by Chumbucket843

also gate last is inferior to gate replacement integration. in gate replacement you can optimize both pmos and nmos transistors. the problem is that intel has patented a lot of ip with respect to HK/MG. i dont know if glofo wants to license intel's patents.

Just to correct a glitch: It's gate first HKMG SOI that GloFo has unwisely chosen. Intel uses replacement gate, a.k.a. gate last.

Interesting about GF using mostly the 45nm metal stack @ 32nm.

**~~terrace215~~** · 08-25-2010, 09:05 AM

Originally Posted by kl0012

You're probably right in saying that the ALUs in K10 were underutilized, but I have a hard time believing that the average IPC was less than 2. Now AMD is adding a lot of stuff to improve ALU utilization but at the same time is reducing the number of ALUs. Where is the logic?

They are optimizing for server application throughput, at the expense of client low-threaded performance. That might make sense for them, considering the initial target market is virtually all server/hpc. In client, they have Llano in the middle, and Ontario down low... so maybe they decided they couldn't be all things to all segments with BD.

**Hornet331** · 08-25-2010, 09:06 AM

Originally Posted by terrace215

Just to correct a glitch: It's gate first HKMG SOI that GloFo has unwisely chosen. Intel uses replacement gate, a.k.a. gate last.

Interesting about GF using mostly the 45nm metal stack @ 32nm.

But aren't they gonna offer gate last with 22nm?

**informal** · 08-25-2010, 09:11 AM

Originally Posted by terrace215

They are optimizing for server application throughput, at the expense of client low-threaded performance.

You are basing this on Paul DeMone's comment?

**~~terrace215~~** · 08-25-2010, 09:11 AM

Originally Posted by Hornet331

Well yes, Jaketown will also come to the desktop, but I bet it's gonna be EE. So I guess the most potent single-socket desktop solution will still be in Intel's hands.

My comment was aimed more at the server space. Sure, they could also do MCM, but then their 10-core Westmere-EX would be starved.

Westmere-EX could still be segmented by RAS functionality, if they wanted to produce a very high-cored/moderate speed SB "Interlagos" of their own, whether through the MCM route, or directly.

**Calmatory** · 08-25-2010, 09:12 AM

ALU utilization is highly dependent on the code being run. Anyone can disassemble some software and ponder a while how it would work in terms of ILP; the chances are, very poorly.

**~~terrace215~~** · 08-25-2010, 09:12 AM

Originally Posted by Hornet331

But aren't they gonna offer gate last with 22nm?

There's been talk of that, but I don't think GF has disclosed that yet, at least not publicly.
