AMD's Bobcat and Bulldozer

**-Boris-** · 08-31-2010, 09:23 AM

Originally Posted by AliG

No one is sure, all JF has said is that AMD is working with MS to devise core utilization order etc.

I would imagine, that ideally for multithreaded tasks you would want the same module due to the shared L2, but for separate tasks you would want different modules due to the performance loss from sharing components

At the same time as you have a performance loss from shared components, you have a boost from Turbo. If four threads run on one module each, you will have no turbo, since turbo management is at the module level and not at the core level. If all threads run on two modules, you will have a 10% performance hit, but you will have turbo making up for that and more.

**god_43** · 08-31-2010, 09:28 AM

Originally Posted by Hornet331

http://flamewheelspin.ytmnd.com/

perfectly sums up this thread...

**AliG** · 08-31-2010, 09:29 AM

Originally Posted by Motiv

It was answered on the blog, that the shared L2 Cache wouldn't really help.

As for the multitasking, I suspect it will work like Intel's HT. As far as I'm aware, that doesn't cripple 1 core only, but spreads the load out amongst the other cores first and foremost.

If that's the case, then ideally AMD would create a lineup that was priced such that 1 module ~ 1 Intel HT core, but we know that's probably never going to happen.

**Motiv** · 08-31-2010, 09:29 AM

Originally Posted by -Boris-

At the same time as you have a performance loss from shared components, you have a boost from Turbo. If four threads run on one module each, you will have no turbo, since turbo management is at the module level and not at the core level. If all threads run on two modules, you will have a 10% performance hit, but you will have turbo making up for that and more.

Why would turbo work when it's 2 module/2 core? Surely turbo would be more suited to running when all 4 modules are only utilising 1 core?

I thought the performance hit would be around 20%, if both cores are used within a module.

**Motiv** · 08-31-2010, 09:30 AM

Originally Posted by AliG

If that's the case, then ideally AMD would create a lineup that was priced such that 1 module ~ 1 Intel HT core, but we know that's probably never going to happen.

I suspect we'll be seeing an AMD lineup using 8 cores vs 4 cores, even if that means 4 modules. The AMD cores within the modules are certainly more core-like than HT.

At the end of the day, the prices will be set based on workloads and how the chip copes with them. If a 4 module/8 core AMD chip (at say 2.5GHz) can deal with the same workload as a 4 core (8 HT) Intel chip (at 2GHz), then that will be its price window (speed values for argument's sake etc).

**-Boris-** · 08-31-2010, 09:39 AM

Originally Posted by Motiv

Why would turbo work when it's 2 module/2 core? Surely turbo would be more suited to running when all 4 modules are only utilising 1 core?

I thought the performance hit would be around 20%, if both cores are used within a module.

If you run one thread per module, all modules work at the same time and no modules rest, therefore no module can enter turbo. But if two of the modules each run two threads, then two modules rest, and if two modules rest, the other two can enter turbo mode.
You can't have turbo and all modules working at the same time. The fact that part of a module is idle doesn't matter, since turbo works at the module level.

And it's said everywhere that a second thread running in a module "only" increases performance by 80%. That is 1.8x instead of the 2x of a traditional dual-core approach, i.e. a 10% performance loss.
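
To put rough numbers on the 80% figure and the packing-versus-turbo trade-off described above, here is a back-of-the-envelope sketch. The 1.8x scaling is the figure quoted in this thread; the function names and the idea of a single uniform turbo multiplier are assumptions made purely for illustration, since AMD has not disclosed how its turbo actually behaves.

```python
# Toy model only; all names are hypothetical and nothing here is AMD data.
SECOND_THREAD_GAIN = 0.80  # second thread in a module adds ~80% (figure quoted in the thread)


def module_throughput(threads: int, turbo_boost: float = 0.0) -> float:
    """Relative throughput of one module running 0, 1 or 2 threads.

    1.0 means one thread alone in a module at base clock.
    turbo_boost is an ASSUMED fractional clock increase (0.10 = +10%).
    """
    if threads not in (0, 1, 2):
        raise ValueError("a module runs 0, 1 or 2 threads")
    base = {0: 0.0, 1: 1.0, 2: 1.0 + SECOND_THREAD_GAIN}[threads]
    return base * (1.0 + turbo_boost)


def loss_vs_dual_core() -> float:
    """Shortfall of a fully loaded module against two independent cores (2.0)."""
    return 1.0 - module_throughput(2) / 2.0


def break_even_turbo() -> float:
    """Turbo boost at which 4 threads packed onto 2 modules match
    4 threads spread over 4 modules running at base clock."""
    spread = 4 * module_throughput(1)
    packed = 2 * module_throughput(2)
    return spread / packed - 1.0


print(f"loss vs dual core: {loss_vs_dual_core():.1%}")
print(f"break-even turbo:  {break_even_turbo():.1%}")
```

Under these assumptions the loss comes out at 10%, and packing four threads onto two modules only pays off if turbo adds more than roughly 11% to the clock. Whether the real part works anything like this is unknown.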

**AliG** · 08-31-2010, 09:45 AM

Originally Posted by Motiv

At the end of the day, the prices will be set based on workloads and how the chip copes with them. If a 4 module/8 core AMD chip (at say 2.5GHz) can deal with the same workload as a 4 core (8 HT) Intel chip (at 2GHz), then that will be its price window (speed values for argument's sake etc).

I doubt that would happen though, due to manufacturing costs. I have to believe 1 module is bigger than 1 Intel core. For consumers, the Intel core would make more sense, whereas for servers the module would make more sense, as you are comparing 130% to 180% of the integer performance. Thus, since server processors are always priced with much higher margins in mind, they could probably line up their processors that way, so even if Intel's IPC is 10% higher, they would still win the performance battle.

However, I just can't see AMD being able to price their products as you described for the general consumer and still make a profit, especially when Intel is at 32nm whereas AMD is stuck at 45nm. Even if they could, if the Intel product offers anywhere from 5-20% more IPC, I would just buy an unlocked K-series processor and be happy with that. Having anything beyond 4 threads is pretty much useless for me, so single-threaded performance is what will earn my money.

**informal** · 08-31-2010, 09:48 AM

Originally Posted by AliG

However, I just can't see AMD being able to price their products as you described for the general consumer and still make a profit, especially when Intel is at 32nm whereas AMD is stuck at 45nm.

Bulldozer is 32nm SOI highk/mg...

**AliG** · 08-31-2010, 09:52 AM

Originally Posted by informal

Bulldozer is 32nm SOI highk/mg...

Is it? 45nm makes a lot more sense because it's a proven process. That seems like a bad idea considering how well their 65nm K10 transition went. Perhaps that's the root of all the delays.

**-Boris-** · 08-31-2010, 09:53 AM

Originally Posted by AliG

Is it? 45nm makes a lot more sense because it's a proven process. That seems like a bad idea considering how well their 65nm K10 transition went. Perhaps that's the root of all the delays.

No, it was planned for 45nm, the delays made them change that to 32nm, giving them more time to develop the architecture.

**JF-AMD** · 08-31-2010, 09:55 AM

Originally Posted by -Boris-

If you run one thread per module, all modules work at the same time and no modules rest, therefore no module can enter turbo. But if two of the modules each run two threads, then two modules rest, and if two modules rest, the other two can enter turbo mode.
You can't have turbo and all modules working at the same time. The fact that part of a module is idle doesn't matter, since turbo works at the module level.

And it's said everywhere that a second thread running in a module "only" increases performance by 80%. That is 1.8x instead of the 2x of a traditional dual-core approach, i.e. a 10% performance loss.

I would not make assumptions about how our processor works based on how our competitor has implemented technology.

As you may (or may not) be aware, I was critical of the way that they implemented turbo. I am happy with the way that we have implemented it. I can't get into specifics, but I can assure you that when you look at the two implementations, you will see a clear difference and you'll appreciate what we have done with the technology.

I hate to say things like that without being able to disclose any of the detail, but more than that I hate people going down the path of assuming things about our product that might not be fully accurate. It's a fine line.

Just keep in mind that this is a brand new architecture and things are going to be approached from a different perspective. The modularity is only one small part of it; there are a lot of things that have been changed.

People have been asking for someone to really bring some real innovation to the market, I think you will see that.

**AliG** · 08-31-2010, 09:55 AM

Originally Posted by -Boris-

No, it was planned for 45nm, the delays made them change that to 32nm, giving them more time to develop the architecture.

that explains it then

**-Boris-** · 08-31-2010, 09:58 AM

Originally Posted by JF-AMD

I would not make assumptions about how our processor works based on how our competitor has implemented technology.

As you may (or may not) be aware, I was critical of the way that they implemented turbo. I am happy with the way that we have implemented it. I can't get into specifics, but I can assure you that when you look at the two implementations, you will see a clear difference and you'll appreciate what we have done with the technology.

I hate to say things like that without being able to disclose any of the detail, but more than that I hate people going down the path of assuming things about our product that might not be fully accurate. It's a fine line.

Just keep in mind that this is a brand new architecture and things are going to be approached from a different perspective. The modularity is only one small part of it; there are a lot of things that have been changed.

People have been asking for someone to really bring some real innovation to the market, I think you will see that.

Ok, I've read that it was working on a module level. But I guess you are telling me that there is more to it than that?

**informal** · 08-31-2010, 10:01 AM

It is working on a module level, but that is all we know. There are many things AMD didn't reveal, for obvious reasons.

**Solus Corvus** · 08-31-2010, 10:33 AM

This is quite a fascinating architecture. If that RWT article is accurate then I am extremely interested in seeing some benchmarks.

I don't buy that overall per-core IPC must necessarily decrease (in relation to K10) because of reduced integer ALUs. Of course they will obviously miss out, compared to a 3 or 4 ALU core, on cases where int ILP is greater than 2. But in cases where the code is more mixed int and memory ops, IPC could go up in relation to K10 - based on available execution resources alone. Which case is more common obviously depends on the specific code being run. Though I'd suggest that a program with consistently high integer ILP would be more efficient using packed integers (handled by the FPU) anyway.
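
The point about integer ILP above 2 can be illustrated with a toy issue model. This is only a sketch under simplifying assumptions (every op takes one cycle, no memory ops, no branch or cache effects), and `cycles_needed` and the example dependency lists are made up for illustration; it is not a model of K10 or Bulldozer.

```python
# Toy model only: every op takes one cycle and may depend on earlier ops.
def cycles_needed(deps: list[list[int]], alus: int) -> int:
    """Cycles to execute all ops when up to `alus` ready ops issue per cycle.

    deps[i] lists the indices (all smaller than i) that op i waits for.
    """
    if alus < 1:
        raise ValueError("need at least one ALU")
    done_at: dict[int, int] = {}  # op index -> cycle in which it executed
    cycle = 0
    while len(done_at) < len(deps):
        cycle += 1
        # An op is ready once all of its inputs finished in an earlier cycle.
        ready = [
            i for i in range(len(deps))
            if i not in done_at
            and all(done_at.get(d, cycle) < cycle for d in deps[i])
        ]
        for i in ready[:alus]:
            done_at[i] = cycle
    return cycle


# Two independent dependency chains of three ops each: ILP of 2.
two_chains = [[], [], [0], [1], [2], [3]]
# Three independent chains of two ops each: ILP of 3.
three_chains = [[], [], [], [0], [1], [2]]

for name, ops in (("two_chains", two_chains), ("three_chains", three_chains)):
    for alus in (2, 3):
        c = cycles_needed(ops, alus)
        print(f"{name}: {alus} ALUs -> {c} cycles, IPC {len(ops) / c:.2f}")
```

In this model the third ALU buys nothing for the two-chain case and only helps once the code really exposes three independent integer ops per cycle, which is the case being argued about.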

If we add to that the fact that missed branches and cache misses (both significantly improved in BD) have a much greater effect on overall IPC than some missed ILP cases, it's clear that claiming lower IPC than K10 isn't really justified based on fewer ALUs alone. I doubt that BD will have lower IPC per-core than K10. In reality it's probably somewhere in the vast gulf between PII and SB.

As already noted though, IPC isn't the only factor in a processor's performance. This is obviously a high frequency design. The memory and cache subsystems are a big leap forward for AMD. They are designed to keep a large number of cores well fed - to minimize the amount of time that execution resources are waiting on data and thus increase efficiency. Intel will probably continue to lead in IPC by a significant margin. Whether AMD can increase frequency enough to make single threaded performance competitive remains to be seen. On the multi-threaded side BD sounds like a monster.

If AMD can't match Intel's single threaded performance it looks like we will have a split market come 2011. Office users and gamers might do best with SB while people doing encoding, folding, heavy multitasking, HPC, and servers might do best with BD.

**Motiv** · 08-31-2010, 10:40 AM

Originally Posted by Solus Corvus

snipped...

If AMD can't match Intel's single threaded performance it looks like we will have a split market come 2011. Office users and gamers might do best with SB while people doing encoding, folding, heavy multitasking, HPC, and servers might do best with BD.

While I agree with everything else you put (that RWT article is a must read for anyone who hasn't), I would say this last statement is wrong.

I suspect that margins will be significantly lower for gamers/office users (although will Bobcat/Llano fill the office space?). It could be a great result for overclockers, as we'll have access to decent multicore tech that should have a bit of room to mess with.

So unless Intel goes for a price war, all AMD has to do is price match on a performance level.

It's only people wanting the absolute max who care about who has the best CPU. The mainstream gamer just wants to spend £200 on a CPU and make sure that it is competitive with other CPUs around that price point.

**nn_step** · 08-31-2010, 10:43 AM

Originally Posted by Solus Corvus

This is quite a fascinating architecture. If that RWT article is accurate then I am extremely interested in seeing some benchmarks.

I don't buy that overall per-core IPC must necessarily decrease (in relation to K10) because of reduced integer ALUs. Of course they will obviously miss out, compared to a 3 or 4 ALU core, on cases where int ILP is greater than 2. But in cases where the code is more mixed int and memory ops, IPC could go up in relation to K10 - based on available execution resources alone. Which case is more common obviously depends on the specific code being run. Though I'd suggest that a program with consistently high integer ILP would be more efficient using packed integers (handled by the FPU) anyway.

If we add to that the fact that missed branches and cache misses (both significantly improved in BD) have a much greater effect on overall IPC than some missed ILP cases, it's clear that claiming lower IPC than K10 isn't really justified based on fewer ALUs alone. I doubt that BD will have lower IPC per-core than K10. In reality it's probably somewhere in the vast gulf between PII and SB.

As already noted though, IPC isn't the only factor in a processor's performance. This is obviously a high frequency design. The memory and cache subsystems are a big leap forward for AMD. They are designed to keep a large number of cores well fed - to minimize the amount of time that execution resources are waiting on data and thus increase efficiency. Intel will probably continue to lead in IPC by a significant margin. Whether AMD can increase frequency enough to make single threaded performance competitive remains to be seen. On the multi-threaded side BD sounds like a monster.

If AMD can't match Intel's single threaded performance it looks like we will have a split market come 2011. Office users and gamers might do best with SB while people doing encoding, folding, heavy multitasking, HPC, and servers might do best with BD.

Why don't we look at the argument from another viewpoint?

Show me the source code to 1 program which can sustain, under optimal conditions, an IPC greater than 1.8, and for which multi-threading isn't a better solution.

For those of you smart enough to actually wonder what makes an IPC greater than 1 possible [in source code], let me save you a long, winding trip and give you the answer: such a beast DOES NOT EXIST.

**Solus Corvus** · 08-31-2010, 10:55 AM

Let me think about that.

**Opteron146** · 08-31-2010, 10:55 AM

Originally Posted by -Boris-

BD has more resources since it can use 2 ALUs and 2 AGUs every clock, while Phenom II averages 1.5 ALUs and 1.5 AGUs since they share pipes. Again, if you can't use it, it isn't a resource. 2+2=4; (3+3)/2=3.

Hans wrote for the K8:

Each Scheduler can launch one ALU and one AGU operation per cycle. The ALU operation may come from one x86 instruction while the AGU operation may come from another.

http://chip-architect.com/news/2003_...it_Core.html#3
That is not 1.5, that is 3 ... maybe you missed the fact that the MacroOps are split into µOps at that stage?

**AliG** · 08-31-2010, 10:57 AM

Correct, there are 3 full integer operations in K8 and on that can do either ALU or AGU, but as I understand it, it is more efficient, due to improved prefetchers and smaller die sizes, to use a simplified 2+2 design.

**informal** · 08-31-2010, 11:01 AM

Originally Posted by Opteron146

Hans wrote for the K8:
http://chip-architect.com/news/2003_...it_Core.html#3
That is not 1.5, that is 3 ... maybe you missed the fact that the MacroOps are split into µOps at that stage?

Yes, but at the back end the Macro ops are retired, and K8/10h can do 3 of those while each Bulldozer integer core can do 4. That is a 33% difference.

**freeloader** · 08-31-2010, 12:41 PM

Originally Posted by JF-AMD

I would not make assumptions about how our processor works based on how our competitor has implemented technology.

As you may (or may not) be aware, I was critical of the way that they implemented turbo. I am happy with the way that we have implemented it. I can't get into specifics, but I can assure you that when you look at the two implementations, you will see a clear difference and you'll appreciate what we have done with the technology.

I hate to say things like that without being able to disclose any of the detail, but more than that I hate people going down the path of assuming things about our product that might not be fully accurate. It's a fine line.

Just keep in mind that this is a brand new architecture and things are going to be approached from a different perspective. The modularity is only one small part of it; there are a lot of things that have been changed.

People have been asking for someone to really bring some real innovation to the market, I think you will see that.

I'd love to see that, IN MY LIFETIME!... Just joking... Anyhow, I'm not asking any more questions about BD. I'm just going to wait for a product release.

**god_43** · 08-31-2010, 01:10 PM

Originally Posted by JF-AMD

I would not make assumptions about how our processor works based on how our competitor has implemented technology.

As you may (or may not) be aware, I was critical of the way that they implemented turbo. I am happy with the way that we have implemented it. I can't get into specifics, but I can assure you that when you look at the two implementations, you will see a clear difference and you'll appreciate what we have done with the technology.

I hate to say things like that without being able to disclose any of the detail, but more than that I hate people going down the path of assuming things about our product that might not be fully accurate. It's a fine line.

Just keep in mind that this is a brand new architecture and things are going to be approached from a different perspective. The modularity is only one small part of it; there are a lot of things that have been changed.

People have been asking for someone to really bring some real innovation to the market, I think you will see that.

This right here is the human element! It separates you from the bots, JF! You show us that you care, that you want to tell us, but are unable to.

We (most of us, anyway) understand and appreciate what you have told us so far.

**MTd2** · 08-31-2010, 01:29 PM

JF, so is each BD core faster clock for clock than the Phenom cores? Or is that only when comparing the top-clocked processors of each product line?

**Opteron146** · 08-31-2010, 01:30 PM

Originally Posted by AliG

Correct, there are 3 full integer operations in K8 and on that can do either ALU or AGU,

No, it is not "either", it is both ... what do you not understand in the quote from Hans' article?

but as I understand it, it is more efficient, due to improved prefetchers and smaller die sizes, to use a simplified 2+2 design

That is correct. The current IPC of typical code is around 1, and I think Nehalem achieves 1.5-1.7 in the best cases, thus 2 pipes are enough.

Originally Posted by informal

Yes, but at the back end the Macro ops are retired, and K8/10h can do 3 of those while each Bulldozer integer core can do 4. That is a 33% difference.

Yes, you are right, but I never said anything against that point ;-)
Maybe one note on that, because I read it earlier: the AGU results are not retired; they go immediately into the LD/STR units, so the waiting µOp can get its memory data ;-) Later, after the calculation of the µOp is finished, that µOp is retired.
So in short, the retire / ExU ratio is 1:2 for both, not 1:3. For K10 it's 3:6 and for BD it's 4:8.
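
For anyone who wants to check the ratios, a quick sketch using only the figures stated in the posts above (3 retired macro-ops against 6 execution units for K10, 4 against 8 for BD). The dictionary and function names are made up for illustration, and the numbers are the posters' claims, not official specifications.

```python
from fractions import Fraction

# (retired macro-ops per cycle, execution units) as claimed in this thread.
designs = {"K10": (3, 6), "BD": (4, 8)}


def retire_to_exu(retire: int, exu: int) -> Fraction:
    """Retire width divided by execution-unit count, as an exact fraction."""
    return Fraction(retire, exu)


for name, (retire, exu) in designs.items():
    print(f"{name}: retire/ExU = {retire_to_exu(retire, exu)}")

# Relative increase in retire width going from K10 (3) to BD (4).
retire_gain = Fraction(designs["BD"][0], designs["K10"][0]) - 1
print(f"retire width gain: {float(retire_gain):.0%}")
```

Both designs reduce to 1/2, and the retire width gain works out to one third, which matches the 33% quoted earlier.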
