Module divisions are transparent... so does that mean natural multi-threading? meaning a single threaded app will be split among 2 cores at the module level?
Yes, I already read the article...
The important part is the other enhancements. ;) Quote:
The 3rd ALU does have some performance benefits, and AMD canned it to reduce die size, but AMD mentioned that the 4-wide front end, fusion and other enhancements more than make up for this reduction. In other words, while there’s fewer single thread integer execution resources in Bulldozer than Phenom II, single threaded integer performance should still be higher.
A 4-wide front end is wide, but on its own it's useless if you can't process the ops fast enough, so the solution is higher clock speeds, aka turbo.
Again, as I said, it's highly likely that Bulldozer will have higher single-threaded performance, but IPC will go down compared to a single Deneb core.
IPC will go down? Are you serious? :p:
Having more underutilized units is bad. Having fewer units that are constantly fed and always have something to do (the part you made bold) will make the "IPC", that relative and mostly meaningless number (which is 99% of the time <=2), higher. That's the whole point of this machine.
I doubt they will double-pump the ALUs. The units will just be better utilized, unlike the units in today's parts, which sit idle a lot of the time.
I think that the guy answering the question in the teleconference got a bit confused. It is possible that the new parts won't work in old boards (while the old CPUs will work in the new boards), but this has never been the case in the past: just look at the AM2/AM2+/AM3 comparison.
Read the AT article. It is indeed the hardest thing to do, but the way this machine is built allows it to get a bit further ahead of itself (with all the BP, data speculation, much-improved out-of-order load and store capability, etc.). I guess the more in-depth slides from Hot Chips will shed more light on this subject later today.
Here is my take on how BD will operate on sockets AM3 and AM3+
Bulldozer CPU = AM3 and AM3+ compatible
BD + AM3 = Dual Channel DDR3 enabled
BD + AM3+ = Quad Channel DDR3 enabled
I bet the difference is a few extra memory features on socket AM3+, just as was previously done with sockets AM2 => AM2+/AM3
Hmm somehow I doubt there will be 4 ch. on AM3+ boards.
Well, I'm only speculating that the difference will be such that you get extra benefits on AM3+ without hindering AM3 compatibility.
But it's true QC won't be possible. Perhaps enhanced power features, like the turbo mode, could differ between the two sockets though.
The Ontario 2-core + DX11 GPU die is about 75 mm² based on analysis of photos of the wafer shown at Computex.
Yeah, I wonder how long we have to wait before we get an answer for that.
No matter whether AM3+ has many or very few improvements, AMD won't tell us too soon, I guess, because they want to sell as many CPUs and chipsets as possible before BD.
The 870/880G/890GX/890FX were launched this spring, and when were the specs revealed? Two or three months before? And that was even for chipsets with very minor updates, especially the NB.
This makes me think that it isn't very likely that we'll see any specs this year.
This socket confusion will get sorted out pretty soon tho.
If BD doesn't drop into AM3, it had better be because it costs a lot and competes very well; then maybe the extra $100-200 would be worthwhile/irrelevant.
Since I was just going to wait for AM3+ before dropping AM2+ anyway, I'm not feeling very hurt, but it does kinda suck.
If all the websites had complied with the NDA, they'd have more info now. Those who did comply have the new slides :rolleyes:
As for the socket compatibility, there is no info about it in the AMD slides, and what was said during the conference call doesn't say as much as the websites do, either. So I guess if they don't have another source, they just made up the information trying to hit the mark.
I just can't understand how you can talk for 6 pages about information based on nothing. Calm down, people! It's starting to be impossible to find anything important between all the crap here. If it continues this way, people will move to other forums :shakes:
The FSB interface was in the processor too, otherwise it couldn't communicate with the NB. So HT has nothing to do with an NB.
The NB has always been the memory controller; the main bus interface has always been present in the processor. K7 had an area called the "Bus interface unit".
So I still can't see what part of the NB, other than the IMC, has been integrated into Phenom. :confused:
John, is this true? :eek: Quote:
Fruehe says that AMD will be able to get six-core and eight-core Bulldozer chips in the 30 to 40 watt power range, which is pretty low for a server. "The question is this," says Fruehe. "Is there a need for a more discrete, less-threaded chip for servers?"
Tbh I don't see ANY reason why AMD can't release 3 module (6 core) Bulldozer units besides product portfolio choices.
Each module is independent and the design would permit this choice fairly easily, I believe.
Plus there will always be 4 module chips with 1 defective module to disable ;)
I'm just wondering how Bulldozer can manage such low power with 8 cores.
Finally, JF answered me. :D
http://www.amdzone.com/phpbb3/viewto...37878&start=50
I actually have a meeting on this at lunch. It's actually a pretty cool discussion. However, don't expect a detailed discussion until much later.
I believe the next round of "20 questions" answers (some time next week) will tackle part of this. But there are some pretty cool SW things to consider.
That is all I can say.
So do we have a definitive/official response as to whether or not Bulldozer will be compatible with current AM3 motherboards? This thread grows too fast for me to follow :p
I would not give AMD a hard time for needing a new mobo; I would rather they design the CPU and then the mobo to suit it, rather than making design trade-offs to accommodate the older mobo tech.
It worked, we know exactly where to plant the bug. Now I just need an inside man, a tracking bug (maybe a baby monitor?), and a black SUV with tinted windows parked in the grass just 20 ft away from the room. Sounds easy. Quote:
I think the fact that there is a waiter in the conference room at AMD would be enough of a tip
Sure, a low-voltage chip can do that. At 3 GHz? Unlikely.
No, I will not read that article. I don't need someone to tell me the sky is blue either.
The only way to increase ALU utilization efficiently is through software; using hardware to improve this is generally inefficient. GPUs are a good example: it is harder to exploit data parallelism, but the majority of transistors in GPUs are SRAM in register files/caches, plus ALUs. That is about as lean as you can get.
16KB L1D (down from 64KB in K10.5), only 2 ALUs per integer core (down from 3 in K10.5), a longer pipeline... compensated by some op fusion and better prefetching & branch prediction, but it's pretty clear now that single-threaded (and therefore low-threaded) integer IPC is NOT going to make a huge jump from K10.5. I'm not sure it will even match Nehalem; it might be close. SB should be well in the clear.
The only thing I'm wondering with regards to AM3/3+ compatibility of bulldozer, are there no slides regarding it? It seems like everything is word of mouth.
I was quick to believe what Opteron146 quoted, but now that I think of it, there aren't any official slides regarding the platform yet. Not all info was released yesterday (or today, for those still on the 24th).
- Anandtech Quote:
In many ways the architecture looks to be on-par with what Intel has done with Nehalem/Westmere. We finally have a wider front end, branch fusion, power gating/true turbo and more aggressive prefetching. Whether or not AMD can surpass Sandy Bridge’s performance really boils down to how many Bulldozer modules you get at what price. If 2 module (4-core) Bulldozer CPUs go up against dual-core Sandy Bridge things could get very interesting.
With all the delays, a (late?) 2011 Bulldozer, and now this: BD is looking more and more like Microsoft Vista. Hope I'm wrong.
It's more interesting that they slimmed down the integer cores to 2 ALUs/2 AGUs. This is a big change from the 3 ALUs/3 AGUs that AMD has used for K7/K8/K10. Combine this with the four-wide front end/decoder and it is clear that BD is a throughput design optimized for server-type workloads.
It looks like AMD concluded that it can't match Intel in single core performance and it would always have to throw 2 cores at every Intel core. It sat down and looked carefully at how much sharing and trimming it could do to make 2 of its cores closer in area and power to 1 Intel Nehalem/SB core. The biggest moves are sharing the front end and FPU between cores and slimming down the integer cores. This should keep AMD in the game in servers, where it can put two 8-core devices into an MCM and sell it against 8-core SB and 10-core Westmere-EX.
The loss of the third ALU/AGU in BD's integer cores likely won't make a lot of difference to commercial server workloads, which have low ILP and high MLP. The sharing of the FPUs probably won't hurt HPC performance much because with 8 cores per die and 16 cores per package, memory system limitations will tend to dominate.
IMO, where it becomes problematic is building client devices based on the BD architecture. AMD will really have to ratchet up its CPU clock rates to more than make up for the loss in IPC from the slimmed-down integer cores.
http://aceshardware.freeforums.org/a...iew-t1042.html
What are we looking at here, terrace? An opinion from a traditionally anti-AMD guy (DeMone) who has no real perf data at hand? What "losses in IPC" is he talking about?
There's "L3 Cache and NB" and "Integrated Northbridge Controller".
The "NB" in "L3 Cache and NB" is the northbridge component found on AMD processors since K8. The "Northbridge Controller" is the IOMMU (what we're used to referring to as northbridge, such as AMD 790FX) integrated into the processor die.
That's like saying that with K8 AMD concluded that it couldn't match the higher CPU clocks of NetBurst and that it would always have to throw more instructions per clock at Intel's MHz.
Crysis 2 will support 8 cores, good luck with the single thread performance thing.
The size of the cache isn't everything; it might be quicker now, maybe 8-way or more associative? And the L1 takes up a pretty large die area on all K7 derivatives, so the space might be used more efficiently now. And as they said, you could quite easily make up for the smaller cache size.
The pipeline part might be a good thing: Athlon 64 has a longer pipeline than K7, Core 2 has a longer pipe than A64, and Nehalem has an even longer one; only Prescott has a longer pipe than Nehalem among x86 cores. Now they might have evened out the execution to be more streamlined, and these stages could be related to prefetch, thus increasing efficiency.
And when it comes to pipes, Phenom II has 3 pipes, which means 3 ALU or 3 AGU ops, while Bulldozer has 4 pipes, which translates to 2 ALU and 2 AGU ops. That means Bulldozer will most likely have quicker execution than Phenom II, and might even save die space at the same time.
So even without the branch prediction, prefetch and op fusion improvements, Bulldozer might be faster. Now consider the very aggressive prefetch, which translates to better use of available bandwidth, the enhanced memory controller, dual 128-bit FPUs, and possibly higher clock speeds.
Bulldozer might be an insane performer; we simply have too little knowledge to judge at this point.
Honestly, who cares about single-thread performance? Geez, if people cared about it, they would not even be buying dual-cores. The whole nature of multiple cores is multi-threading; anything else is fail!
techpowerup fail
"At the chip-level, there's a large L3 cache, a northbridge that integrates the PCI-Express root complex, and an integrated memory controller. Since the northbridge is completely on the chip, the processor does not need to deal with the rest of the system with a HyperTransport link. It connects to the chipset (which is now relegated to a southbridge, much like Intel's Ibex Peak), using A-Link Express, which like DMI, is essentially a PCI-Express link. It is important to note that all modules and extra-modular components are present on the same piece of silicon die. Because of this design change, Bulldozer processors will come in totally new packages that are not backwards compatible with older AMD sockets such as AM3 or AM2(+). "
:rofl:
official slide from last year's Analyst Day:
http://prohardver.hu/dl/upc/2009-11/...road.thumb.png
http://prohardver.hu/dl/upc/2009-12/13063_amd_rmap.png
it does have some purpose
Imagine if we hear AMD say that at 4 threads you can get 75% of the performance you get with 8 threads on BD: either it sounds like core scaling sucks, or it means that when one thread has a BD module all to itself, the bonuses are astonishing. This is very important for things that do have multi-CPU support but still have one thread limiting the others. Like in games: one thread is at 100% usage while the other 2 may be at 40-70% usage. So if you can give your most resource-hogging thread more performance (like turboing it), it can run at up to 120%, letting the other threads reach 50-90% usage.
Getting the most power out of a CPU at 1, 2, 3, 4, 5, 6, 7, or all threads within a given TDP is what we are noticing both the blue and green teams are trying to accomplish, and in the end it means a WELL ROUNDED CPU.
Just look at laptops: if you buy a quad you get 2 GHz, if you buy a dual you get 3 GHz. With proper features the quad should ALWAYS be faster than the dual, even when just 2 threads are needed. And hopefully we will be there soon enough.
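A toy model of that "boost the hog thread" idea, as a quick Python sketch (purely illustrative: the per-thread load figures and the 20% turbo bump are assumptions pulled from the numbers in the post above, not anything AMD has published):

```python
# Toy model: a game limited by one heavy thread. If the frame time is set by
# the slowest thread, turboing the core that runs the hog thread lets the
# lighter threads (which were waiting on it) do proportionally more per frame.

def frame_rate(work_units, clocks):
    """Relative frame rate: the slowest (work / clock) pair sets the frame time."""
    frame_time = max(w / c for w, c in zip(work_units, clocks))
    return 1.0 / frame_time

# Work per frame for three game threads: one hog, two lighter helpers.
work = [1.00, 0.60, 0.45]

base  = frame_rate(work, clocks=[1.0, 1.0, 1.0])   # no turbo
turbo = frame_rate(work, clocks=[1.2, 1.0, 1.0])   # hog core turboed by 20%

print(f"frame rate gain from turboing the hog thread: {turbo / base:.2f}x")
# -> 1.20x, because the hog thread alone was setting the frame time
```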
Single-thread performance --> low-thread performance, due to the nature of 2-cored modules and Intel's 2-way HT.
So you are essentially saying "wtf cares about 4-threaded performance?!?!"
Well, that encompasses the vast majority of what you'll be doing on the desktop most of the time, so, the answer is: most people.
In that case, you'll have the 8-cored SB 2011 monster able to run one thread per core, while the 8-core Zambezi runs in full module-based-resource sharing mode. The result will not be pretty.
Again, single thread performance really translates as N-threaded performance, where N is either the number of cores in an Intel device, or the number of modules in a BD device. Because up until those thresholds, shared resources aren't really having much of an impact, and you are not really constrained much by power, etc.
It's a big mistake to think that single-thread performance doesn't matter. And I don't mean single-threaded programs. Just think about it: what is better in a multithreaded program, 8 fast cores or 8 slow cores? Now, if we're talking about upcoming high-end desktops, that would be 8x 4-wide, 2-thread-per-core-capable SB integer cores vs 8x 2-wide Bulldozer integer cores, and 8x 256-bit SB FPUs vs 4x 256-bit Bulldozer FPUs. Yes, I'm aware of the 16-core Interlagos, but that is probably going to be a multi-module, server-oriented chip.
Read the last post from yuri.cs. Now I don't know the reality about AM3+ vs AM3 :confused::shrug:
http://translate.google.cz/translate...409%23p8041409
Umm, Sandy Bridge uses a double-pumped FPU setup, which means legacy "non-AVX" code will run at 8x128-bit; Bulldozer, with 4 modules / 8 cores, will also run the same.
Can't understand the translation, can you give a summary of the post...
http://www.brightsideofnews.com/Data...t_Core_675.jpg
Bobcat article - http://www.brightsideofnews.com/news...-movement.aspx
AMD Bobcat Core plan: Add an 80-core Cedar GPU and you have - Ontario
80 shaders means it has double the cores of the 4200/3200, and since I have overclocked a 4290 to 900 MHz I know how well it performs. If this 80-core GPU is also clocked around 750-1000 MHz we may find quite some graphical horses under the Bobcat's bonnet... :D
Lost Planet 2 uses 12 threads; at 4.5 GHz it keeps all 12 above 70%.
I was thinking the same for the first 10 seconds. Looks like a fake map on a ship :D
LOL is descriptive of this thread
http://en.wikipedia.org/wiki/Amdahl's_law
A simple example:
A task is 50% parallel, 50% serial.
If I speed up the parallel part by 2x, I increase performance by a factor of 1.33.
If I speed up the parallel part by 10x, I increase performance by a factor of 1.81.
If I speed up the parallel part by 100x, I increase performance by a factor of 1.98.
If I speed up the parallel part infinitely, I increase performance by a factor of 2.
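Here is that arithmetic as a tiny Python sketch (nothing Bulldozer-specific, just the formula from the Wikipedia link applied to the 50/50 split above):

```python
# Amdahl's law: overall speedup = 1 / ((1 - p) + p / s)
# where p is the parallel fraction of the task and s is the speedup of the
# parallel part.

def amdahl(p: float, s: float) -> float:
    """Overall speedup when the parallel fraction p is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

p = 0.5  # 50% parallel, 50% serial, as in the example above
for s in (2, 10, 100, 1_000_000):
    print(f"parallel part {s:>7}x faster -> overall speedup {amdahl(p, s):.3f}x")

# parallel part       2x faster -> overall speedup 1.333x
# parallel part      10x faster -> overall speedup 1.818x
# parallel part     100x faster -> overall speedup 1.980x
# parallel part 1000000x faster -> overall speedup 2.000x
```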
Will BD DROP into current AM3 boards or not?
How do you know? In a true 12-core design, do you think it would be over 70%? Maybe it uses four cores, and the HT threads easily get maxed out. Four busy cores is around 66% of a hexa-core.
My point is, a hexa-core being utilized at 70% can only mean at least 4 cores' worth of work. It could be 12 threads, but there is no way for you to see it, and since it isn't at 100%, it seems like it isn't using all the cores.
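Rough back-of-the-envelope behind that (a sketch only; the 6-core/12-thread CPU and the ~70% figure are just the numbers from the posts above, and treating an HT pair as roughly one core's worth of throughput is a crude simplification):

```python
# Average utilization on a 6-core/12-thread CPU only tells you how many
# cores' worth of work is being done, not how many threads the game spawned.

logical_cpus = 12         # 6 cores with HT, as in the Lost Planet 2 example
avg_utilization = 0.70    # "all 12 above 70%" -> roughly 70% average

# Crude assumption: an HT sibling pair delivers about one core's throughput.
cores_worth_of_work = logical_cpus * avg_utilization / 2
print(f"~{cores_worth_of_work:.1f} cores' worth of work")        # ~4.2

# And the "4 busy cores is ~66% of a hexa-core" figure:
print(f"4 of 6 cores busy -> {4 / 6:.0%} average utilization")   # 67%
```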
Finally, chips from AMD have always been nicely ordered, pointing at a mostly hand-made layout. I can imagine that this leads to uneven power usage and unnecessarily long circuits and timings. And wastes die space.
Lol, it is the exact opposite: hand layout is much better. Humans are better at finding Eulerian paths and coming up with clever layouts; computers can't really do that as effectively with all of the design rules and other parameters. The difference in performance is 2.6-7x faster with custom-designed circuits.
Really, what happens is a coder will simulate his module and make sure it reaches the targeted timing, which is usually much higher than the actual delay to assure robust operation. If the logic can't reach the speed, it is either rewritten or circuit designers optimize it. In certain logic families it must be entirely custom designed.
Circuits that are custom designed are usually things like power gating, clock distribution, and analog circuits such as PLLs, DLLs, and memory controllers/IO pads.
The L1D dropping to a quarter of the size could possibly indicate two things: an inclusive cache hierarchy and, in rare cases, slightly higher clocks.
Deeper pipelines could then further increase the clocks and simplify the design. I believe an inclusive cache hierarchy would more than compensate for the loss in IPC, and if the BP is better, it could possibly even reduce the time spent in stalls vs. K10, with its weaker BP but shorter pipelines.
...or then... The cache hierarchy remains exclusive and slow. Any benefit of an exclusive cache is lost because of the smaller L1 compared to the L2, and L2 compared to L3. The deeper pipeline reduces IPC but doesn't allow much higher clocks to compensate for it. The aggressive BP isn't good enough to compensate for the deeper pipeline. More L1D misses. Speculative execution leads to wrong decisions and L1I misses too often, greatly worsening IPC in some cases while improving it only marginally in most cases. ...and GF messes up 32nm High-K SOI, effectively ruining the benefit of more cores in the hope of better yields.
An inclusive cache hierarchy would mean that AMD could close the gap between K10 and Nehalem cache performance, and bring BD to SB levels of cache performance.
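For what it's worth, here is the rough capacity arithmetic behind the exclusive-vs-inclusive argument as a Python sketch (all sizes below are placeholder numbers picked for illustration; Bulldozer's actual hierarchy and policy were not confirmed at this point):

```python
# Rough effective-capacity comparison of exclusive vs. inclusive hierarchies.
# Exclusive: a line lives in only one level, so unique data ~ L1 + L2 + L3.
# Inclusive: the L3 duplicates everything in L1/L2, so unique data ~ L3 only.

KB, MB = 1024, 1024 * 1024

# Placeholder sizes for a 4-core chip (illustration only, not real BD specs)
l1d_per_core = 16 * KB
l2_per_core  = 512 * KB
l3_shared    = 8 * MB
cores        = 4

exclusive_capacity = cores * (l1d_per_core + l2_per_core) + l3_shared
inclusive_capacity = l3_shared  # L3 already holds copies of the L1/L2 data

print(f"exclusive: {exclusive_capacity / MB:.2f} MiB of unique data")  # 10.06
print(f"inclusive: {inclusive_capacity / MB:.2f} MiB of unique data")  #  8.00
# The exclusive gain shrinks as the L3 dwarfs L1+L2, which is part of the
# point above: a small L1D gives up little by going inclusive, while the L3
# can then act as a snoop filter and simplify coherence.
```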
AMD says (http://www.youtube.com/watch?v=VIs1CxuUrpc)
"Synthesizable with small number of custom arrays"
Together with what was said before, I think one of the main goals AMD wants to achieve is easily customizable processors: add a GPU core here, some cache there and another core here. From the slide it looks like a lot of their design is already capable of being laid out by a computer.
We have the caches, the integer units and the floating point units as the fixed, hand-optimized blocks, with stuff like the x86 decode organically filling up the space in between. AMD also says it makes it easier to put the whole thing on a different process.
I have only limited knowledge of modern synthesis and floorplanning, from working with some FPGAs.
Maybe Hans or somebody in the industry can say something about Bobcat?
Only an hour and a half until the other NDA on the presentation slides drops.
I hope they'll talk about platforms (both server and desktop) on the second wave of slides.
Another deck of slides, or a good part of them, is up at a certain (well-)known website that deals with hardware inner things :D. I don't want to say which one since the NDA is not over yet (1 hour remaining), so there is a chance it gets pulled :). If you understood my word play you will know which website it is ;).
edit:
Since terrace asked me on the SA forum about the load/store BW, here it is, confirmed: 2x128-bit load and 1x128-bit store capability per core (the slide talks about dedicated integer cores and distinct/non-shared features). This is from the other slide deck with more detailed info on Bulldozer. The thing I noticed about the shared FPU is the dual 128-bit packed integer pipelines, along with the 2 128-bit FMACs that sit inside the FPU monster.
Well, that doesn't actually go as far as saying you can do all 3 (2 loads, 1 store) at the same time...
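Just to put those 128-bit widths into perspective, a quick peak-number calculation; it assumes all three accesses really can issue in the same cycle (which, as noted above, the slide doesn't explicitly confirm), and the 3.5 GHz clock is a purely hypothetical figure:

```python
# Peak L1 data bandwidth per integer core from the quoted widths:
# 2 x 128-bit loads + 1 x 128-bit store per cycle (if all can issue together).

clock_ghz   = 3.5             # hypothetical clock, just to put a scale on it
load_bytes  = 2 * 128 // 8    # 32 bytes/cycle of loads
store_bytes = 1 * 128 // 8    # 16 bytes/cycle of stores

print(f"peak load bandwidth : {load_bytes  * clock_ghz:.0f} GB/s per core")  # 112
print(f"peak store bandwidth: {store_bytes * clock_ghz:.0f} GB/s per core")  #  56
```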
There's now info up at tech report, too.
1) JF told you at the other thread that IPC is higher.
2) If that's true then the higher frequency design comes on top of that.
3) And last but not least: power-gated turbo now allows much higher single-core frequencies.
Looks like a 1-2-3 speed bump for single thread performance to me....
Regards, Hans
Its AMD, you got to have faith they'll give Intel something to think about ;):up:
I love the under dog:hump:
I like how AMD is combating Intel's HT. I'm looking forward to both Bobcat and Bulldozer, although I still wonder how AMD is going to compete in the mobile market against Intel's Atom.
Not exactly sure which context you mean when you say they "do not floorplan", but they definitely allow floorplanning at some level. The first step of synthesis, RTL -> Netlist, doesn't floorplan (it just cares about standard cell usage & timing estimates/constraints), if that's what you're trying to get at. However, the second step of synthesis, Netlist -> Placement (placement tool), definitely does floorplanning.
Tools like Cadence Encounter take floorplan constraints and allow for partitioning sub-modules, however like the picture above shows the results tend to look like a jumbled mess, since strict boundaries aren't adhered to.
Well, historically [x86] chips from both camps have always been mainly custom layout in the datapath with a varying amount of synthesized control logic, seeing pictures like this is a bit of an eye opener from the norm :D
Another example is Intel's Pine-Trail (bigger):
http://www.intel.com/pressroom/enhan...netrail_06.jpg
The huge purple blotch running down the middle is all synthesized logic :yepp:
You pretty much sum up my thoughts on the matter, it looks like they shot for a semi-custom approach by supplying some of the main datapath logic (not necessarily say the whole FPU, etc., just the important chunks) and the arrays as hard-macros/external-IP (in- or out-of-house, doesn't matter) while synthesizing the rest.
While they're definitely not unique in the approach, it will certainly provide a quicker process adaptation, since only a standard cell library and select logic/array-IP pieces would technically be necessary. Granted there's still a bit more work than just swapping libraries/IP and pressing a few buttons :p:
Just pointing out that while we humans can definitely be more adept at coming up with these clever (sometimes novel) solutions to optimizing layout area/timing/congestion constraints, it's also a significant capital and time investment, so it's for ROI and time-to-market reasons that it doesn't always work out. The case of Bobcat is obviously an example of this, and Atom for that matter.
Honestly, I would find the logically optimal Euler path to be much easier for a computer to solve :D
But yes computerized tools aren't very good when it comes to balancing the plethora of added constraints in a physical world, hence us restricting them to sub-optimal standard cells + wiring constraints.
You say that so much, and I think it's easier to just visit another forum, not get so upset, ignore the posts, or even just set certain users to ignore. Notice it's not a big deal to anyone else. The Shintai era is over, dude.
Some people don't like AMD and you'll just have to deal with it. Personally, I am more frustrated with them (AMD) than anything else. I have high standards, especially for big companies, and yes, AMD is a big semiconductor company.
Another nice example is the 1.9W TDP, 2 GHz hard-macro version of the dual-core ARM Cortex-A9 in the TSMC 40G process (total size of only 6.7 mm²).
http://www.arm.com/products/CPUs/Cor...ard-Macro.html
http://www.arm.com/images/A9-osprey-hres.jpg
Regards, Hans
No, it's because Intel is raising their turbo clock by 66 MHz, from 3733 MHz in the 45 nm i7 880 to 3800 MHz in the 32 nm i7 2600. :yepp::yepp::yepp::yepp:
If Intel can only raise it by 66 MHz, then AMD should only be able to raise it by less than that, right? It's fundamentally impossible that AMD can do ANYTHING better, according to some people. :D:D
Yet, they keep on posting and bashing in threads about things they hate.. I mean, what's the point?
Maybe I should join a Britney Spears forum and see what being negative in a forum is all about, because I honestly don't know.
why Britney spears? there's nothing wrong with Britney!.....LEAVE BRITNEY ALONE!!!
http://www.youtube.com/watch?v=kHmvkRoEowc