AMD's Eight-Core Bulldozer Processors to Get Massive Cache - Documents.

**mongoled** · 10-25-2010, 10:04 PM

Originally Posted by savantu

Hobby #3. ( IT in general )

How can searching for info be negative? If I want to know when BD is being launched and my previous research pointed to H2 2011, why is it perceived as negativity towards AMD when I say so? Simply because I do not believe in the Q1 fairy tales so many here are clinging to?

I'm not part of the "cheer up" crowd. I do not perceive companies as beings which need encouragement, patting on the back and faith in their future.
I look at the state of things, financial and operational performance, discrepancy between management messages and operations and so on.

I stand by what I write. If I occasionally edit my posts, it is to temper the ironic/vitriolic content. Members of this forum are particularly sensitive to that. :P I am more a fan of heated debates.

Thanks for your answers, I am not going to derail this thread anymore.

**Dresdenboy** · 10-25-2010, 11:05 PM

To make estimates, guesses or expectations regarding launch dates a bit less prone to attack, they could simply be given an error margin (the scientific way). So if I'm expecting a CPU to launch in Q2, but am not that sure about it, I could say first half or middle of that year, or simply Q2 +/- one quarter. If it finally turned out to be Q1, Q2 or Q3, I wouldn't have been wrong.

Sure, if someone is already expecting a certain month, then naming a quarter could be a way to add an error margin to it. But I don't think there are many posters besides John who are already working with month granularity in their minds regarding BD or Llano.

@John:
Will there be any microarchitectural updates on BD at the Financial Analyst Day?

I'm currently collecting data supporting the assumption that BD might not use the same clock in different parts of a module. This seems to be an interesting concept.

**madcho** · 10-26-2010, 01:00 AM

JF posted something new on his blog:

http://blogs.amd.com/work/2010/10/25/the-new-flex-fp/

Nothing really new, but interesting reading.

**Calmatory** · 10-26-2010, 01:26 AM

Damn, I need BD now. Blazing fast software rasterization here I come!

**informal** · 10-26-2010, 01:30 AM

I think that with FMA-optimized software, 2P Interlagos will be one uber fast number-crunching machine (compiler support should be there by the time it launches).

**JF-AMD** · 10-26-2010, 03:10 AM

Matthias, there will be some analyst day updates. I expect possibly one or two nuggets that we can share. There is data in the slides, but it obviously needs to get through the final executive reviews.

This is financial analysts, not industry analysts, so don't expect any deep technical discussions, they are more interested in business discussions.

**Dresdenboy** · 10-27-2010, 02:31 PM

Originally Posted by JF-AMD

Matthias, there will be some analyst day updates. I expect possibly one or two nuggets that we can share. There is data in the slides, but it obviously needs to get through the final executive reviews.

This is financial analysts, not industry analysts, so don't expect any deep technical discussions, they are more interested in business discussions.

No prob. You could still write additional blogs like the one about Flex FP.

Some interesting BD tech is still behind the curtain. But I can understand why you don't want to share info about it. Sometimes I have thought that by speculating in multiple directions I am actually camouflaging the microarchitecture instead of revealing it.

I listed your Flex FP blog here with other neat stuff:
http://citavia.blog.de/2010/10/26/mo...lysis-9794436/

**qcmadness** · 10-27-2010, 08:15 PM

Originally Posted by JF-AMD

Matthias, there will be some analyst day updates. I expect possibly one or two nuggets that we can share. There is data in the slides, but it obviously needs to get through the final executive reviews.

This is financial analysts, not industry analysts, so don't expect any deep technical discussions, they are more interested in business discussions.

How is GF's 32nm shaping up? On time or delayed?
I think this is vital for Bulldozer and Llano products.

**JF-AMD** · 10-27-2010, 08:39 PM

I don't work for GF so I can't comment on them. Dirk continues to reiterate that BD is still on schedule.

**Sn0wm@n** · 10-27-2010, 09:18 PM

Originally Posted by JF-AMD

Matthias, there will be some analyst day updates. I expect possibly one or two nuggets that we can share. There is data in the slides, but it obviously needs to get through the final executive reviews.

This is financial analysts, not industry analysts, so don't expect any deep technical discussions, they are more interested in business discussions.

Maybe an advance pricing sneak peek?

That surely is a business decision.

Can't wait to see how this Flex FP unit will work out,

plus loads of other secret architectural changes...

**kuroikenshi** · 10-27-2010, 09:58 PM

sshhhhh! tis a secret

**qcmadness** · 10-27-2010, 11:31 PM

Originally Posted by madcho

JF posted something new on his blog:

http://blogs.amd.com/work/2010/10/25/the-new-flex-fp/

Nothing really new, but interesting reading.

Each Flex FP has its own scheduler; it does not rely on the integer scheduler to schedule FP commands, nor does it take integer resources to schedule 256-bit executions.

But don't L/S and retirement require the INT cores?

**JF-AMD** · 10-27-2010, 11:45 PM

Well, for a Bulldozer module you have 3 schedulers, (I am going off memory right now), 2 integer schedulers at 40 entries each and an FP scheduler at 60 entries. So you have 140 total entries for 2 integer threads and 1 FP thread.

If you look at SB, they have 1 scheduler that has to handle 1 thread (1 hyperthread) and 1 FP. I believe they only have 54 entries on that scheduler (Based on Real World Tech article).
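As a rough sketch, the arithmetic behind those figures can be written out as below. The entry counts are the ones quoted in this post (from memory, so treat them as unconfirmed). The function names and the per-thread split are only illustrative assumptions, not an official way of comparing the two designs.

```python
# Entry counts as quoted in the thread; JF-AMD was "going off memory".
BD_INT_SCHEDULER_ENTRIES = 40  # one integer scheduler per core, two cores per module
BD_FP_SCHEDULER_ENTRIES = 60   # one shared Flex FP scheduler per module
SB_UNIFIED_ENTRIES = 54        # one unified scheduler per SB core (Real World Tech figure)


def bd_module_entries(int_cores=2):
    """Total scheduler entries in one Bulldozer module (hypothetical helper)."""
    return int_cores * BD_INT_SCHEDULER_ENTRIES + BD_FP_SCHEDULER_ENTRIES


def entries_per_thread(total_entries, threads):
    """Naive even split of entries across threads; an assumption for illustration."""
    return total_entries / threads


module_total = bd_module_entries()
print(module_total)                                     # 140
print(entries_per_thread(module_total, 2))              # 70.0
print(entries_per_thread(SB_UNIFIED_ENTRIES, 2))        # 27.0
```

Note that the even split is a simplification: a unified scheduler can hand all of its entries to whichever thread needs them, which a raw count does not show.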

**madcho** · 10-28-2010, 12:01 AM

What about Phenom II?

**Dresdenboy** · 10-28-2010, 12:36 AM

Originally Posted by qcmadness

But don't L/S and retirement require the INT cores?

The int cores could each retire up to 4 instructions per cycle, incl. the FP instructions. And while the FPU has a combined 2R/1W mem throughput via the int cores' LSUs, each core's 2R/1W (4R/2W in total) should be enough to handle that.

**kl0012** · 10-28-2010, 12:46 AM

Originally Posted by JF-AMD

Well, for a Bulldozer module you have 3 schedulers, (I am going off memory right now), 2 integer schedulers at 40 entries each and an FP scheduler at 60 entries. So you have 140 total entries for 2 integer threads and 1 FP thread.

If you look at SB, they have 1 scheduler that has to handle 1 thread (1 hyperthread) and 1 FP. I believe they only have 54 entries on that scheduler (Based on Real World Tech article).

It makes sense for Bulldozer to have bigger schedulers. According to Dresdenboy, Bulldozer has longer instruction latencies, so each instruction may wait longer in the scheduler queue for an available execution slot (depending on the actual instruction throughput). On the other hand, SB has 168 slots in its uop reorder buffer vs. 128 slots in each Bulldozer core. So depending on how you position Bulldozer vs. SB (core vs. core or module vs. core), it is a bit better or a bit worse. Also, separate FP/INT schedulers are nothing new for AMD, which has stuck with this approach since K7. So there's no clear winner, since Intel's unified scheduler has proven very effective.

**informal** · 10-28-2010, 02:05 AM

Originally Posted by kl0012

It makes sense for Bulldozer to have bigger schedulers. According to Dresdenboy, Bulldozer has longer instruction latencies, so each instruction may wait longer in the scheduler queue for an available execution slot (depending on the actual instruction throughput). On the other hand, SB has 168 slots in its uop reorder buffer vs. 128 slots in each Bulldozer core. So depending on how you position Bulldozer vs. SB (core vs. core or module vs. core), it is a bit better or a bit worse. Also, separate FP/INT schedulers are nothing new for AMD, which has stuck with this approach since K7. So there's no clear winner, since Intel's unified scheduler has proven very effective.

AMD is switching to a unified integer scheduler per core. This will allow greater flexibility when executing integer instructions compared to K8/10h.
As for instruction latencies, yes, they are longer, but BD has other ways of masking this, plus this is a hint about the possible high frequency targets for BD cores.

**Dresdenboy** · 10-28-2010, 02:41 AM

Originally Posted by kl0012

It makes sense for Bulldozer to have bigger schedulers. According to Dresdenboy, Bulldozer has longer instruction latencies, so each instruction may wait longer in the scheduler queue for an available execution slot (depending on the actual instruction throughput). On the other hand, SB has 168 slots in its uop reorder buffer vs. 128 slots in each Bulldozer core. So depending on how you position Bulldozer vs. SB (core vs. core or module vs. core), it is a bit better or a bit worse. Also, separate FP/INT schedulers are nothing new for AMD, which has stuck with this approach since K7. So there's no clear winner, since Intel's unified scheduler has proven very effective.

The number of scheduler entries also depends on the clock frequency target (FO4 pipeline stage delay). This could be a "knee of the curve" type optimization in Bulldozer: less FO4 increases latencies but limits scheduler entries.

OTOH an average scheduler size only influences overall performance by a small amount. It depends much more on branch prediction, memory prefetches (no data - nothing to schedule), mem subsystem efficiency, front end, etc.

You could try the simple scheduler simulation available on my blog. I'll soon update it with one having different instruction latencies.
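For readers who want to play with the idea without the blog tool, here is a minimal toy model. It is not the simulation mentioned above, just a sketch under stated assumptions: instructions enter the scheduler window in program order, an entry is freed at issue, each instruction has at most one dependency on an earlier instruction, and the oldest ready instructions issue first. The name `simulate` and all parameters are hypothetical.

```python
def simulate(latencies, deps, window_size, issue_width=1):
    """Return the cycle at which the last result is available.

    latencies[i]: latency of instruction i in cycles (must be >= 1)
    deps[i]:      index of an earlier instruction that i waits for, or None
    """
    n = len(latencies)
    if n == 0:
        return 0
    done = [None] * n  # cycle at which each result becomes available
    window = []        # fetched but not yet issued, oldest first
    next_fetch = 0
    issued = 0
    cycle = 0
    while issued < n:
        # Fill free scheduler entries in program order.
        while len(window) < window_size and next_fetch < n:
            window.append(next_fetch)
            next_fetch += 1
        # An instruction is ready once its producer's result is available.
        ready = [i for i in window
                 if deps[i] is None
                 or (done[deps[i]] is not None and done[deps[i]] <= cycle)]
        # Issue the oldest ready instructions, up to the issue width.
        for i in ready[:issue_width]:
            done[i] = cycle + latencies[i]
            window.remove(i)
            issued += 1
        cycle += 1
    return max(done)


# A 10-cycle load, one dependent op, then three independent ops.
latencies = [10, 1, 1, 1, 1]
deps = [None, 0, None, None, None]
print(simulate(latencies, deps, window_size=1))  # 14
print(simulate(latencies, deps, window_size=2))  # 11
```

With a one-entry window the dependent op blocks everything behind it until the load returns. A second entry is already enough to slip the independent work past the stall in this tiny stream, and making the window larger gains nothing more here.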

**kl0012** · 10-28-2010, 05:42 AM

Originally Posted by informal

AMD is switching to a unified integer scheduler per core. This will allow greater flexibility when executing integer instructions compared to K8/10h.

Yep. And a unified FP/INT scheduler may bring another few percent of performance (at least according to theoretical studies), but I guess AMD won't take this route because of the future APU approach (which aims to replace the traditional FPU).

As for instruction latencies, yes, they are longer, but BD has other ways of masking this, plus this is a hint about the possible high frequency targets for BD cores.

Yep, and the bigger scheduler size is one of the ways to mask the high latency of some instructions. But rather than making performance predictions by pointing to higher-latency instructions (which by itself means nothing without the "big picture"), my point was that schedulers of different architectures are not comparable just by their size. Sometimes bigger does not mean better (just like the P4 with its frequency).

**Dresdenboy** · 10-28-2010, 06:25 AM

Originally Posted by kl0012

Yep. And a unified FP/INT scheduler may bring another few percent of performance (at least according to theoretical studies), but I guess AMD won't take this route because of the future APU approach (which aims to replace the traditional FPU).

My "big picture" looks more like there will first be an updated BD architecture (BD2) before making bigger changes to the µArch again. There currently is no need for making any circuits in the critical path prepared for some future changes. This would just cost time, performance, area, verification effort etc.

In this big picture we should also stop thinking in the old ways. Why does a scheduler have to be big and cover most stalls? In the past this was because there was a fixed power budget. Not making use of it while stalling simply reduced performance. Now you could switch off (clock gate) the core during a stall. This saves power. The saved power could be turned into work/performance at another place, even at another time. Overall performance is not lost, so the scheduler window size is OK.
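The power argument can be put into a back-of-the-envelope sketch. All numbers below are made-up arbitrary units chosen only to show the bookkeeping. They are not measurements of any real core, and the function name is hypothetical.

```python
def energy(active_cycles, stall_cycles, active_power, stall_power):
    """Energy in arbitrary units: power per cycle summed over active and stalled cycles."""
    return active_cycles * active_power + stall_cycles * stall_power


# Illustrative assumption: 700 busy cycles, 300 stalled cycles,
# 10 units/cycle when running, 1 unit/cycle when clock gated.
ungated = energy(700, 300, active_power=10, stall_power=10)
gated = energy(700, 300, active_power=10, stall_power=1)
saved = ungated - gated

print(ungated)      # 10000
print(gated)        # 7300
print(saved)        # 2700
print(saved // 10)  # 270 extra active cycles the saved budget could pay for elsewhere
```

In this toy accounting the stall cycles still cost time on the gated core, but the energy they no longer burn is budget that can be spent on useful work at another place or time.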

Another question is, at which point it will be useful to bring shader-like computing resources closer to the GP (x86) core. One way could be to add APU resources as additional cluster(s), fed by a common front end. Then they would have their own scheduling etc. AMD has patents for decoding mixed ISA streams. So they could build a decoder, which directs decoded or even translated instruction packets to their appropriate targets. There could further be some instruction buffers/execution caches etc.

**god_43** · 10-28-2010, 08:31 AM

Originally Posted by kl0012

Sometimes bigger does not mean better.

thats what he said......
