Dresdenboys' blog: AMD Bulldozer - Patent based research part 2

**~~terrace215~~** · 04-21-2010, 03:18 PM

Originally Posted by informal

They have had a more than minimally functional Bobcat for a while now, and you only heard about it a few days ago. And Bobcat is not as big (in every possible sense of the word) as BD will be.

Agreed, bobcat is about 1/10th the complexity of BD.

Not sure about "a while now". In the best case interpretation of the CC remarks (that they concerned llano + bobcat as opposed to llano desktop + llano mobile), we know they had internal samples of bobcat at the time of the CC, from which they are "learning a lot." They did not say how long they have had such samples, at least not in the CC.

**Chumbucket843** · 04-21-2010, 03:55 PM

Originally Posted by qcmadness

but then it could not explain the lower DDR-2 memory bandwidth when comparing K10 with K8?

that could be a number of things. informal mentioned the memory controller which is a likely cause.

**nn_step** · 04-21-2010, 08:47 PM

Originally Posted by Chumbucket843

the concept that ILP is what is going to give bulldozer its advantage is wrong. ILP has been mined out; you would be wasting transistors trying to improve it. it's perf/mm2 and memory bandwidth.
http://www.bloobble.com/broadband-pr...ns?itemid=2763
see slide 6.

Technically both true and false.

For the standard case, you are absolutely correct, but there still exist more than a dozen cases for which a couple of orders of magnitude better performance can be had. [Cryptography for example]

As for integer code, you are almost correct; approximately 10-15% better performance is theoretically possible for 64bit code [32bit code has only 1.3% remaining]

**haylui** · 04-21-2010, 09:17 PM

Originally Posted by qcmadness

The problem is the design of L3 cache.

Since L3 cache is slower than dual-channel DDR-3, memory bandwidth is limited to L3 cache speed.

L3 cache is slower than Dual-DDR3?

**haylui** · 04-21-2010, 09:22 PM

Originally Posted by Manicdan

ok, so because AM2 is dual channel, we had to make AM3 dual channel, and now AM3+ is also going to be only dual channel. that's pretty sucky.

i understand backwards compatibility is nice and all, but BD being limited because of the socket used by Athlon X2s sounds like it's going to hurt more than help. those 2 extra channels could be 5-10% more IPC

i hope they split BD across 2 sockets, current and quad channel, and let us decide if we want a $300 CPU put into a 2 year old motherboard, or buy a new mobo and CPU at the same time for that extra perf

perhaps 2%?
extra memory bandwidth doesn't boost IPC a lot,
since the L1 and L2 caches are fast enough to feed the processing units.
compare i5 and i7 and you will see whether triple channel is much faster than dual channel
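
For reference, the theoretical peak numbers behind the channel-count argument are easy to work out: each DDR3 channel is 64 bits (8 bytes) wide, so peak bandwidth is transfer rate times 8 bytes times the number of channels. The sketch below assumes DDR3-1333 as the speed grade, which is an illustrative choice and not something stated in this thread, and the function name is made up for the example. It says nothing about sustained bandwidth or about how much of it a core can actually use.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, channels, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).

    bus_bytes=8 reflects the 64-bit data bus of one DDR3 channel.
    """
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


# DDR3-1333 is an assumed speed grade, used only for illustration.
for channels in (2, 3, 4):
    gbs = peak_bandwidth_gbs(1333, channels)
    print(f"{channels} channels of DDR3-1333: {gbs:.1f} GB/s peak")
```

Going from two to four channels doubles the peak figure, but whether that shows up as IPC depends on how often the workload misses the caches, which is exactly the point being argued here.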

**haylui** · 04-21-2010, 09:30 PM

Originally Posted by xlink

looks like I won't be going for AMD this gen unless BD is truly exceptional.

shame, I was hoping for a drop in replacement. ohh well.

according to your spec, just drop in a 1090T BE in 2011 and it would be fine....

**LowRun** · 04-22-2010, 02:39 AM

Originally Posted by haylui

L3 cache is slower than Dual-DDR3?

Originally Posted by haylui

perhaps 2%?
extra memory bandwidth doesn't boost IPC a lot,
since the L1 and L2 caches are fast enough to feed the processing units.
compare i5 and i7 and you will see whether triple channel is much faster than dual channel

Originally Posted by haylui

according to your spec, just drop in a 1090T BE in 2011 and it would be fine....

There is a multi-quote button for a good reason dude.

**Manicdan** · 04-22-2010, 06:05 AM

Originally Posted by haylui

perhaps 2%?
extra memory bandwidth doesn't boost IPC a lot,
since the L1 and L2 caches are fast enough to feed the processing units.
compare i5 and i7 and you will see whether triple channel is much faster than dual channel

bold the word could next time, cause some apps love extra ram speeds, some like timings, the fact is 5-10% better perf can easily turn into massive profit margins depending on how close these are to the competition.

this is extreme systems, and quad channel desktop memory is quite extreme id say.

**vietthanhpro** · 04-22-2010, 03:39 PM

I can't access his blog.

**god_43** · 04-22-2010, 04:26 PM

Originally Posted by vietthanhpro

I can't access his blog.

really? i just tried and it worked for me?

here is a link to the latest entry.

http://citavia.blog.de/2010/04/22/pr...lated-8429143/

**haylui** · 04-22-2010, 05:12 PM

Originally Posted by Manicdan

bold the word could next time, cause some apps love extra ram speeds, some like timings, the fact is 5-10% better perf can easily turn into massive profit margins depending on how close these are to the competition.

this is extreme systems, and quad channel desktop memory is quite extreme id say.

I can't deny the benefit of wider memory bandwidth, but I think AMD addresses that well enough in the server space, on the G32 and G34 sockets.
As for the timing concern, I don't think manufacturers would tune tight timings for a memory system. They would rather use a higher-frequency module to achieve the same performance as a tight-timing module.
On the consumer end, not many apps are memory hungry. Performance mostly depends on an efficient cache system, branch prediction and the number of instructions that can be fetched per cycle.

**amdcian** · 04-22-2010, 05:16 PM

Originally Posted by vietthanhpro

I can't access his blog.

You should use Opera 10 to access this page. Enabling Opera Turbo is required

// Old Thành, are you on Viettel? My Viettel connection at home can't reach the BD blog either; I have to use Opera Turbo

**god_43** · 04-22-2010, 05:25 PM

Originally Posted by amdcian

You should use Opera 10 to access this page. Enabling Opera Turbo is required

// Old Thành, are you on Viettel? My Viettel connection at home can't reach the BD blog either; I have to use Opera Turbo

what??? i am using IE/firefox, both of which work just fine in accessing the blog.

**vietthanhpro** · 04-22-2010, 10:12 PM

Originally Posted by amdcian

You should use Opera 10 to access this page. Enabling Opera Turbo is required

// Old Thành, are you on Viettel? My Viettel connection at home can't reach the BD blog either; I have to use Opera Turbo

VNPT, not Viettel.

Next month, FPT.

1)
Pipe 0 -> multiplier, simple ops (add, subtract, logical)
Pipe 1 -> AGU-like, barrel shifter, branch (both direct & indirect), simple ops
Pipe 2 -> ABM, simple ops
Pipe 3 -> AGU-like, barrel shifter, branch (both types too), simple ops
Pipe1,3: ALU+AGU, not AGU ?
2) I can't see FMISC unit.

Originally Posted by haylui
L3 cache is slower than Dual-DDR3?

-----------------
Source: http://www.freepatentsonline.com/6167503.html
IDU: Instruction distribution unit
RRU: Register renaming unit
IDB: Instruction dispatch buffer
ISC: Instruction scheduling controller
RFBC: Register file/bypass circuit
EU: Execution unit
TSB: Transfer staging buffer

**Dresdenboy** · 04-22-2010, 10:57 PM

Originally Posted by vietthanhpro

I can't access his blog.

http://support.mozilla.com/de/forum/1/653217

**Dresdenboy** · 04-22-2010, 11:58 PM

Originally Posted by vietthanhpro

1)
Pipe 0 -> multiplier, simple ops (add, subtract, logical)
Pipe 1 -> AGU-like, barrel shifter, branch (both direct & indirect), simple ops
Pipe 2 -> ABM, simple ops
Pipe 3 -> AGU-like, barrel shifter, branch (both types too), simple ops
Pipe1,3: ALU+AGU, not AGU ?
2) I can't see FMISC unit.

1) That was a suggestion in a comment by Wireloop. It might work this way, but we don't have any evidence for it. But absence of evidence is no evidence for absence

I now have an even more revolutionary model of the execution units in mind. Please be patient for more details.

2) Do we need a FMISC unit? It might be there but the things handled by the current FMISC unit could be handled by the other FP units as well.
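
To make the quoted pipe assignment easier to reason about, it can be written down as a small lookup table. This is only a sketch of Wireloop's suggested mapping as listed in the quote; the capability labels and the `candidate_pipes` helper are names made up for illustration, and nothing here is confirmed by AMD.

```python
# Hypothetical capability table, transcribed from the suggested pipe layout.
# The string labels are illustrative names, not AMD terminology.
PIPES = {
    0: {"multiply", "simple"},
    1: {"agu", "shift", "branch", "simple"},
    2: {"abm", "simple"},
    3: {"agu", "shift", "branch", "simple"},
}


def candidate_pipes(op_class):
    """Return the pipes that could execute an op of the given class."""
    return sorted(pipe for pipe, caps in PIPES.items() if op_class in caps)


for op_class in ("simple", "multiply", "branch", "abm"):
    print(op_class, "->", candidate_pipes(op_class))
```

Under this mapping only simple ops have all four pipes to choose from; multiplies and ABM ops have exactly one candidate each, and branches, shifts and address generation have two.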

**superrugal** · 04-23-2010, 04:16 AM

What does "RM" stand for, and what effect will it have?

**Dresdenboy** · 04-23-2010, 06:15 AM

Originally Posted by superrugal

What does "RM" stand for, and what effect will it have?

That stands for "Resource Monitor", which could track resource utilization, latencies and the like. You could use them for some adaptions.

**superrugal** · 04-23-2010, 06:59 AM

Originally Posted by Dresdenboy

That stands for "Resource Monitor", which could track resource utilization, latencies and the like. You could use them for some adaptions.

THX for your reply.

"track resource utilization": is that very important for Bulldozer? Has AMD ever made use of this "Resource Monitor" before?

**Dresdenboy** · 04-23-2010, 08:00 AM

Originally Posted by superrugal

THX for your reply.

"track resource utilization": is that very important for Bulldozer? Has AMD ever made use of this "Resource Monitor" before?

Well, it's not even certain that it will be used in BD. See an old blog entry of mine for more details.

**nn_step** · 04-23-2010, 08:02 AM

Originally Posted by Dresdenboy

1) That was a suggestion in a comment by Wireloop. It might work this way, but we don't have any evidence for it. But absence of evidence is no evidence for absence

I now have an even more revolutionary model of the execution units in mind. Please be patient for more details.

2) Do we need a FMISC unit? It might be there but the things handled by the current FMISC unit could be handled by the other FP units as well.

1) lack of disproof is not proof

2) You mean FSTORE, which does not appear to be all that important (it's used for FST(P), FLD(CONST) and "miscellaneous" instructions). But the reason for its existence was that it very cheaply enabled a few extra instructions to be executed in each clock cycle. In those edge cases, the addition of FSTORE effectively gave 50% better performance than if the other two FP units had just handled its work. It effectively meant adding a few more traces, tweaking the scheduler, and adding a few transistors to the die.

What you accidentally stumbled upon is the age-old argument:
whether the ALUs/FPUs should be symmetric and able to execute almost any integer instruction, or not symmetric and slightly more restrictive. Each of the lanes must be nearly identical for distributed schedulers and instruction grouping to work optimally and provide maximum performance. But if you want to save power and die space, non-symmetric lanes with a centralized scheduler will get the job done more efficiently.

Now what AMD did with their FPU with K7 and subsequently K8, and K10 is just keep a centralized scheduler for its FPU because the cost of all the extra transistors was deemed unacceptable.

But given that transistors are cheaper now and as we approach 22nm, if AMD wanted better performance it could now afford to match their integer design and move their FPU to a distributed scheduler, which would enable them to finally reach the last bit of performance remaining in x86 and a large chunk of x86_64's remaining performance.
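
The 50% figure for the FSTORE edge cases can be illustrated with a toy issue model. This is a deliberately simplified sketch, not a model of the real K8 FPU: it assumes fully independent ops, one op per pipe per cycle, an even mix of add, mul and store-type ops, and that in the hypothetical two-pipe design FADD and FMUL could each also take the store-type ops. All names in the code are made up for the example.

```python
from collections import Counter


def cycles_to_drain(op_counts, pipes):
    """Count cycles needed to issue all ops, one op per pipe per cycle.

    op_counts: mapping of op class -> number of independent ops.
    pipes: mapping of pipe name -> list of op classes it accepts,
           in order of preference.
    """
    pending = Counter(op_counts)
    cycles = 0
    while sum(pending.values()) > 0:
        issued = 0
        for accepted in pipes.values():
            for op in accepted:
                if pending[op] > 0:
                    pending[op] -= 1
                    issued += 1
                    break
        if issued == 0:
            raise ValueError("some pending ops cannot run on any pipe")
        cycles += 1
    return cycles


# Assumed edge case: an even mix of the three op classes.
mix = {"add": 100, "mul": 100, "store": 100}

# Three dedicated pipes versus a hypothetical two-pipe design (assumption).
three_pipes = {"FADD": ["add"], "FMUL": ["mul"], "FSTORE": ["store"]}
two_pipes = {"FADD": ["add", "store"], "FMUL": ["mul", "store"]}

with_fstore = cycles_to_drain(mix, three_pipes)
without_fstore = cycles_to_drain(mix, two_pipes)
print(with_fstore, without_fstore, without_fstore / with_fstore)
```

In this model the three-pipe design drains the mix in 100 cycles and the two-pipe design needs 150, which is where a 50% throughput gain comes from. With fewer store-type ops in the mix the gain shrinks, which is why this is an edge case rather than the typical one.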

**Dresdenboy** · 04-23-2010, 09:09 AM

Originally Posted by nn_step

2) You mean FSTORE, which does not appear to be all that important (it's used for FST(P), FLD(CONST) and "miscellaneous" instructions). But the reason for its existence was that it very cheaply enabled a few extra instructions to be executed in each clock cycle. In those edge cases, the addition of FSTORE effectively gave 50% better performance than if the other two FP units had just handled its work. It effectively meant adding a few more traces, tweaking the scheduler, and adding a few transistors to the die.
...
What you accidentally stumbled upon is the age-old argument:
whether the ALUs/FPUs should be symmetric and able to execute almost any integer instruction, or not symmetric and slightly more restrictive. Each of the lanes must be nearly identical for distributed schedulers and instruction grouping to work optimally and provide maximum performance. But if you want to save power and die space, non-symmetric lanes with a centralized scheduler will get the job done more efficiently.

Now what AMD did with their FPU with K7 and subsequently K8, and K10 is just keep a centralized scheduler for its FPU because the cost of all the extra transistors was deemed unacceptable.

But given that transistors are cheaper now and as we approach 22nm, if AMD wanted better performance it could now afford to match their integer design and move their FPU to a distributed scheduler, which would enable them to finally reach the last bit of performance remaining in x86 and a large chunk of x86_64's remaining performance.

AMD also used the term FMISC for the FSTORE unit (e.g. in 25112.pdf).

Several AMD patents already describe the included scheduler (included in the exemplary architecture) to have free choice. But since you have to take care of program order, dependencies, flags, speculative state etc. this might be more difficult having many units to issue to (I think, this has about quadratically increasing complexity). But with only 2 ALUs and only 2 AGUs such things could be resolved easily. Also the suggestion of Wireloop will make things easier, because for many instruction types, there are only one or two units to choose from. But these are details, for which we have to wait until AMD discloses them.

And I think, the BD design is not about cheaper transistors but about more expensive energy consumption (with leakage becoming more important). And thanks to the SMT like execution on BD's FPU, this unit doesn't need to be of Speedy Gonzales' type and can afford to have somewhat longer latencies due to a more power efficient scheduling.
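
One common way to see where roughly quadratic growth comes from is to count result-forwarding paths: if every unit's result can be forwarded to every operand input of every unit, the number of paths grows with the square of the unit count. The sketch below is only that counting argument, with two source operands per unit as an assumption; it is not taken from any AMD patent and ignores everything else a scheduler has to track.

```python
def forwarding_paths(units, operands_per_unit=2):
    """Paths in a full bypass network: every producer to every operand input."""
    return units * units * operands_per_unit


for units in (2, 4, 8):
    print(f"{units} units: {forwarding_paths(units)} forwarding paths")
```

Doubling the number of units quadruples the path count in this model, so keeping each group of units small keeps this part of the problem cheap.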

**nn_step** · 04-23-2010, 07:28 PM

Originally Posted by Dresdenboy

AMD also used the term FMISC for the FSTORE unit (e.g. in 25112.pdf).

Several AMD patents already describe the included scheduler (included in the exemplary architecture) to have free choice. But since you have to take care of program order, dependencies, flags, speculative state etc. this might be more difficult having many units to issue to (I think, this has about quadratically increasing complexity). But with only 2 ALUs and only 2 AGUs such things could be resolved easily. Also the suggestion of Wireloop will make things easier, because for many instruction types, there are only one or two units to choose from. But these are details, for which we have to wait until AMD discloses them.

And I think, the BD design is not about cheaper transistors but about more expensive energy consumption (with leakage becoming more important). And thanks to the SMT like execution on BD's FPU, this unit doesn't need to be of Speedy Gonzales' type and can afford to have somewhat longer latencies due to a more power efficient scheduling.

well given the performance advantages of distributive scheduling, it seems likely that AMD will probably go that route for Bulldozer. [added to the fact that it makes adding more units simple and cheap]

However I honestly am curious if for Bobcat AMD is going to go the other way and use a centralized scheduler to reduce transistors.

**Dresdenboy** · 04-25-2010, 07:42 AM

Originally Posted by vietthanhpro

-----------------
Source: http://www.freepatentsonline.com/6167503.html
IDU: Instruction distribution unit
RRU: Register renaming unit
IDB: Instruction dispatch buffer
ISC: Instruction scheduling controller
RFBC: Register file/bypass circuit
EU: Execution unit
TSB: Transfer staging buffer

Yes, Norman Jouppi had and has nice ideas.

How about David Witt (AMD) in 1998 (year of filing):

http://www.freepatentsonline.com/6119223.html

**Dresdenboy** · 04-28-2010, 10:39 AM

Originally Posted by nn_step

well given the performance advantages of distributive scheduling, it seems likely that AMD will probably go that route for Bulldozer. [added to the fact that it makes adding more units simple and cheap]

However I honestly am curious if for Bobcat AMD is going to go the other way and use a centralized scheduler to reduce transistors.

It looks like we have stronger support for 4 ALUs + 4 AGUs per core now:
http://citavia.blog.de/2010/04/28/an...again-8474038/

And here is a teaser for the other stuff mentioned there:
