Page 28 of 29
Results 676 to 700 of 719

Thread: AMD cuts to the core with 'Bulldozer' Opterons

  1. #676
    Xtreme Enthusiast
    Join Date
    Oct 2008
    Posts
    678
    I have to say that AMD's work on the 128-bit FPU seems to be more of a cut-and-paste job; it could be more space-efficient than it is, though I guess you have to accept that given what an impressive last-minute solution Phenom seems to be. Meanwhile, it's hard to believe that Intel doubled their FPU with only a very minor size increase.
    Speaking of AMD's 128-bit FPU work, that is kind of AMD's thing at the moment: everything they are selling is the same old chip from 1999 with lots of add-ons bolted on. Intel has the money to rearrange the entire core every generation, and with that kind of resources it isn't hard to see why they are currently on top. Their cores are twice as big and a bit better tuned for today's manufacturing capabilities.

    I'm glad that Bulldozer is around the corner.

  2. #677
    Xtreme Member
    Join Date
    Aug 2006
    Posts
    114
    I thought Core was largely derived from the original Pentium family, mainly the P3 mixed with a little of the front end of the P4. Sorry, that's probably a little oversimplified, but that's my basic understanding. No processor design is started from scratch; you take what works and build on it... Bulldozer will be the largest step away from the base Athlon architecture in a long time, since the Athlon 64. Hope it turns out as well as the Athlon 64 did *crosses fingers*
    phenom 2 940 stock
    gskill 4gb 1066 ddr2
    2 1.5Tb seagate hds in raid 0
    30gb ocz core series hd for os
    8800gts 640
    xigamatek 850w ps
    water cooling cpu: dtek fuzion 2, swiftech 320, 3 ultra kazes, d5 with detroit top
    custom acrylic case in progress :P

  3. #678
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    Actually the whole Core generation is based on Yonah, which in turn was the latest souped-up P6 microarchitecture. They did invest massively in the Yonah -> Merom change, as far as the core is concerned (50% bigger core logic on the same process). Nehalem added another ~20% on top of that. The Westmere -> SB change looks pale next to the Penryn -> Nehalem change, and anemic compared to the Yonah -> Merom case.

  4. #679
    Xtreme Enthusiast
    Join Date
    Oct 2008
    Posts
    678
    Quote Originally Posted by jakefalcons View Post
    I thought Core was largely derived from the original Pentium family, mainly the P3 mixed with a little of the front end of the P4. Sorry, that's probably a little oversimplified, but that's my basic understanding. No processor design is started from scratch; you take what works and build on it... Bulldozer will be the largest step away from the base Athlon architecture in a long time, since the Athlon 64. Hope it turns out as well as the Athlon 64 did *crosses fingers*
    Based on and bolted on aren't the same thing. Look at the dies from P6 to Dothan, Yonah, Conroe, Penryn and Bloomfield. Ignore the caches and I/O, just look at the cores, and they're totally different almost every time. Meanwhile a Phenom II core looks just like a K7 with an extra FPU.

    And I don't think Bulldozer will be comparable to the K7 -> K8 step. This is more like K6 -> K7, in other words totally different. I don't even think you'll be able to identify a part of the die that is still the same.

    We might actually know by the end of the month.


    JF-AMD? Any ideas on when we can see die shots?

  5. #680
    Xtreme Member
    Join Date
    Aug 2004
    Posts
    210
    Quote Originally Posted by -Boris- View Post
    I have to say that AMD's work on the 128-bit FPU seems to be more of a cut-and-paste job; it could be more space-efficient than it is, though I guess you have to accept that given what an impressive last-minute solution Phenom seems to be. Meanwhile, it's hard to believe that Intel doubled their FPU with only a very minor size increase.
    No, it can't be more space-efficient. If you replace your 4-cylinder car engine with an 8-cylinder engine, it will need double the space, too (if each cylinder is of similar size).

    It is as simple as that. Double power -> double size.

    However, if you just want more horsepower, you can try to build in a turbocharger/intercooler setup instead of the +4 cylinders. That would be the rough equivalent of Sandy's hyper-pipelined FPU ;-)

  6. #681
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by savantu View Post
    Instead of trying to find some weirdo explanations, we could take his words at face value. The words he uses are pretty straightforward :
    I wouldn't be surprised of some intentional misleading done previously for deceiving the competition.
    Looking at Mark's answer again with a little more time, I see that he refers to using 128-bit-wide EUs (point 1 by Eric) over 2 consecutive clock cycles as double pumping (and thus executing a 256-bit SIMD instruction). This is a different kind of double pumping, using base clock cycles as the smallest unit.

    Ok. Let me cite another Intel employee posting in the same thread (who has been quoted here already, IIRC):
    It seems point 1) may have assumed it requires monolithic 256-bit hardware to achieve 1 cycle throughput for 256-bit AVX instructions. That's not true.
    By "misleading" do you mean the first AVX LO/HI chart? I think there simply wasn't enough time between publishing the first chart and the corrected one for it to have any meaningful effect.

    OTOH saving die area and thus leakage (which would otherwise limit performance) doesn't look like a bad choice. So why are you defending a monolithic implementation so heavily?

    @Hans:
    Thanks for the links.
    Last edited by Dresdenboy; 08-11-2010 at 05:52 AM.
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  7. #682
    Xtreme Enthusiast
    Join Date
    Oct 2008
    Posts
    678
    Quote Originally Posted by Opteron146 View Post
    No, it can't be more space-efficient. If you replace your 4-cylinder car engine with an 8-cylinder engine, it will need double the space, too (if each cylinder is of similar size).

    It is as simple as that. Double power -> double size.

    However, if you just want more horsepower, you can try to build in a turbocharger/intercooler setup instead of the +4 cylinders. That would be the rough equivalent of Sandy's hyper-pipelined FPU ;-)
    Well, some parts are shared, not everything is doubled, so AMD has a chunk of silicon sitting next to the second FPU pretty much unused.

    There is a lot of empty space on AMD's chips nowadays.

  8. #683
    Xtreme Member
    Join Date
    Aug 2004
    Posts
    210
    Quote Originally Posted by -Boris- View Post
    Well, some parts are shared, not everything is doubled, so AMD has a chunk of silicon sitting next to the second FPU pretty much unused.
    The stuff that is not doubled is the 3DNow! legacy area; no need to double it, 'cause there is no need to double 3DNow! ;-)

    But the main pipelines, which use up most of the FPU area, are doubled. That's the important fact.

  9. #684
    Xtreme Addict
    Join Date
    Apr 2007
    Location
    canada
    Posts
    1,886
    Good read over the last 3 pages... keep it up, guys.
    WILL CUDDLE FOR FOOD

    Quote Originally Posted by JF-AMD View Post
    Dual proc client systems are like sex in high school. Everyone talks about it but nobody is really doing it.

  10. #685
    Xtreme Enthusiast
    Join Date
    Oct 2008
    Posts
    678
    Quote Originally Posted by Opteron146 View Post
    The stuff that is not doubled is the 3DNow! legacy area; no need to double it, 'cause there is no need to double 3DNow! ;-)
    What? But I want 128-bit 3DNow! Pro++.

    And there is still wasted space; if they had the resources for a complete overhaul of the layout they would save some, and possibly increase performance.

  11. #686
    Xtreme Enthusiast
    Join Date
    Dec 2009
    Posts
    846
    Quote Originally Posted by Opteron146 View Post
    No, it can't be more space-efficient. If you replace your 4-cylinder car engine with an 8-cylinder engine, it will need double the space, too (if each cylinder is of similar size).

    It is as simple as that. Double power -> double size.

    However, if you just want more horsepower, you can try to build in a turbocharger/intercooler setup instead of the +4 cylinders. That would be the rough equivalent of Sandy's hyper-pipelined FPU ;-)
    Wow, that is pretty much a description of the whole Bulldozer architecture. No need to double everything, only the things that matter.
    While I work for AMD, my posts are my own opinions.

    http://blogs.amd.com/work/author/jfruehe/

  12. #687
    Xtreme Enthusiast
    Join Date
    Dec 2009
    Posts
    846
    Quote Originally Posted by -Boris- View Post

    JF-AMD? Any ideas on when we can see die shots?
    Die shots will not be at Hot Chips; we aren't disclosing those just yet.
    While I work for AMD, my posts are my own opinions.

    http://blogs.amd.com/work/author/jfruehe/

  13. #688
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by Opteron146 View Post
    However, if you just want more horsepower, you can try to build in a turbocharger/intercooler setup instead of the +4 cylinders. That would be the rough equivalent of Sandy's hyper-pipelined FPU ;-)
    Hmm.. with all these design options there are also several ways to implement a decode/dispatch unit for Bulldozer, which might not only decode up to 4 x86 ops per cycle but - if some requirements are met - even up to 8 ops per cycle. I'm thinking of the fast/slow path concept and also double pumping here. A fast path (for simple/double decode ops) could be used twice per base clock cycle if it's that fast.
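    The decode speculation above can be put into a back-of-the-envelope model. Everything here is a hypothetical illustration of the fast/slow-path idea (the 4 decode slots and a fast path usable twice per base clock mirror the post's speculation), not a description of real Bulldozer hardware:

    ```python
    # Speculative model of a fast/slow-path decoder: slots that hit the
    # double-pumped fast path in a base cycle deliver two ops instead of one.

    def decoded_ops_per_cycle(slots, fastpath_fraction):
        # fastpath_fraction: share of decode slots simple enough for the
        # double-pumped fast path in a given base clock cycle.
        return slots * (1.0 + fastpath_fraction)

    print(decoded_ops_per_cycle(4, 0.0))  # 4.0 ops/cycle: slow path only
    print(decoded_ops_per_cycle(4, 1.0))  # 8.0 ops/cycle: all fast path
    ```

    In between, a mix of simple and complex instructions lands somewhere between 4 and 8 ops per base cycle.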
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  14. #689
    Xtreme Addict
    Join Date
    Apr 2007
    Location
    canada
    Posts
    1,886
    Quote Originally Posted by JF-AMD View Post
    Die shots will not be at Hot Chips; we aren't disclosing those just yet.


    I presume die shots would be available closer to the actual release date, am I right?
    WILL CUDDLE FOR FOOD

    Quote Originally Posted by JF-AMD View Post
    Dual proc client systems are like sex in high school. Everyone talks about it but nobody is really doing it.

  15. #690
    Xtreme Enthusiast
    Join Date
    Dec 2009
    Posts
    846
    Quote Originally Posted by Sn0wm@n View Post
    I presume die shots would be available closer to the actual release date, am I right?
    yes
    While I work for AMD, my posts are my own opinions.

    http://blogs.amd.com/work/author/jfruehe/

  16. #691
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by Dresdenboy View Post
    OTOH saving die area and thus leakage (which would limit performance otherwise) doesn't look to be a bad choice. So why are you defending a monolithic implementation so heavily?
    Die area does not equate to leakage; transistors do.

    Faster transistors also increase Ioff exponentially.

  17. #692
    Xtreme Addict
    Join Date
    Apr 2007
    Location
    canada
    Posts
    1,886
    Can't wait for that particular moment... so much anticipation has been building over the last 3 years.
    WILL CUDDLE FOR FOOD

    Quote Originally Posted by JF-AMD View Post
    Dual proc client systems are like sex in high school. Everyone talks about it but nobody is really doing it.

  18. #693
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by Chumbucket843 View Post
    Die area does not equate to leakage; transistors do.

    Faster transistors also increase Ioff exponentially.
    I admit I used a simplification. So the transistors populating said die area will leak less.

    But why do you mention faster transistors? Pipelining per se is about cutting work into smaller pieces, not doing it with faster transistors. Otherwise I probably wouldn't need to further pipeline the circuit in question.

    Example: an FP multiplier has a latency of 1000 ps, or 1 ns. So if I use it in one pipeline stage (for simplicity we leave out operand catching etc.) I could clock it at 1 GHz, being able to feed it once per 1 ns and get a result at the same rate. Latency is just one cycle.

    With 2 pipeline stages, some additional latch overhead and some inefficiency due to cutting the multiplier into two roughly equally fast pieces, the overall latency could become 1100 ps. But I could clock it at 1.8 GHz with two stages of 550 ps each, feed the multiplier at that rate and get results at that rate. Latency would be 2 cycles. Another slight disadvantage would be that there could be up to 2 multiplications in flight at any time vs. one in the 1-cycle version. Two muls mean more power consumption. OTOH I don't increase energy per instruction.

    That's the principle.
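    The worked example above can be sketched in a few lines; this is a toy model using only the post's own numbers (1000 ps of combinational delay, ~100 ps of latch overhead per extra cut), while real designs have more overheads:

    ```python
    # Toy model of the pipelining trade-off: more stages shorten the
    # per-stage critical path (raising the clock) at the cost of some
    # latch overhead added to the end-to-end latency.

    def pipeline_stats(total_delay_ps, stages, overhead_ps_per_cut=100):
        # Each extra cut adds latch overhead to the total latency.
        latency_ps = total_delay_ps + overhead_ps_per_cut * (stages - 1)
        stage_ps = latency_ps / stages   # critical path per stage
        clock_ghz = 1000.0 / stage_ps    # 1000 ps in a nanosecond
        return clock_ghz, latency_ps

    print(pipeline_stats(1000, 1))  # 1.0 GHz, 1000 ps total latency
    print(pipeline_stats(1000, 2))  # ~1.82 GHz, 1100 ps total latency
    ```

    Throughput stays at one result per cycle in both cases; the two-stage version simply has more, faster cycles.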
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  19. #694
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by Hans de Vries View Post
    It's no coincidence that the architects behind the very long SIMD words
    (256 bit, 512 bit and longer) are Doug Carmean and Eric Sprangle who joined
    Intel from Ross technologies.

    These are exactly the Hyperpipelining specialists at Intel:

    (1) They co-authored the original hyperpipelining paper:
    Increasing Processor Performance by Implementing Deeper Pipelines

    (2) They led the original ~60-stage hyperpipelined Nehalem project.
    http://www.theinquirer.net/inquirer/...em-slated-2005

    (3) They initiated the Larrabee project. One of the main ideas behind
    Larrabee is to achieve a theoretical maximum number of FLOPs on a
    certain die with a limited number of transistors. A fourfold hyperpipelined
    128-bit unit running at 4.8 GHz can produce 512-bit results at 1.2 GHz
    using only 25% (+ a bit) of the transistors of a non-hyperpipelined unit.
    ftp://download.intel.com/technology/...abee_paper.pdf
    http://www.drdobbs.com/high-performa...ting/216402188
    Your post is pure speculation. You really don't know who actually worked on the architecture of SB. Also, even if we assume those were the same people who worked on NetBurst, you still don't have even a little bit of info on how AVX was really implemented. I can remind you that "real" 128-bit SSE was first implemented by the Haifa team, which currently works on SB.


    The SIMD units are the easiest (of all units) to hyperpipeline. All instructions
    which could cause problems for hyperpipelining have been systematically
    left out of the AVX and LNI specifications. (for instance data shuffles
    crossing 128 bit boundaries)
    Your assumptions are not necessarily true. While a general multiplication algorithm looks relatively easy to serialize, that is not necessarily the case in real life, since multiple heuristics may be added to a HW algorithm to make it use less power/space, make it faster, etc. I heard it was a big challenge for Intel to implement the fast radix-16 divider in Penryn. Also, while I don't know how much space an FP multiplier consumes, I may assume that hyperpipelining could consume more space (for example, to save intermediate results in a multiplication loop) than implementing an additional multiplier. Anyway, even in NetBurst Intel implemented "double pumping" only for some integer ops and decided not to implement a double-pumped ALU for complex integer operations such as divide/multiply.
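    For what it's worth, the transistor arithmetic in point (3) of the quoted post can be checked directly. This sketch only reproduces the quoted numbers (128 bits, 4.8 GHz, fourfold hyperpipelining) and ignores the "+ a bit" of pipelining overhead the quote mentions; it takes no side in the feasibility debate:

    ```python
    # A narrow unit clocked N times faster can emulate an N-times-wider
    # SIMD unit at the base clock, using roughly 1/N of the datapath
    # transistors (pipelining overhead ignored in this sketch).

    def hyperpipelined_equivalent(narrow_bits, fast_clock_ghz, factor):
        wide_bits = narrow_bits * factor           # effective SIMD width
        result_rate_ghz = fast_clock_ghz / factor  # wide results per second
        transistor_share = 1.0 / factor            # cost vs. a full-width unit
        return wide_bits, result_rate_ghz, transistor_share

    # Fourfold hyperpipelined 128-bit unit at 4.8 GHz:
    print(hyperpipelined_equivalent(128, 4.8, 4))  # (512, 1.2, 0.25)
    ```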

  20. #695
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by kl0012 View Post
    Your post is pure speculation. You really don't know who actually worked on the architecture of SB. Also, even if we assume those were the same people who worked on NetBurst, you still don't have even a little bit of info on how AVX was really implemented. I can remind you that "real" 128-bit SSE was first implemented by the Haifa team, which currently works on SB.
    True. Sandy Bridge was designed by the Haifa team.

    Nehalem was Oregon, and so is Haswell.


    Your assumptions are not necessarily true. While a general multiplication algorithm looks relatively easy to serialize, that is not necessarily the case in real life, since multiple heuristics may be added to a HW algorithm to make it use less power/space, make it faster, etc. I heard it was a big challenge for Intel to implement the fast radix-16 divider in Penryn. Also, while I don't know how much space an FP multiplier consumes, I may assume that hyperpipelining could consume more space (for example, to save intermediate results in a multiplication loop) than implementing an additional multiplier. Anyway, even in NetBurst Intel implemented "double pumping" only for some integer ops and decided not to implement a double-pumped ALU for complex integer operations such as divide/multiply.
    Intel double pumped the ALUs and turned them into "fireballs" by their own admission. Double pumping the FPU sounds even crazier.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  21. #696
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Dresdenboy View Post
    I admit I used a simplification. So the transistors populating said die area will leak less.

    But why do you mention faster transistors? Pipelining per se is about cutting work into smaller pieces, not doing it with faster transistors. Otherwise I probably wouldn't need to further pipeline the circuit in question.

    Example: an FP multiplier has a latency of 1000 ps, or 1 ns. So if I use it in one pipeline stage (for simplicity we leave out operand catching etc.) I could clock it at 1 GHz, being able to feed it once per 1 ns and get a result at the same rate. Latency is just one cycle.

    With 2 pipeline stages, some additional latch overhead and some inefficiency due to cutting the multiplier into two roughly equally fast pieces, the overall latency could become 1100 ps. But I could clock it at 1.8 GHz with two stages of 550 ps each, feed the multiplier at that rate and get results at that rate. Latency would be 2 cycles. Another slight disadvantage would be that there could be up to 2 multiplications in flight at any time vs. one in the 1-cycle version. Two muls mean more power consumption. OTOH I don't increase energy per instruction.

    That's the principle.
    You contradict yourself in your own argument.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  22. #697
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by kl0012 View Post
    Your post is pure speculation. You really don't know who actually worked on the architecture of SB. Also, even if we assume those were the same people who worked on NetBurst, you still don't have even a little bit of info on how AVX was really implemented. I can remind you that "real" 128-bit SSE was first implemented by the Haifa team, which currently works on SB.

    Your assumptions are not necessarily true. While a general multiplication algorithm looks relatively easy to serialize, that is not necessarily the case in real life, since multiple heuristics may be added to a HW algorithm to make it use less power/space, make it faster, etc. I heard it was a big challenge for Intel to implement the fast radix-16 divider in Penryn. Also, while I don't know how much space an FP multiplier consumes, I may assume that hyperpipelining could consume more space (for example, to save intermediate results in a multiplication loop) than implementing an additional multiplier. Anyway, even in NetBurst Intel implemented "double pumping" only for some integer ops and decided not to implement a double-pumped ALU for complex integer operations such as divide/multiply.
    These observations about hyperpipelining are really beyond any
    reasonable doubt.

    I have designed IEEE-compatible floating-point units myself, not only the usual
    multiply/add ones but also much more complicated ones, like a fully pipelined
    FP complex-function unit which can output each cycle the result of any of a
    square root, reciprocal, exponent, logarithm, sine/cosine, arcsine/arccosine,
    while having any mix of these in the pipeline simultaneously.

    So I think you may trust me on this...


    Regards, Hans

  23. #698
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Well, a month is left until the autumn IDF, when things will get clearer on SB.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  24. #699
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by Hans de Vries View Post
    These observations about hyperpipelining are really beyond any
    reasonable doubt.

    I have designed IEEE-compatible floating-point units myself, not only the usual
    multiply/add ones but also much more complicated ones, like a fully pipelined
    FP complex-function unit which can output each cycle the result of any of a
    square root, reciprocal, exponent, logarithm, sine/cosine, arcsine/arccosine,
    while having any mix of these in the pipeline simultaneously.

    So I think you may trust me on this...


    Regards, Hans
    While I really believe you designed all that stuff (I designed some functional blocks myself, including an adder and a multiplier, during a university VLSI course), it is still a bit cheeky to think that no one else can do it better (or at least in a different way) than you.
    I also trust Intel's engineers about the difficulty of designing efficient execution units.
    http://www.intel.com/technology/itj/...er/8-radix.htm

  25. #700
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by Dresdenboy View Post
    I admit I used a simplification. So the transistors populating said die area will leak less.

    But why do you mention faster transistors? Pipelining per se is about cutting work into smaller pieces, not doing it with faster transistors. Otherwise I probably wouldn't need to further pipeline the circuit in question.
    Using die area as an estimate for static power can be very misleading. Leakage increases linearly with transistor count and exponentially with drive current.
    Example: an FP multiplier has a latency of 1000 ps, or 1 ns. So if I use it in one pipeline stage (for simplicity we leave out operand catching etc.) I could clock it at 1 GHz, being able to feed it once per 1 ns and get a result at the same rate. Latency is just one cycle.

    With 2 pipeline stages, some additional latch overhead and some inefficiency due to cutting the multiplier into two roughly equally fast pieces, the overall latency could become 1100 ps. But I could clock it at 1.8 GHz with two stages of 550 ps each, feed the multiplier at that rate and get results at that rate. Latency would be 2 cycles. Another slight disadvantage would be that there could be up to 2 multiplications in flight at any time vs. one in the 1-cycle version. Two muls mean more power consumption. OTOH I don't increase energy per instruction.

    That's the principle.
    I understand the concept of pipelining.

    If speed is critical you might want to use a pulsed latch.

