AMD cuts to the core with 'Bulldozer' Opterons

**Hans de Vries** · 08-11-2010, 11:37 AM

Originally Posted by kl0012

While I really believe you designed all that stuff (I designed some functional blocks myself, including an adder and a multiplier, during a university VLSI course), it is still a bit presumptuous to think that no one else can do it better (or at least in a different way) than you.
I also trust Intel's engineers about the difficulty of designing efficient execution units.
http://www.intel.com/technology/itj/...er/8-radix.htm

If a design is "feedback free" then hyperpipelining is a fully automated
process. Synopsys Design Compiler has had such features for years.
The software has to find the locations to which the signals have
propagated after a 1/2 cycle (or 1/N cycle in general) and then place
the flipflops there which will constitute the intermediate pipeline stages.

Intel (that is, the process guys) has probably developed its own
software specifically adapted to Intel's own process physics model.

"Feedback free" is when the logic doesn't need to know what it just
calculated in the previous cycle. The P4 ALU was not feedback free
because it needs results from the previous cycle:

non-hyperpipelined:

----> C = A + B;
----
----> E = C + D;
----

hyperpipelined:

----> C = A + B;
----> E = C + D;
---->

The problem is that the result C is not yet known after a 1/2 cycle,
so you have to design logic which works with the intermediate result
instead. There exist tricks which do so. In general these tricks are
different for different functions and sometimes there are no tricks.

Hyperpipelining for circuits which are not "feedback free" is a science
in its own right. The risk is that all these tricks blow up the size of
the circuits as well as their power consumption.
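A rough sketch of the point above, in Python. It only counts cycles and assumes a dependent operation cannot start until its predecessor has left the last pipeline stage (none of the intermediate-result tricks). The function name and the idealised "no latch overhead" timing are assumptions made for illustration, not anything taken from the P4 design.

```python
def cycles_needed(num_ops: int, stages: int, dependent: bool) -> int:
    """Short clock cycles needed to finish num_ops on a pipeline with `stages` stages.

    Assumption: one operation may enter per short cycle, and a dependent
    operation has to wait until the previous result leaves the last stage.
    """
    if num_ops == 0:
        return 0
    if dependent:
        # C = A + B must be complete before E = C + D can even start.
        return num_ops * stages
    # Independent operations overlap: fill the pipe once, then one result per cycle.
    return stages + (num_ops - 1)


def time_in_full_cycles(num_ops: int, stages: int, dependent: bool) -> float:
    """Same count expressed in original (unsplit) clock periods.

    Idealised: each short cycle is 1/stages of the original period.
    """
    return cycles_needed(num_ops, stages, dependent) / stages


if __name__ == "__main__":
    for dep in (False, True):
        label = "dependent chain" if dep else "feedback free"
        print(label, time_in_full_cycles(100, 1, dep), time_in_full_cycles(100, 2, dep))
```

Under these assumptions the feedback-free stream finishes in roughly half the time once the pipeline is split in two, while the dependent chain gains nothing, which is why extra logic that works on intermediate results is needed.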

Regards, Hans

**FlanK3r** · 08-11-2010, 12:09 PM

Is there any idea how many transistors Bulldozer will have?

**Dresdenboy** · 08-11-2010, 12:44 PM

Originally Posted by Chumbucket843

using die area as an estimation for static power can be very misleading. leakage increases linearly with transistors and exponentially with drive current.

Ok, then let me rephrase it: double pumping uses fewer transistors than a monolithic implementation.

Originally Posted by Chumbucket843

i understand the concept of pipelining.

if speed is critical you might want to use a pulsed latch.

No prob. The example was also targeted at other possibly interested readers. Unfortunately it didn't work in one case.

Originally Posted by savantu

You contradict yourself in your own argument.

Wrong. In a pipeline like those discussed here there is a lot of logic consisting of millions of transistors.

In one pipeline stage there are several layers of many transistors creating the desired output of e.g. an adder. The longest possible delay path is the well known critical path.

Now you can have many fast or fewer slow transistors in such a circuit at the same clock frequency. The slower ones won't do as much per cycle and might need another pipe stage to get their work done. In my example I said that I would cut the multiplier circuit and do half the work per pipeline stage. There is no need to use faster and more power-hungry transistors. Yet you could apply a faster clock frequency. It's as simple as that.

**Opteron146** · 08-11-2010, 01:02 PM

Originally Posted by -Boris-

What? But I want 128Bit 3DNow! Pro++.

And there is still wasted space, if they had the resources for a complete overhaul on the layout they would save some, and possibly increase performance.

The wasted space is used up now for Llano, maybe SSE4.1, maybe even AVX, perhaps 3DNow!++

**Opteron146** · 08-11-2010, 01:08 PM

Originally Posted by JF-AMD

Wow, that is pretty much a description of the whole bulldozer architecture. No need to double everything, only the things that matter.

Haha, but that was initially meant to explain Sandy's FPU ;-)

But perhaps that's a nice idea for some presentation slides:
double xy unit because ... <add great performance gain>
No doubled xz unit, because ... <add minor gains>.
.
.

At least if there are several doubled units ;-)

In any case, everybody is looking forward to Hot Chips.

One additional question on the die shots topic:
Might we see at least a partial die shot, e.g. of one BD module? As in the case of Llano, where there are detailed die shots of the "K10.6" core, but not of the whole Llano die, including the GPU.

thanks

Opteron

**rcofell** · 08-11-2010, 01:13 PM

Originally Posted by savantu

You contradict yourself in your own argument.

A little late on this since I'm a slow writer at times, but I thought I'd still throw in my 2 cents.

He doesn't; you have the contexts mixed up.

(This is a topic that crosses the logic/physical design and process boundaries.)

"Faster transistors" refers to using different physical sizing/doping to achieve a lower propagation delay (through higher drive strength, etc.), which is independent of the pipeline clock speed. The converse does not hold, however: the pipeline clock speed depends on the speed of the transistors. The minimum clock period is bounded by the worst-case delay path through the logic transistors plus the setup time of the pipeline latches (without going into latch hold time and clock jitter/offset).

In his hypothetical scenario, he validly describes a situation where the actual speed (process parameters) of the logic-transistors does not have to be modified, while the clock-speed of the pipeline/unit as a whole can be raised.
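To put rough numbers on that bound, here is a minimal Python sketch. The delay figures (800 ps of logic, 50 ps of latch setup) are made up for illustration, and the function name is hypothetical. It also assumes the logic can be cut into equally long pieces and ignores hold time, jitter and skew, as above.

```python
def max_clock_ghz(logic_delay_ps: float, stages: int, setup_ps: float) -> float:
    """Upper bound on clock frequency for a block split into `stages` pipeline stages.

    Assumes the worst-case logic path divides evenly across the stages and that
    every stage pays the same latch setup time. Transistor speed is unchanged.
    """
    min_period_ps = logic_delay_ps / stages + setup_ps
    return 1000.0 / min_period_ps  # 1000 ps per ns, so this is in GHz


if __name__ == "__main__":
    one_stage = max_clock_ghz(800.0, 1, 50.0)
    two_stage = max_clock_ghz(800.0, 2, 50.0)
    print(f"1 stage: {one_stage:.2f} GHz, 2 stages: {two_stage:.2f} GHz")
```

With the same transistors, the two-stage version tolerates a noticeably higher clock, though less than double, because the setup time is paid once per stage.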

**JF-AMD** · 08-11-2010, 05:29 PM

Originally Posted by Opteron146

One additional question to the die shots topic:
Might we see at least a partial die shot, e.g. of one BD module?

Nope, there are no die shots in the presentation.

**kl0012** · 08-11-2010, 07:13 PM

Originally Posted by Hans de Vries

If a design is "feedback free" then hyperpipelining is a fully automated
process. Synopsys Design Compiler has had such features for years.
The software has to find the locations to which the signals have
propagated after a 1/2 cycle (or 1/N cycle in general) and then place
the flipflops there which will constitute the intermediate pipeline stages.

Intel (that is, the process guys) has probably developed its own
software specifically adapted to Intel's own process physics model.

"Feedback free" is when the logic doesn't need to know what it just
calculated in the previous cycle. The P4 ALU was not feedback free
because it needs results from the previous cycle:

non-hyperpipelined:

----> C = A + B;
----
----> E = C + D;
----

hyperpipelined:

----> C = A + B;
----> E = C + D;
---->

The problem is that the result C is not yet known after a 1/2 cycle,
so you have to design logic which works with the intermediate result
instead. There exist tricks which do so. In general these tricks are
different for different functions and sometimes there are no tricks.

Hyperpipelining for circuits which are not "feedback free" is a science
in its own right. The risk is that all these tricks blow up the size of
the circuits as well as their power consumption.

Regards, Hans

I specifically bolded the most important part of your post. This is why I mentioned multiplication/division.

**cegras** · 08-11-2010, 08:25 PM

Originally Posted by kl0012

I specifically bolded the most important part of your post. This is why I mentioned multiplication/division.

Do you actually understand what he is saying? It seems as if you're retreating behind your assumption that only intel employees understand what's going on.

**kl0012** · 08-11-2010, 09:02 PM

Originally Posted by cegras

Do you actually understand what he is saying? It seems as if you're retreating behind your assumption that only intel employees understand what's going on.

Yes, I fully understand what he said. But when designing an efficient multiplier (from a space/power perspective), you probably can't avoid an iterative part of the multiplier tree, which is hard to pipeline.

**Hans de Vries** · 08-11-2010, 11:27 PM

Originally Posted by kl0012

Yes, I fully understand what he said. But when designing an efficient multiplier (from a space/power perspective), you probably can't avoid an iterative part of the multiplier tree, which is hard to pipeline.

FP multipliers have been fully pipelined for many years now...

The CRAY-1 had fully pipelined 64 bit FP multipliers in 1975
http://en.wikipedia.org/wiki/Cray-1

The Intel i860 had an on chip fully pipelined 64 bit FP multiplier in 1989
http://en.wikipedia.org/wiki/Intel_i860

The Xilinx Virtex-7 family will have an FPGA with almost 4000 pipelined
multiplier blocks, each with a size of 25x18 bits, with which you can build
all kinds of multipliers in any way you want.
http://www.xilinx.com/support/docume...s_Overview.pdf

Regards, Hans

**Dresdenboy** · 08-11-2010, 11:42 PM

Originally Posted by Dresdenboy

The first version of the chart (said to be wrong in 1)) contained "AVX LO" and "AVX HI" units, also drawn at the same width as the 128 bit units. Maybe they're not even using double pumping but other techniques like wavefronts (less likely).

With "wavefronts", which is actually called "wave pipelining" (my brain mixed this up a bit; "wavefront" is used in some GPUs), there isn't even a need to add latches to get the pipelined behaviour:
http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf

**kl0012** · 08-12-2010, 12:52 AM

Originally Posted by Hans de Vries

FP multipliers have been fully pipelined for many years now...

The CRAY-1 had fully pipelined 64 bit FP multipliers in 1975
http://en.wikipedia.org/wiki/Cray-1

The Intel i860 had an on chip fully pipelined 64 bit FP multiplier in 1989
http://en.wikipedia.org/wiki/Intel_i860

The Xilinx Virtex-7 family will have an FPGA with almost 4000 pipelined
multiplier blocks, each with a size of 25x18 bits, with which you can build
all kinds of multipliers in any way you want.
http://www.xilinx.com/support/docume...s_Overview.pdf

Regards, Hans

That's about the actual implementation of the multiplier (as I said above). You may choose a parallel implementation, and then you have no problem splitting it into further stages. But such an implementation consumes much more space and power. On the other hand, you may choose an iterative implementation and still get a fully pipelined multiplier if you can implement the iterative stage in one cycle, but this specific stage cannot be pipelined further (in case you want to raise the frequency). I know nothing about the actual implementation of the multipliers in Intel's CPUs, but my point is that "pipelineability" depends on specific design decisions.

**Hans de Vries** · 08-12-2010, 01:36 AM

Originally Posted by kl0012

That's about the actual implementation of the multiplier (as I said above). You may choose a parallel implementation, and then you have no problem splitting it into further stages. But such an implementation consumes much more space and power. On the other hand, you may choose an iterative implementation and still get a fully pipelined multiplier if you can implement the iterative stage in one cycle, but this specific stage cannot be pipelined further (in case you want to raise the frequency). I know nothing about the actual implementation of the multipliers in Intel's CPUs, but my point is that "pipelineability" depends on specific design decisions.

Intel's multipliers, as well as those of almost everybody else, are variations of
the Booth-encoded Wallace tree multiplier.

These are generally pipelined except for the small ones which operate
within a single cycle.
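For readers who have not met the structure, here is a minimal behavioural sketch in Python of a radix-4 Booth-encoded multiplier with a Wallace-style carry-save reduction. It is a textbook-style model for unsigned operands, not a description of any Intel or AMD circuit. The function names, the 8-bit default width and the simple "group rows in threes" reduction order are assumptions chosen for readability.

```python
def booth_radix4_digits(y: int, n_bits: int) -> list[int]:
    """Recode an unsigned n_bits-wide multiplier into radix-4 Booth digits in {-2..2}."""
    digits = []
    prev = 0  # implicit bit to the right of bit 0
    # One extra digit so the zero "sign" position of the unsigned value is covered.
    for i in range(n_bits // 2 + 1):
        b0 = (y >> (2 * i)) & 1
        b1 = (y >> (2 * i + 1)) & 1
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits


def carry_save_add(a: int, b: int, c: int, mask: int) -> tuple[int, int]:
    """3:2 compressor: three rows in, a sum row and a carry row out (no carry ripple)."""
    sum_row = a ^ b ^ c
    carry_row = (((a & b) | (a & c) | (b & c)) << 1) & mask
    return sum_row, carry_row


def booth_wallace_multiply(x: int, y: int, n_bits: int = 8) -> tuple[int, int]:
    """Return (product, number of reduction levels) for unsigned n_bits operands."""
    mask = (1 << (2 * n_bits)) - 1
    # Partial products: x times a Booth digit, shifted into place, kept as
    # two's complement rows of width 2 * n_bits. Hardware would shift/negate here.
    rows = [((x * d) << (2 * i)) & mask
            for i, d in enumerate(booth_radix4_digits(y, n_bits))]
    levels = 0
    while len(rows) > 2:
        reduced = []
        for j in range(0, len(rows) - 2, 3):
            reduced.extend(carry_save_add(rows[j], rows[j + 1], rows[j + 2], mask))
        reduced.extend(rows[len(rows) - len(rows) % 3:])  # rows left over this level
        rows = reduced
        levels += 1
    # The final carry-propagate adder combines the last two rows.
    return sum(rows) & mask, levels


if __name__ == "__main__":
    print(booth_wallace_multiply(200, 123))
```

The reduction levels have no feedback between them, so in this model the boundaries between levels (and before the final adder) are places where pipeline registers could be inserted.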

Regards, Hans

**kl0012** · 08-12-2010, 04:04 AM

Originally Posted by Hans de Vries

Intel's multipliers, as well as those of almost everybody else, are variations of
the Booth-encoded Wallace tree multiplier.

These are generally pipelined except for the small ones which operate
within a single cycle.

Regards, Hans

I don't see how your post contradicts mine. You call it a Wallace tree, I call it a multiplier tree. Anyway, Wallace tree arrays may have different sizes. If I remember correctly, the Pentium 4 multiplication array is smaller than 64x64, so it needs additional iterations to execute DP and EP multiplications. BTW, in all multipliers the Wallace tree is a stage by itself, and it is not usual to break it into additional stages.
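The iterative case can be sketched like this in Python. The 64x16 array size is purely hypothetical (the actual Pentium 4 array dimensions are not given in this thread), and the function name is made up. The point is only that a smaller array needs several passes, with the accumulator feeding back into the next pass.

```python
def iterative_multiply(x: int, y: int, width: int = 64, slice_bits: int = 16) -> tuple[int, int]:
    """Multiply two unsigned `width`-bit values with a narrower multiplier array.

    Each pass multiplies x by one slice of y and folds the result into an
    accumulator. Returns (product, number of passes through the array).
    """
    if width % slice_bits != 0:
        raise ValueError("width must be a multiple of slice_bits")
    slice_mask = (1 << slice_bits) - 1
    acc = 0
    passes = 0
    for shift in range(0, width, slice_bits):
        y_slice = (y >> shift) & slice_mask
        # Feedback: this pass needs the accumulator produced by the previous pass.
        acc += (x * y_slice) << shift
        passes += 1
    return acc, passes


if __name__ == "__main__":
    product, passes = iterative_multiply(2**64 - 1, 2**64 - 1)
    print(hex(product), passes)
```

A full-width array would need a single pass but correspondingly more area, which is the space/power trade-off being argued about here.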

**blindbox** · 08-12-2010, 02:28 PM

Originally Posted by JF-AMD

yes

That's a promise!

There weren't any die shots of your Evergreen graphics chips (although looking back, you guys have always posted die shots for processors).

**JF-AMD** · 08-12-2010, 04:21 PM

Well, Bulldozer is my product and I am the one in charge of disclosures. We have a cadence for when we release info and things like die shots, cache sizes, clock speeds, pricing, launch dates, benchmarks, etc. are all things that help the other guys, so we have no desire to disclose them too early.

As for GPU, I am not even in the same country as that part of the company, so I really can't comment on them.

**~~terrace215~~** · 08-12-2010, 04:31 PM

Originally Posted by JF-AMD

Well, Bulldozer is my product and I am the one in charge of disclosures. We have a cadence for when we release info and things like die shots, cache sizes, clock speeds, pricing, launch dates, benchmarks, etc. are all things that help the other guys, so we have no desire to disclose them too early.

I'm sensing no die shot until the Nov 9 Analyst Day, at which point the launch window will be narrowed to "H2 2011" and cache size will be announced (as the die will give it away), no pricing, clock speeds or benchmarks until 2011.

**haylui** · 08-12-2010, 04:50 PM

Originally Posted by terrace215

I'm sensing no die shot until the Nov 9 Analyst Day, at which point the launch window will be narrowed to "H2 2011" and cache size will be announced (as the die will give it away), no pricing, clock speeds or benchmarks until 2011.

ok noted....thanks for the info
