
Thread: Can Llano do AVX?

  1. #1
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235

    Can Llano do AVX?

    A few observations suggest that AMD's Llano could do AVX instructions.

    1) A reasonably large new block next to the FP register file.
    2) Something that could be a new 3-way extra decoding stage in front of the FP units.
    3) The large increase in size of the reorder buffer (3x24 to 3x32 or 3x36)



    -It would be faster even if it's still using 128 bit hardware for the 256 bit
    operations since typically many time slots are unused in FP units.

    -The AVX performance would be ultimately limited by the cache bandwidth
    to/from the SSE/AVX units (32 byte/cycle versus 48 byte/cycle for Sandy
    Bridge)

    -The 256 bit operations would be split into independent 128 bit operations
    which would explain the increase in size of the reorder buffer.

    -The size of the 3-way decode pack stage in front of the Integer units
    has also increased, suggesting that something has been added to the
    decoding units (cache access for 2x128 bit words?)

    ------------------------------

    Some extra points:

    The second level TLB units for the data cache have been doubled from
    512 entries to 1024 entries.

    There is extra integer logic. A good guess would be a faster version
    of the Integer divider. One that can produce multiple result bits/cycle
    like the ones in the Core2 and Nehalem architecture.


    Regards, Hans
    Last edited by Hans de Vries; 04-22-2010 at 10:56 AM.

  2. #2
    Xtreme Mentor
    Join Date
    Jul 2008
    Location
    Shimla , India
    Posts
    2,631
    I have to say bravo AMD and bravo to you Hans.

    My question is will there be any kind of FMA implementation? I know that will be difficult but is it possible?
    Coming Soon

  3. #3
    Xtreme Enthusiast
    Join Date
    Mar 2005
    Posts
    644
    Basically, your statements are that AMD didn't just take a current Deneb K10 core and drop it into Llano, but that it received modest modifications to the K10 core design. I don't know anything about how to recognize the transistor blocks (with the exception of the usually obvious L2 and L3 caches), but if they were added in a revision of the K10 core for Llano, then it is interesting. Leaving aside AVX, since it requires applications capable of using it, how do you think the new integer unit and the doubled TLB could impact performance?

  4. #4
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by ajaidev View Post
    My question is will there be any kind of FMA implementation? I know that will be difficult but is it possible?
    I think it should be possible.


    Regards, Hans

  5. #5
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by zir_blazer View Post
    Basically, your statements are that AMD didn't just take a current Deneb K10 core and drop it into Llano, but that it received modest modifications to the K10 core design. I don't know anything about how to recognize the transistor blocks (with the exception of the usually obvious L2 and L3 caches), but if they were added in a revision of the K10 core for Llano, then it is interesting. Leaving aside AVX, since it requires applications capable of using it, how do you think the new integer unit and the doubled TLB could impact performance?
    The larger TLB is good for newer, larger workloads. A fast Integer divide
    is a bit overdue compared to Core/Nehalem. I think the somewhat larger
    L1 caches (8 transistors/bit instead of 6 transistors/bit) opened up the
    extra space in the layout needed for a fast integer divider.
    Any impact is very program specific.


    Regards, Hans

  6. #6
    Xtreme Member
    Join Date
    Aug 2004
    Posts
    210
    Hmm ... even AVX? I had only thought about SSSE3 and SSE4.1, because those have already been mentioned in AMD's CPUID PDF since 2008.
    But who knows, maybe that update was just for the canceled 1st-generation Fusion cores. It's now 2010, more changes are possible; AVX, why not.

    The main new feature of AVX besides the 256 bit width is 3-operand instructions; could that also explain some of the changes?

    Thx

    Opteron146

  7. #7
    Xtreme Member
    Join Date
    Jan 2010
    Posts
    323
    Hmmm maybe I'm the only one here...but what is AVX? Some kind of instructions like SSE?

  8. #8
    Xtreme Member
    Join Date
    Aug 2004
    Posts
    210
    Quote Originally Posted by vitchilo View Post
    Hmmm maybe I'm the only one here...but what is AVX? Some kind of instructions like SSE?
    Yes, it's SSE's successor; it widens the registers from 128 bit to 256 bit and instructions can handle 3 operands instead of 2.
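
    To make the 3-operand point concrete, a tiny illustration (mine, not from this thread; the function names are made up): the first intrinsic compiles to the destructive 2-operand SSE form, the second to the 3-operand VEX/AVX form.

    Code:
    #include <immintrin.h>

    /* SSE: addps is destructive ("addps xmm0, xmm1" does xmm0 += xmm1),
       so keeping the original value of 'a' needs an extra register copy. */
    __m128 add_sse(__m128 a, __m128 b)
    {
        return _mm_add_ps(a, b);        /* 4 floats, 128 bit registers */
    }

    /* AVX: the VEX encoding has a separate destination
       ("vaddps ymm0, ymm1, ymm2"), so both sources survive,
       and the ymm registers are 256 bits wide (8 floats). */
    __m256 add_avx(__m256 a, __m256 b)
    {
        return _mm256_add_ps(a, b);
    }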

    Just remembered that Charlie wrote about Llano some time ago, too:

    The core itself is changed a bit, but if you are familiar with the current 45nm K10h parts, you will feel right at home. AMD upped the L2 cache to 1MB per core, up from the current 512K, but it maintains the current 16-way associativity. The instruction window is enlarged to 84 entries so things should be a bit more efficient, and the instruction scheduler is now 30 entries for Integer, 36 for FP.

    Hardware integer divide is said to be improved and latency for FP instructions has been reduced as well. To fill these windows, there is a better prefetcher, cache lines transition between states faster, and memory fill speed is increased. The TLB is also improved for better residency. Although these little details may not seem like all that much, a percent or three here and there adds up to quite noticeable improvements when everything is added up.
    http://www.semiaccurate.com/2010/02/...nm-llano-core/

    Clarifies some numbers & assumptions.

  9. #9
    Registered User
    Join Date
    Feb 2010
    Location
    Poland
    Posts
    6
    Great job! AVX is probably not possible (this is only a slightly modified K10 core).

    I think that AMD could have modified the branch prediction unit (there is now a lot of free space next to the L1 instruction cache and the branch selectors/branch targets).

    Sorry for my English. Regards

  10. #10
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by Opteron146 View Post
    Yes, it's SSE's successor; it widens the registers from 128 bit to 256 bit and instructions can handle 3 operands instead of 2.

    Just remembered that Charlie wrote about Llano some time ago, too:


    http://www.semiaccurate.com/2010/02/...nm-llano-core/

    Clarifies some numbers & assumptions.
    Indeed,

    This "memory fill speed" might be the bandwidth between the L2 and L1
    caches since the L2 cache doubles but the number of banks is kept
    the same (16)


    Regards, Hans

  11. #11
    Registered User
    Join Date
    Nov 2008
    Posts
    28
    It's interesting that 3dnow! is still being kept around. It's a minuscule amount of die space but it can't be trivial to implement and debug. Company pride I suppose?

  12. #12
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Even if they *could* support AVX in Llano, would AMD really want their first implementation of AVX to be crippled (128b exe units) vs the contemporaneous SB implementation?

    I guess it would be a trade-off. A little better performance than not supporting it vs. bad marketing from head-to-head comparisons with SB on "AVX benchmarks". Might it not be better to wait for a 256b implementation? I suppose that depends on how long it will be for such a successor in the Llano market space.

  13. #13
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by terrace215 View Post
    Even if they *could* support AVX in Llano, would AMD really want their first implementation of AVX to be crippled (128b exe units) vs the contemporaneous SB implementation?

    I guess it would be a trade-off. A little better performance than not supporting it vs. bad marketing from head-to-head comparisons with SB on "AVX benchmarks". Might it not be better to wait for a 256b implementation? I suppose that depends on how long it will be for such a successor in the llano market space.
    It's not that "crippled", not by a factor 2 (=256/128). For example:
    If an SIMD FP add takes 4 clock cycles then:

    128 bit: A+B+C takes 8 clock cycles.
    256 bit: A+B+C takes 9 clock cycles. (using pipelined 128 bit hardware)

    128 bit: A+B+C+D takes 9 clock cycles.
    256 bit: A+B+C+D takes 11 clock cycles. (using pipelined 128 bit hardware)

    It all depends on how many unused time-slots there are due to the data
    dependencies. A bigger bottleneck for Llano would be the L1 cache access
    bandwidth: 32 bytes/cycle for Llano versus 48 bytes/cycle for Sandy Bridge.
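
    As a sanity check on these numbers, a toy model (my own sketch under the stated assumptions, not the actual Llano or Sandy Bridge pipeline): one fully pipelined FP add pipe that issues one 128 bit µop per cycle with a 4-cycle latency, and every 256 bit add cracked into a low and a high 128 bit µop.

    Code:
    #include <stdio.h>

    #define LAT 4   /* assumed FP add latency in cycles */

    /* A µop issues once its inputs are ready and the single pipe has a free
       issue slot (one issue per cycle); it completes LAT cycles later. */
    static int issue(int *next_slot, int inputs_ready)
    {
        int start = inputs_ready > *next_slot ? inputs_ready : *next_slot;
        *next_slot = start + 1;      /* pipe busy for one issue slot only */
        return start + LAT;          /* completion cycle                  */
    }

    int main(void)
    {
        /* 128 bit operands, (A+B)+C, two dependent adds */
        int slot = 0;
        int ab  = issue(&slot, 0);
        int abc = issue(&slot, ab);
        printf("128 bit A+B+C:   %d cycles\n", abc);          /* 8 */

        /* 256 bit operands, each add cracked into lo/hi halves */
        slot = 0;
        int ab_lo  = issue(&slot, 0),     ab_hi  = issue(&slot, 0);
        int abc_lo = issue(&slot, ab_lo), abc_hi = issue(&slot, ab_hi);
        printf("256 bit A+B+C:   %d cycles\n", abc_hi);       /* 9 */

        /* 128 bit operands, (A+B)+(C+D) as a tree of three adds */
        slot = 0;
        int ab2 = issue(&slot, 0);
        int cd2 = issue(&slot, 0);
        int s2  = issue(&slot, ab2 > cd2 ? ab2 : cd2);
        printf("128 bit A+B+C+D: %d cycles\n", s2);           /* 9 */

        /* 256 bit operands, same tree with every add cracked in two */
        slot = 0;
        int ab_l = issue(&slot, 0), ab_h = issue(&slot, 0);
        int cd_l = issue(&slot, 0), cd_h = issue(&slot, 0);
        int s_l  = issue(&slot, ab_l > cd_l ? ab_l : cd_l);
        int s_h  = issue(&slot, ab_h > cd_h ? ab_h : cd_h);
        printf("256 bit A+B+C+D: %d cycles\n", s_h);          /* 11 */

        (void)abc_lo; (void)s_l;
        return 0;
    }

    It prints 8, 9, 9 and 11 cycles, i.e. the figures above.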


    Regards, Hans

  14. #14
    Xtreme Enthusiast
    Join Date
    Mar 2005
    Posts
    644
    Currently, Athlon II X2 and Sempron parts based on Regor (Rev. DA-C2 or DA-C3) have 1 MB of L2 cache. It never made much sense that the cheapest parts had more L2 cache than the bigger, more expensive ones with "only" 512 KB, but they do. Llano would standardize that size.


    Quote Originally Posted by Raqia View Post
    It's interesting that 3dnow! is still being kept around. It's a minuscule amount of die space but it can't be trivial to implement and debug. Company pride I suppose?
    Compatibility. If we go through the entire x86 history, there are a whole bunch of things that aren't useful and could be dropped relatively safely, yet they never are. Besides, there are always obscure things that are still used today. Considering the high likelihood of issues from messing with x86 compatibility, it's better not to touch what is already there.


    Quote Originally Posted by terrace215 View Post
    Even if they *could* support AVX in Llano, would AMD really want their first implementation of AVX to be crippled (128b exe units) vs the contemporaneous SB implementation?

    I guess it would be a trade-off. A little better performance than not supporting it vs. bad marketing from head-to-head comparisons with SB on "AVX benchmarks". Might it not be better to wait for a 256b implementation? I suppose that depends on how long it will be for such a successor in the llano market space.
    That depends on AMD's bet on how long AVX will take to show up in mainstream applications and for its performance to start to matter. Intel did exactly the same thing when SSE was introduced in the first Pentium 3: the hardware was internally 64 bits wide while the instructions were supposed to work on 128 bits.

  15. #15
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Quote Originally Posted by Hans de Vries View Post
    It's not that "crippled", not by a factor 2 (=256/128). For example:
    If an SIMD FP add takes 4 clock cycles then:

    128 bit: A+B+C takes 8 clock cycles.
    256 bit: A+B+C takes 9 clock cycles. (using pipelined 128 bit hardware)

    128 bit: A+B+C+D takes 9 clock cycles.
    256 bit: A+B+C+D takes 11 clock cycles. (using pipelined 128 bit hardware)
    The comparison we want is 256 bit sum[A1 + A2 + ... + A_n] (to use your example) on 256 bit hardware vs 128 bit hardware.

    Say n = 16, just for grins. What's the # clock cycles needed in each case?

  16. #16
    Xtreme Enthusiast
    Join Date
    Aug 2008
    Posts
    577
    Quote Originally Posted by zir_blazer View Post
    Currently, Athlon II X2 and Sempron parts based on Regor (Rev. DA-C2 or DA-C3) have 1 MB of L2 cache. It never made much sense that the cheapest parts had more L2 cache than the bigger, more expensive ones with "only" 512 KB, but they do. Llano would standardize that size.
    That is because the higher-end AMD chips have an L3 cache, and a lot more of it than the 1 MB of L2 amounts to.
    --Intel i5 3570k 4.4ghz (stock volts) - Corsair H100 - 6970 UL XFX 2GB - - Asrock Z77 Professional - 16GB Gskill 1866mhz - 2x90GB Agility 3 - WD640GB - 2xWD320GB - 2TB Samsung Spinpoint F4 - Audigy-- --NZXT Phantom - Samsung SATA DVD--(old systems Intel E8400 Wolfdale/Asus P45, AMD965BEC3 790X, Antec 180, Sapphire 4870 X2 (dead twice))

  17. #17
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by terrace215 View Post
    The comparison we want is 256 bit sum[A1 + A2 + ... + A_n] (to use your example) on 256 bit hardware vs 128 bit hardware.

    Say n = 16, just for grins. What's the # clock cycles needed in each case?
    This particular example is of course an ideal case for 256 bit hardware,
    and here Sandy Bridge's 48 byte/cycle versus Llano's 32 byte/cycle
    L1 cache bandwidth will determine the throughput.

    (Note that in cases like this there is no advantage from HT for Sandy
    Bridge, since a single thread already utilizes 100% of the resources.)
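
    The kind of loop being discussed would look roughly like this (a hypothetical illustration, not a benchmark from this thread; the function name is made up). Every 256 bit add consumes one 32 byte load, so once a couple of independent accumulators hide the add latency, the L1 load bandwidth (32 byte/cycle versus 48 byte/cycle) is the only limit:

    Code:
    #include <stddef.h>
    #include <immintrin.h>

    /* Running sum over packed floats with 256 bit AVX ops.
       Assumes n is a multiple of 16; compile with AVX enabled (-mavx). */
    float sum256(const float *a, size_t n)
    {
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();   /* second accumulator hides add latency */
        for (size_t i = 0; i < n; i += 16) {
            acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(a + i));      /* 32 byte load */
            acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(a + i + 8));  /* 32 byte load */
        }
        __m256 acc = _mm256_add_ps(acc0, acc1);

        /* horizontal reduction of the 8 lanes */
        __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                              _mm256_extractf128_ps(acc, 1));
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        return _mm_cvtss_f32(s);
    }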


    Regards, Hans

  18. #18
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Quote Originally Posted by Hans de Vries View Post
    This particular example is of course an ideal case for 256 bit hardware
    and now Sandy Bridge's 48 byte/cycle versus Llano's 32 byte/cycle
    L1 cache bandwidth will determine the throughput.
    Regards, Hans
    Well, it looks like 32 byte/cycle is enough to keep up (assuming we need 1 (or 2, for the 128-bit hardware) cycle to accumulate the new 256 bit value into the running sum anyhow), so wouldn't it be roughly 2:1 in this case?
    Last edited by terrace215; 04-22-2010 at 06:42 PM.

  19. #19
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by terrace215 View Post
    Well, it looks like 32 byte/cycle is enough to keep up (assuming we need 1 (or 2, for the 128-bit hardware) cycle to accumulate the new 256 bit value into the running sum anyhow), so wouldn't it be roughly 2:1 in this case?
    OK, it seems you have found such an ideal case. It's not unlikely that this
    case will be used as a synthetic benchmark ...

    Regards, Hans.

  20. #20
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Quote Originally Posted by Hans de Vries View Post
    OK, it seems you have found such an ideal case.
    Hooray! I must give full credit though -- it was YOUR example, after all; I just had to compare the 256bit hardware implementation and not let you get away with obscuring the difference through the initial latency.
    Last edited by terrace215; 04-22-2010 at 07:24 PM.

  21. #21
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    awesome job hans! thx for sharing!

    Quote Originally Posted by Hans de Vries View Post
    -It would be faster even if it's still using 128 bit hardware for the 256 bit
    operations since typically many time slots are unused in FP units.
    does that mean an fpu boost even for x87 and sse code? sounds like it... any idea how much faster? 10%?

    Quote Originally Posted by Hans de Vries View Post
    The second level TLB units for the data cache have been doubled from
    512 entries to 1024 entries.
    higher virtualization perf?

    Quote Originally Posted by Hans de Vries View Post
    There is extra integer logic. A good guess would be a faster version
    of the Integer divider. One that can produce multiple result bits/cycle
    like the ones in the Core2 and Nehalem architecture.
    that would be nice! a preview of what's to come in bulldozer?

    Quote Originally Posted by Hans de Vries View Post
    (Note that in this kind of cases there is no advantage from HT for Sandy
    Bridge since a single thread already utilizes 100% of the resources)
    Regards, Hans
    hmmmm really? i didn't know that...
    hmmm do you remember when people started talking about reverse hyper threading? intel can split the fpu, ie hyper threading... amd is going to use one fpu for 2 integer cores... this is what people could have interpreted or misunderstood as reverse hyper threading, right?

    does anybody know how much work needs to be done to offload fpu code like avx to the gpu cores? any idea?

    Quote Originally Posted by terrace215 View Post
    Hooray! I must give full credit though-- it was YOUR example, after all, I just had to compare the 256bit hardware implementation, and not let you get away with obscuring the difference through the initial latency
    you come off extremely rude... just an fyi...

  22. #22
    Xtreme Member
    Join Date
    Aug 2007
    Posts
    282
    If Llano can execute AVX instructions it will be fantastic... but it would be even better if you could use the GPU "wavefronts"/"warps" as SIMD registers (64/32 floats vs 8 for AVX). After all, isn't that a real "fusion"?

    Btw, I really think new processors should remove the old functionality like the defunct 3dnow!, EMMS/x87, etc...
    Last edited by jogshy; 04-22-2010 at 09:14 PM.

  23. #23
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Quote Originally Posted by saaya View Post
    you come off extremely rude... just an fyi...
    That's okay. Judging by your S|A article, you come off extremely uninformed... just an fyi...

  24. #24
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by Hans de Vries View Post
    A few observations suggest that AMD's Llano could do AVX instructions.
    Don't you think that such predictions are a little... childish? To me, it's like trying to predict the hair color of an unborn child based on a picture of its DNA without a complete understanding of the purpose of each molecule in it.
    Do you think you know enough about the internal K10.5 microarchitecture to argue that the 3-operand instructions will not require major modifications of the entire "front-end" and "back-end"?
    Don't you think you're underestimating the complexity of an AVX implementation? As an example, Intel chose not to implement even a microcoded version of FMA in Sandy Bridge, and I think it's not because Intel is lazy.

  25. #25
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by saaya View Post
    does anybody know how much work needs to be done to offload fpu code like avx to the gpu cores? any idea?
    Wow... It is a multi-billion $$$ question. For the last 10 years the world's best minds have been trying to find an answer... without much success yet.
