
Thread: Can Llano do AVX?

  1. #1
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235

    Can Llano do AVX?

    A few observations suggest that AMD's Llano could do AVX instructions.

    1) A reasonably large new block next to the FP register file.
    2) Something that could be a new 3-way extra decoding stage in front of the FP units.
    3) The large increase in size of the reorder buffer (3x24 to 3x32 or 3x36)



    -It would be faster even if it's still using 128 bit hardware for the 256 bit
    operations since typically many time slots are unused in FP units.

    -The AVX performance would be ultimately limited by the cache bandwidth
    to/from the SSE/AVX units (32 byte/cycle versus 48 byte/cycle for Sandy
    Bridge)

    -The 256 bit operations would be split into independent 128 bit operations
    which would explain the increase in size of the reorder buffer.

    -The size of the 3-way decode pack stage in front of the Integer units
    has also increased, suggesting that something has been added to the
    decoding units (cache access for 2x128 bit words?)

    ------------------------------

    Some extra points:

    The second level TLB units for the data cache have been doubled from
    512 entries to 1024 entries.

    There is extra integer logic. A good guess would be a faster version
    of the Integer divider. One that can produce multiple result bits/cycle
    like the ones in the Core2 and Nehalem architecture.


    Regards, Hans
    Last edited by Hans de Vries; 04-22-2010 at 10:56 AM.

  2. #2
    Xtreme Mentor
    Join Date
    Jul 2008
    Location
    Shimla , India
    Posts
    2,631
    I have to say bravo AMD and bravo to you Hans.

    My question is will there be any kind of FMA implementation? I know that will be difficult but is it possible?
    Coming Soon

  3. #3
    Xtreme Enthusiast
    Join Date
    Mar 2005
    Posts
    644
    Basically, your statements are that AMD didn't just take a current Deneb K10 core and drop it into Llano, but that it received modest modifications to the K10 core design. I don't know anything about how to recognize the transistor blocks (with the exception of the usually obvious L2 and L3 caches), but if they were added in a revision of the K10 core for Llano, then it is interesting. Leaving aside AVX, since it requires applications capable of using it, how do you think the new integer unit and the doubled TLB could impact performance?

  4. #4
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by ajaidev View Post
    My question is will there be any kind of FMA implementation? I know that will be difficult but is it possible?
    I think it should be possible.


    Regards, Hans

  5. #5
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by zir_blazer View Post
    Basically, your statements are that AMD didn't just take a current Deneb K10 core and drop it into Llano, but that it received modest modifications to the K10 core design. I don't know anything about how to recognize the transistor blocks (with the exception of the usually obvious L2 and L3 caches), but if they were added in a revision of the K10 core for Llano, then it is interesting. Leaving aside AVX, since it requires applications capable of using it, how do you think the new integer unit and the doubled TLB could impact performance?
    The larger TLB is good for newer, larger workloads. A fast Integer divide
    is a bit overdue compared to Core/Nehalem. I think the somewhat larger
    L1 caches (8 transistors/bit instead of 6 transistors/bit) opened up the
    extra space in the layout needed for a fast integer divider.
    Any impact is very program specific.


    Regards, Hans

  6. #6
    Xtreme Member
    Join Date
    Aug 2004
    Posts
    210
    Hmm ... even AVX? I had only thought about SSSE3 and SSE4.1, because those have already been mentioned in AMD's CPUID PDF since 2008.
    But who knows, maybe that update was just for the canceled 1st-generation Fusion cores. It's now 2010, more changes are possible; AVX, why not.

    The main new feature of AVX besides the 256 bit width is 3-operand instructions; could that also explain some of the changes?

    Thx

    Opteron146

  7. #7
    Xtreme Member
    Join Date
    Jan 2010
    Posts
    323
    Hmmm maybe I'm the only one here...but what is AVX? Some kind of instructions like SSE?

  8. #8
    Xtreme Member
    Join Date
    Aug 2004
    Posts
    210
    Quote Originally Posted by vitchilo View Post
    Hmmm maybe I'm the only one here...but what is AVX? Some kind of instructions like SSE?
    Yes, it's SSE's successor; it widens the registers from 128 bit to 256 bit and instructions can handle 3 operands instead of 2.
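
    To make the 3-operand point concrete, a tiny illustration (mine, not from this thread; the function names are made up): the first intrinsic compiles to the destructive 2-operand SSE form, the second to the 3-operand VEX/AVX form.

    Code:
    #include <immintrin.h>

    /* SSE: addps is destructive ("addps xmm0, xmm1" does xmm0 += xmm1),
       so keeping the original value of 'a' needs an extra register copy. */
    __m128 add_sse(__m128 a, __m128 b)
    {
        return _mm_add_ps(a, b);        /* 4 floats, 128 bit registers */
    }

    /* AVX: the VEX encoding has a separate destination
       ("vaddps ymm0, ymm1, ymm2"), so both sources survive,
       and the ymm registers are 256 bits wide (8 floats). */
    __m256 add_avx(__m256 a, __m256 b)
    {
        return _mm256_add_ps(a, b);
    }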

    Just remembered that Charlie wrote about Llano some time ago, too:

    The core itself is changed a bit, but if you are familiar with the current 45nm K10h parts, you will feel right at home. AMD upped the L2 cache to 1MB per core, up from the current 512K, but it maintains the current 16-way associativity. The instruction window is enlarged to 84 entries so things should be a bit more efficient, and the instruction scheduler is now 30 entries for Integer, 36 for FP.

    Hardware integer divide is said to be improved and latency for FP instructions has been reduced as well. To fill these windows, there is a better prefetcher, cache lines transition between states faster, and memory fill speed is increased. The TLB is also improved for better residency. Although these little details may not seem like all that much, a percent or three here and there adds up to quite noticeable improvements when everything is added up.
    http://www.semiaccurate.com/2010/02/...nm-llano-core/

    Clarifies some numbers & assumptions.

  9. #9
    Registered User
    Join Date
    Feb 2010
    Location
    Poland
    Posts
    6
    Great job! AVX is probably not possible (this is only a slightly modified K10 core).

    I think that AMD could have modified the branch prediction unit (there is now a lot of free space next to the L1 instruction cache and the branch selectors/branch targets).

    Sorry for my English. Regards

  10. #10
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by Opteron146 View Post
    Yes, it's SSE's successor; it widens the registers from 128 bit to 256 bit and instructions can handle 3 operands instead of 2.

    Just remembered that Charlie wrote about Llano some time ago, too:


    http://www.semiaccurate.com/2010/02/...nm-llano-core/

    Clarifies some numbers & assumptions.
    Indeed,

    This "memory fill speed" might be the bandwidth between the L2 and L1
    caches since the L2 cache doubles but the number of banks is kept
    the same (16)


    Regards, Hans

  11. #11
    Registered User
    Join Date
    Nov 2008
    Posts
    28
    It's interesting that 3dnow! is still being kept around. It's a minuscule amount of die space but it can't be trivial to implement and debug. Company pride I suppose?

  12. #12
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Even if they *could* support AVX in Llano, would AMD really want their first implementation of AVX to be crippled (128b exe units) vs the contemporaneous SB implementation?

    I guess it would be a trade-off. A little better performance than not supporting it vs. bad marketing from head-to-head comparisons with SB on "AVX benchmarks". Might it not be better to wait for a 256b implementation? I suppose that depends on how long it will be for such a successor in the Llano market space.

  13. #13
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by terrace215 View Post
    Even if they *could* support AVX in Llano, would AMD really want their first implementation of AVX to be crippled (128b exe units) vs the contemporaneous SB implementation?

    I guess it would be a trade-off. A little better performance than not supporting it vs. bad marketing from head-to-head comparisons with SB on "AVX benchmarks". Might it not be better to wait for a 256b implementation? I suppose that depends on how long it will be for such a successor in the llano market space.
    It's not that "crippled", not by a factor 2 (=256/128). For example:
    If an SIMD FP add takes 4 clock cycles then:

    128 bit: A+B+C takes 8 clock cycles.
    256 bit: A+B+C takes 9 clock cycles. (using pipelined 128 bit hardware)

    128 bit: A+B+C+D takes 9 clock cycles.
    256 bit: A+B+C+D takes 11 clock cycles. (using pipelined 128 bit hardware)

    It all depends on how many unused time-slots there are due to the data
    dependencies. A bigger bottleneck for Llano would be the L1 cache access
    bandwidth: 32 bytes/cycle for Llano versus 48 bytes/cycle for Sandy Bridge.
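
    As a sanity check on these numbers, a toy model (my own sketch under the stated assumptions, not the actual Llano or Sandy Bridge pipeline): one fully pipelined FP add pipe that issues one 128 bit µop per cycle with a 4-cycle latency, and every 256 bit add cracked into a low and a high 128 bit µop.

    Code:
    #include <stdio.h>

    #define LAT 4   /* assumed FP add latency in cycles */

    /* A µop issues once its inputs are ready and the single pipe has a free
       issue slot (one issue per cycle); it completes LAT cycles later. */
    static int issue(int *next_slot, int inputs_ready)
    {
        int start = inputs_ready > *next_slot ? inputs_ready : *next_slot;
        *next_slot = start + 1;      /* pipe busy for one issue slot only */
        return start + LAT;          /* completion cycle                  */
    }

    int main(void)
    {
        /* 128 bit operands, (A+B)+C, two dependent adds */
        int slot = 0;
        int ab  = issue(&slot, 0);
        int abc = issue(&slot, ab);
        printf("128 bit A+B+C:   %d cycles\n", abc);          /* 8 */

        /* 256 bit operands, each add cracked into lo/hi halves */
        slot = 0;
        int ab_lo  = issue(&slot, 0),     ab_hi  = issue(&slot, 0);
        int abc_lo = issue(&slot, ab_lo), abc_hi = issue(&slot, ab_hi);
        printf("256 bit A+B+C:   %d cycles\n", abc_hi);       /* 9 */

        /* 128 bit operands, (A+B)+(C+D) as a tree of three adds */
        slot = 0;
        int ab2 = issue(&slot, 0);
        int cd2 = issue(&slot, 0);
        int s2  = issue(&slot, ab2 > cd2 ? ab2 : cd2);
        printf("128 bit A+B+C+D: %d cycles\n", s2);           /* 9 */

        /* 256 bit operands, same tree with every add cracked in two */
        slot = 0;
        int ab_l = issue(&slot, 0), ab_h = issue(&slot, 0);
        int cd_l = issue(&slot, 0), cd_h = issue(&slot, 0);
        int s_l  = issue(&slot, ab_l > cd_l ? ab_l : cd_l);
        int s_h  = issue(&slot, ab_h > cd_h ? ab_h : cd_h);
        printf("256 bit A+B+C+D: %d cycles\n", s_h);          /* 11 */

        (void)abc_lo; (void)s_l;
        return 0;
    }

    It prints 8, 9, 9 and 11 cycles, i.e. the figures above.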


    Regards, Hans

  14. #14
    Xtreme Enthusiast
    Join Date
    Mar 2005
    Posts
    644
    Currently, Athlon II X2 and Sempron parts based on Regor (Rev. DA-C2 or DA-C3) have 1 MB of L2 cache. It never made much sense that the cheapest parts had more L2 cache than the bigger, more expensive ones with "only" 512 KB, but they do. Llano would standardize that size.


    Quote Originally Posted by Raqia View Post
    It's interesting that 3dnow! is still being kept around. It's a minuscule amount of die space but it can't be trivial to implement and debug. Company pride I suppose?
    Compatibility. If we go through the entire x86 history, there are a whole bunch of things that aren't useful and could be dropped relatively safely, yet they never are. Besides, there are always obscure things that are still used today. Considering the high likelihood of issues from messing with x86 compatibility, it's better not to touch what is already there.


    Quote Originally Posted by terrace215 View Post
    Even if they *could* support AVX in Llano, would AMD really want their first implementation of AVX to be crippled (128b exe units) vs the contemporaneous SB implementation?

    I guess it would be a trade-off. A little better performance than not supporting it vs. bad marketing from head-to-head comparisons with SB on "AVX benchmarks". Might it not be better to wait for a 256b implementation? I suppose that depends on how long it will be for such a successor in the llano market space.
    That depends on AMD's bet on how long AVX will take to show up in mainstream applications and for its performance to start to matter. Intel did exactly the same thing when SSE was introduced in the first Pentium 3: the hardware was internally 64 bits wide while the instructions were supposed to work on 128 bits.

  15. #15
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Quote Originally Posted by Hans de Vries View Post
    It's not that "crippled", not by a factor 2 (=256/128). For example:
    If an SIMD FP add takes 4 clock cycles then:

    128 bit: A+B+C takes 8 clock cycles.
    256 bit: A+B+C takes 9 clock cycles. (using pipelined 128 bit hardware)

    128 bit: A+B+C+D takes 9 clock cycles.
    256 bit: A+B+C+D takes 11 clock cycles. (using pipelined 128 bit hardware)
    The comparison we want is 256 bit sum[A1 + A2 + ... + A_n] (to use your example) on 256 bit hardware vs 128 bit hardware.

    Say n = 16, just for grins. What's the # clock cycles needed in each case?

  16. #16
    Xtreme Enthusiast
    Join Date
    Aug 2008
    Posts
    577
    Quote Originally Posted by zir_blazer View Post
    Currently, Athlon II X2 and Sempron parts based on Regor (Rev. DA-C2 or DA-C3) have 1 MB of L2 cache. It never made much sense that the cheapest parts had more L2 cache than the bigger, more expensive ones with "only" 512 KB, but they do. Llano would standardize that size.
    That is because the higher-end AMD chips have an L3 cache, and a lot more of it than the 1 MB of L2 amounts to.
    --Intel i5 3570k 4.4ghz (stock volts) - Corsair H100 - 6970 UL XFX 2GB - - Asrock Z77 Professional - 16GB Gskill 1866mhz - 2x90GB Agility 3 - WD640GB - 2xWD320GB - 2TB Samsung Spinpoint F4 - Audigy-- --NZXT Phantom - Samsung SATA DVD--(old systems Intel E8400 Wolfdale/Asus P45, AMD965BEC3 790X, Antec 180, Sapphire 4870 X2 (dead twice))

  17. #17
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by terrace215 View Post
    The comparison we want is 256 bit sum[A1 + A2 + ... + A_n] (to use your example) on 256 bit hardware vs 128 bit hardware.

    Say n = 16, just for grins. What's the # clock cycles needed in each case?
    This particular example is of course an ideal case for 256 bit hardware,
    and here Sandy Bridge's 48 byte/cycle versus Llano's 32 byte/cycle
    L1 cache bandwidth will determine the throughput.

    (Note that in cases like this there is no advantage from HT for Sandy
    Bridge, since a single thread already utilizes 100% of the resources.)
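
    The kind of loop being discussed would look roughly like this (a hypothetical illustration, not a benchmark from this thread; the function name is made up). Every 256 bit add consumes one 32 byte load, so once a couple of independent accumulators hide the add latency, the L1 load bandwidth (32 byte/cycle versus 48 byte/cycle) is the only limit:

    Code:
    #include <stddef.h>
    #include <immintrin.h>

    /* Running sum over packed floats with 256 bit AVX ops.
       Assumes n is a multiple of 16; compile with AVX enabled (-mavx). */
    float sum256(const float *a, size_t n)
    {
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();   /* second accumulator hides add latency */
        for (size_t i = 0; i < n; i += 16) {
            acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(a + i));      /* 32 byte load */
            acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(a + i + 8));  /* 32 byte load */
        }
        __m256 acc = _mm256_add_ps(acc0, acc1);

        /* horizontal reduction of the 8 lanes */
        __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                              _mm256_extractf128_ps(acc, 1));
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        return _mm_cvtss_f32(s);
    }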


    Regards, Hans

  18. #18
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Quote Originally Posted by Hans de Vries View Post
    This particular example is of course an ideal case for 256 bit hardware
    and now Sandy Bridge's 48 byte/cycle versus Llano's 32 byte/cycle
    L1 cache bandwidth will determine the throughput.
    Regards, Hans
    Well, it looks like 32 byte/cycle is enough to keep up (assuming we need 1 (or 2, for the 128-bit hardware) cycle to accumulate the new 256 bit value into the running sum anyhow), so wouldn't it be roughly 2:1 in this case?
    Last edited by terrace215; 04-22-2010 at 06:42 PM.

  19. #19
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by terrace215 View Post
    Well, it looks like 32 byte/cycle is enough to keep up (assuming we need 1 (or 2, for the 128-bit hardware) cycle to accumulate the new 256 bit value into the running sum anyhow), so wouldn't it be roughly 2:1 in this case?
    OK, it seems you have found such an ideal case. It's not unlikely that this
    case will be used as a synthetic benchmark ...

    Regards, Hans.

  20. #20
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Quote Originally Posted by Hans de Vries View Post
    OK, it seems you have found such an ideal case.
    Hooray! I must give full credit though -- it was YOUR example, after all; I just had to compare the 256bit hardware implementation and not let you get away with obscuring the difference through the initial latency.
    Last edited by terrace215; 04-22-2010 at 07:24 PM.

  21. #21
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    awesome job hans! thx for sharing!

    Quote Originally Posted by Hans de Vries View Post
    -It would be faster even if it's still using 128 bit hardware for the 256 bit
    operations since typically many time slots are unused in FP units.
    does that mean an fpu boost even for x87 and sse code? sounds like it... any idea how much faster? 10%?

    Quote Originally Posted by Hans de Vries View Post
    The second level TLB units for the data cache have been doubled from
    512 entries to 1024 entries.
    higher virtualization perf?

    Quote Originally Posted by Hans de Vries View Post
    There is extra integer logic. A good guess would be a faster version
    of the Integer divider. One that can produce multiple result bits/cycle
    like the ones in the Core2 and Nehalem architecture.
    that would be nice! a preview of what's to come in bulldozer?

    Quote Originally Posted by Hans de Vries View Post
    (Note that in this kind of cases there is no advantage from HT for Sandy
    Bridge since a single thread already utilizes 100% of the resources)
    Regards, Hans
    hmmmm really? i didn't know that...
    hmmm do you remember when people started talking about reverse hyper threading? intel can split the fpu, ie hyper threading... amd is going to use one fpu for 2 integer cores... this is what people could have interpreted or misunderstood as reverse hyper threading, right?

    does anybody know how much work needs to be done to offload fpu code like avx to the gpu cores? any idea?

    Quote Originally Posted by terrace215 View Post
    Hooray! I must give full credit though-- it was YOUR example, after all, I just had to compare the 256bit hardware implementation, and not let you get away with obscuring the difference through the initial latency
    you come off extremely rude... just an fyi...

  22. #22
    Xtreme Member
    Join Date
    Aug 2007
    Posts
    282
    If Llano can execute AVX instructions it will be fantastic... but it would be even better if you could use the GPU "wavefronts"/"warps" as SIMD registers (64/32 floats vs 8 for AVX). After all, isn't that a real "fusion"?

    Btw, I really think new processors should remove the old functionality like the defunct 3dnow!, EMMS/x87, etc...
    Last edited by jogshy; 04-22-2010 at 09:14 PM.

  23. #23
    Banned
    Join Date
    Jul 2004
    Posts
    1,125
    Quote Originally Posted by saaya View Post
    you come off extremely rude... just an fyi...
    That's okay. Judging by your S|A article, you come off extremely uninformed... just an fyi...

  24. #24
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by Hans de Vries View Post
    A few observations suggest that AMD's Llano could do AVX instructions.
    Don't you think that such predictions are a little... childish? To me, it's like trying to predict the hair color of an unborn child based on a picture of its DNA without a complete understanding of the purpose of each molecule in it.
    Do you think you know enough about the internal K10.5 microarchitecture to argue that the 3-operand instructions will not require major modifications of the entire "front-end" and "back-end"?
    Don't you think you're underestimating the complexity of an AVX implementation? As an example, Intel chose not to implement even a microcoded version of FMA in Sandy Bridge, and I think it's not because Intel is lazy.

  25. #25
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by saaya View Post
    does anybody know how much work needs to be done to offload fpu code like avx to the gpu cores? any idea?
    Wow... It is a multi-billion $$$ question. For the last 10 years the world's best minds have been trying to find an answer... without much success yet.
