
Thread: Enter the Dragon: AMD AM3 6-core desktop arriving in 2010


  1. #1
    Registered User
    Join Date
    Apr 2008
    Posts
    23
    Quote Originally Posted by ajaidev View Post
    Nah, he is right, it's inclusive. Both those architectures are inclusive; Nehalem and its shrink are exclusive cache.
    No it's not. Barcelona/Shanghai's L3 is pseudo-exclusive (if you will): cache lines that are shared or contain code may be duplicated (instead of being evicted on a fetch from L3 they are retained, thus breaking the exclusivity). Nehalem's L3 is inclusive; read Ronak's presentation about it.
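
    To make the distinction concrete, here is a minimal toy sketch of the two fill policies on an L3 hit. This is purely illustrative; the real policy logic has never been disclosed, and every name below is made up:

        #include <stdbool.h>

        /* Toy model of one L3 cache line. All names are hypothetical;
           the real fill policy is far more involved. */
        struct l3_line {
            bool valid;
            bool shared;   /* line is cached by more than one core */
            bool is_code;  /* line holds instructions, not data    */
        };

        /* On an L3 hit the line is forwarded up to the requesting
           core's L2/L1; the question is what happens to the L3 copy. */
        void on_l3_hit(struct l3_line *line)
        {
            /* Strictly exclusive: always invalidate the L3 copy, so a
               line lives in exactly one cache level at a time.

               Pseudo-exclusive (as described for Barcelona/Shanghai):
               keep lines that are shared between cores or that contain
               code, duplicating them in the upper levels and thereby
               breaking strict exclusivity. Evict everything else. */
            if (line->shared || line->is_code)
                return;            /* keep the L3 copy */
            line->valid = false;   /* evict it, preserving exclusivity */
        }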

    Higher associativity means longer snoop time, thus higher latency; there's no such thing as a free lunch in chip design. The prefetcher has merit mostly for desktop workloads, where datasets are small and, with good prefetching, you could in theory avoid fetching from RAM (which is still lol-slow even with IMCs). For large datasets it may be detrimental, since it's more likely to increase cache thrashing. Barcelona/Shanghai also include prefetchers, albeit less apt ones than Intel's.

  2. #2
    Xtreme Mentor
    Join Date
    Jul 2008
    Location
    Shimla, India
    Posts
    2,631
    Quote Originally Posted by Morgoth Bauglir View Post
    No it's not. Barcelona/Shanghai's L3 is pseudo-exclusive (if you will): cache lines that are shared or contain code may be duplicated (instead of being evicted on a fetch from L3 they are retained, thus breaking the exclusivity). Nehalem's L3 is inclusive; read Ronak's presentation about it.

    Higher associativity means longer snoop time, thus higher latency; there's no such thing as a free lunch in chip design. The prefetcher has merit mostly for desktop workloads, where datasets are small and, with good prefetching, you could in theory avoid fetching from RAM (which is still lol-slow even with IMCs). For large datasets it may be detrimental, since it's more likely to increase cache thrashing. Barcelona/Shanghai also include prefetchers, albeit less apt ones than Intel's.
    The reason I said Barcelona/Shanghai's L3 acts like an inclusive cache is that it keeps a duplicate of data likely to be accessed by multiple cores (hence the sharing between all cores via the L3). If it were strictly exclusive, the data would be sent to L1 only, then copied to L2 and later to L3. As you said, it's pseudo-exclusive; that means exceptions, and this is one of them.


    You are right about Nehalem; I forgot about the reworked memory system.

    With prefetching it's all about the size, isn't it? Nehalem has smaller but faster L2s compared to K10/K10.5, which means whatever doesn't fit in the L2 goes to the L3. AMD's L2, even at 16-way, is faster to prefetch from than Nehalem's L3. When working with large data chunks, Nehalem may use the L3 more than the L2; on top of that, the associativity that the L2 grants is no laughing matter. Take a look at Sandy Bridge and you will find the answer to the large-data-chunk problem. To me Sandy Bridge looks like a prefetching monster.
    Coming Soon

  3. #3
    Xtreme Addict
    Join Date
    Aug 2004
    Location
    Austin, TX
    Posts
    1,346
    Quote Originally Posted by gosh View Post
    And that was what I wanted you to explain: how does prefetching get more hits in the cache without slowing down the work that is done?
    Quote Originally Posted by ajaidev View Post
    With prefetching it's all about the size, isn't it? Nehalem has smaller but faster L2s compared to K10/K10.5, which means whatever doesn't fit in the L2 goes to the L3. AMD's L2, even at 16-way, is faster to prefetch from than Nehalem's L3. When working with large data chunks, Nehalem may use the L3 more than the L2; on top of that, the associativity that the L2 grants is no laughing matter. Take a look at Sandy Bridge and you will find the answer to the large-data-chunk problem. To me Sandy Bridge looks like a prefetching monster.
    OK sorry ajaidev, but I think you don't really understand prefetching.

    Prefetching is a method of predicting which data the CPU will request in the future based on its access patterns in the past. Simple prefetching mechanisms are next-line (prefetch the sequential piece of memory after a load/store request), stride-based (determine whether there's a constant stride (e.g. 0, 4, 8, 12, 16, ...) and prefetch based on that stride), and target-based (keep track of branches that cause misses).
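
    To make the stride-based mechanism concrete, here is a minimal sketch of one stride-detection entry. This is a textbook simplification, not any shipping design; real hardware keeps a table of these entries, typically indexed by the address of the load instruction:

        #include <stdint.h>

        /* One entry of a toy stride prefetcher (hypothetical). */
        struct stride_entry {
            uint64_t last_addr;  /* address of the previous access       */
            int64_t  stride;     /* last observed delta between accesses */
            int      confidence; /* how many times the stride repeated   */
        };

        /* Called on each access the entry tracks. Returns the predicted
           next address, or 0 while the pattern is still unconfirmed. */
        uint64_t stride_predict(struct stride_entry *e, uint64_t addr)
        {
            int64_t delta = (int64_t)(addr - e->last_addr);

            if (delta != 0 && delta == e->stride)
                e->confidence++;   /* same stride again: 0, 4, 8, 12, ... */
            else {
                e->stride = delta; /* new pattern: start re-learning */
                e->confidence = 0;
            }
            e->last_addr = addr;

            /* Only issue a prefetch once the stride has repeated enough;
               this confirmation step is what keeps one bad guess from
               polluting the cache. */
            return (e->confidence >= 2) ? addr + e->stride : 0;
        }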

    Generally speaking, prefetch requests are the lowest-priority accesses to memory, meaning they will not "slow down the work that is done". However, there can be cases where a too-aggressive prefetching mechanism evicts cache lines that will be used in the future. For most applications this problem is avoided through application profiling and tuning of the prefetching algorithm.

    As a general rule of thumb, prefetching can improve the performance of most applications regardless of the amount of bandwidth available, even with "low latency" DDR. Remember that accessing "low latency" DDR3 is still a ~50-70ns round trip, which translates to at least 150 cycles at 3GHz. Compare this to even the slowest 30-cycle L3. The fact that prefetching is even done from L3 -> L2 or L2 -> L1 (moving items from lower-level caches into higher-level ones), which saves a "mere" 10 or so cycles, should clue you in on how important it is.
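
    The same latency-hiding idea is even exposed to software. Here is a minimal sketch using GCC's __builtin_prefetch; the 16-element lookahead is an arbitrary, workload-dependent guess, and for a simple sequential walk like this the hardware next-line prefetcher would usually cover it anyway -- explicit prefetches earn their keep on irregular access patterns the hardware cannot predict:

        #include <stddef.h>

        /* Sum an array while asking the hardware to start fetching the
           cache line we'll need a little later (GCC/Clang builtin). */
        long sum_with_prefetch(const long *a, size_t n)
        {
            long total = 0;
            for (size_t i = 0; i < n; i++) {
                if (i + 16 < n)  /* lookahead distance: a tuning guess */
                    __builtin_prefetch(&a[i + 16], 0 /* read */, 3 /* high locality */);
                total += a[i];
            }
            return total;
        }
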
    Last edited by Shadowmage; 09-03-2009 at 11:25 AM.
    oh man

  4. #4
    Xtreme Mentor
    Join Date
    Jul 2008
    Location
    Shimla, India
    Posts
    2,631
    Quote Originally Posted by Shadowmage View Post
    OK sorry ajaidev, but I think you don't really understand prefetching.

    Prefetching is a method of predicting which data the CPU will request in the future based on its access patterns in the past. Simple prefetching mechanisms are next-line (prefetch the sequential piece of memory after a load/store request), stride-based (determine whether there's a constant stride (e.g. 0, 4, 8, 12, 16, ...) and prefetch based on that stride), and target-based (keep track of branches that cause misses).

    Generally speaking, prefetch requests are the lowest-priority accesses to memory, meaning they will not "slow down the work that is done". However, there can be cases where a too-aggressive prefetching mechanism evicts cache lines that will be used in the future. For most applications this problem is avoided through application profiling and tuning of the prefetching algorithm.

    As a general rule of thumb, prefetching can improve the performance of most applications regardless of the amount of bandwidth available, even with "low latency" DDR. Remember that accessing "low latency" DDR3 is still a ~50-70ns round trip, which translates to at least 150 cycles at 3GHz. Compare this to even the slowest 30-cycle L3. The fact that prefetching is even done from L3 -> L2 or L2 -> L1 (moving items from lower-level caches into higher-level ones), which saves a "mere" 10 or so cycles, should clue you in on how important it is.
    Actually, that's what I said: I was talking about the time delay it would take for the Nehalem architecture to prefetch from the L3 ("L3->L2 and then L2->L1") compared to K10's L2 ("L2->L1").

    I never said memory prefetching is better than cache prefetching!!!

    Quote Originally Posted by Smartidiot89 View Post
    2.8-3.0GHz seems reasonable on 45nm within a 125W TDP envelope. Remember that AMD will release a new stepping later this year, so I think 125W will be possible at such "high" clocks.

    There are also rumors of 32nm desktop parts based on K10.5 but I have no clue if those are true or not...

    Even a 3GHz 6-core with a 140W TDP is great in my head. By 32nm desktop parts do you mean the AMD Llano? If AMD does release a K10.5 core at 32nm in 2011 they are out of their minds; Sandy Bridge will totally kill them.
    Last edited by ajaidev; 09-03-2009 at 12:48 PM.
    Coming Soon

  5. #5
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by gosh View Post
    And that was what I wanted you to explain: how does prefetching get more hits in the cache without slowing down the work that is done?
    Sorry, I have no info about the exact implementation of Intel's prefetchers.
    Quote Originally Posted by ajaidev View Post
    Actually, that's what I said: I was talking about the time delay it would take for the Nehalem architecture to prefetch from the L3 ("L3->L2 and then L2->L1") compared to K10's L2 ("L2->L1").
    In Nehalem, data/instructions are prefetched directly into the L1 or L2 cache depending on the detected scenario. In fact, Nehalem implements two prefetchers for the L1 and two prefetchers for the L2, covering different scenarios. Intel also stated that Nehalem's prefetchers are more effective and no longer hurt performance in inconvenient scenarios (as they did in Core 2).
    Last edited by kl0012; 09-03-2009 at 02:22 PM.

  6. #6
    Xtreme Mentor
    Join Date
    Mar 2006
    Posts
    2,978
    Quote Originally Posted by kl0012 View Post
    Sorry, I have no info about the exact implementation of Intel's prefetchers.
    He is goading you... you will not find that kind of information, nor does he know it; caching policy and algorithms have never been disclosed to any level of detail that would satisfy a good answer to his question.

    What he is trying to get you to admit is that for sparse memory locality, large cache line fetches pollute the cache -- and that Intel's prefetching sux rocks.

    In some cases, as you are obviously aware, HW-based prefetching acts as a detriment rather than a benefit. However, his logic -- through generalization -- would then make one wonder why Intel and AMD are so stupid as to even use HW prefetching instead of trusting the OS or the app itself to manage the cache policy... go figure. I am certain AMD and Intel love wasting transistor budget implementing something that is completely useless (of course, I am being facetious).

    The truth of the matter is, HW prefetching over sparse memory locality is much more of a concern in large-memory-footprint applications, such as transactional database software. For us mere mortals with single-socket desktops, HW prefetching almost always provides a benefit, though it would be fun if BIOS writers would give us an option to disable HWP as they do on servers.
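
    As an aside, on Linux the prefetchers can in principle be toggled from software through the msr driver (modprobe msr, run as root). The MSR number and bit layout below (0x1A4, disable bits in bits 0-3) follow Intel's later public disclosure for Nehalem-class parts; treat them as an assumption, verify for your exact CPU, and poke MSRs entirely at your own risk:

        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <unistd.h>

        /* Sketch: disable the four HW prefetchers on CPU 0 by setting
           bits 0-3 of MSR 0x1A4 via Linux's /dev/cpu/N/msr interface. */
        int main(void)
        {
            int fd = open("/dev/cpu/0/msr", O_RDWR);
            if (fd < 0) { perror("open msr"); return 1; }

            uint64_t val;
            if (pread(fd, &val, sizeof val, 0x1a4) != sizeof val)
                { perror("read msr"); return 1; }
            val |= 0xf;  /* 1 = prefetcher disabled, one bit per prefetcher */
            if (pwrite(fd, &val, sizeof val, 0x1a4) != sizeof val)
                { perror("write msr"); return 1; }

            printf("MSR 0x1a4 is now 0x%llx\n", (unsigned long long)val);
            close(fd);
            return 0;
        }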

    In Nehalem, data/instructions are prefetched directly into the L1 or L2 cache depending on the detected scenario. In fact, Nehalem implements two prefetchers for the L1 and two prefetchers for the L2, covering different scenarios. Intel also stated that Nehalem's prefetchers are more effective and no longer hurt performance in inconvenient scenarios (as they did in Core 2).
    A little googling produced some (though generalized) information concerning Nehalem -- http://www.scribd.com/doc/15507330/I...-Architecture- (page 75).

    Nehalem did redo the prefetchers relative to Merom/Penryn. The most interesting statement is that they removed the need to disable HWP in the Nehalem revision, leading me to believe they have implemented algorithms that detect sparse fetching patterns and shut the prefetchers down as needed. Who knows; speculation on my part.

    There are hundreds of patents in the patent database around prefetching and cache policy; maybe someone with some time can do an exhaustive review to see if there are details on how each player implemented their prefetch mechanisms, but what you won't find is any public disclosure. I thought Shadowmage above did a nice job of condensing an explanation of this concept.

    Generally speaking, prefetch requests are the lowest-priority accesses to memory, meaning they will not "slow down the work that is done". However, there can be cases where a too-aggressive prefetching mechanism evicts cache lines that will be used in the future. For most applications this problem is avoided through application profiling and tuning of the prefetching algorithm.
    This is a true statement. Another Google search yields some good papers on the topic:
    http://www.iolanguage.com/Library/Pa.../LECTURE13.pdf
    http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
    Last edited by JumpingJack; 09-03-2009 at 09:11 PM.
    One hundred years from now it won't matter
    what kind of car I drove, what kind of house I lived in,
    how much money I had in the bank, nor what my clothes looked like... but the world may be a little better because I was important in the life of a child.
    -- from "Within My Power" by Forest Witcraft

  7. #7
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by gosh View Post
    prefetch:

    Reading one byte will not read one byte from memory; it reads a whole cache line (64 bytes).
    If the CPU guesses what's next, will it load the next cache line (another 64 bytes)? That is going to thrash the cache very fast.
    You have a very narrow view of the prefetch mechanism. You're missing that prefetchers don't prefetch data all the time, but only when a specific scenario (such as reading sequential data) has been detected; then there is an excellent chance that the prefetched line will be required later on. Indeed, a prefetcher can make mistakes, but after all we don't want to get rid of caches (despite the high latency on a cache miss) or branch predictors (despite the pipeline flush on a wrong prediction) either. Besides, there are many different techniques to avoid unwanted side effects (such as cache thrashing).
    Your view of the cache architecture seems narrow to me too. Higher associativity leads to higher latency, so you cannot conclude that more associativity is always better. There is no clear winner here.
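
    To see where that latency comes from, here is a toy lookup for a set-associative cache. The geometry below (512 KB, 64-byte lines, 16 ways, hence 512 sets) is just an example picked to resemble K10's L2; nothing here is a real implementation:

        #include <stdbool.h>
        #include <stdint.h>

        #define WAYS      16    /* associativity: 16-way, as in K10's L2 */
        #define SETS      512   /* 512 KB / (64 B x 16 ways)             */
        #define LINE_BITS 6     /* 64-byte lines                         */

        struct way { bool valid; uint64_t tag; };
        struct set { struct way ways[WAYS]; };

        /* Whether an address hits. The loop makes the cost visible:
           every additional way is one more tag the hardware must
           compare in parallel (more comparators, wider way muxes),
           which is why raising associativity tends to raise latency. */
        bool lookup(const struct set *cache, uint64_t addr)
        {
            uint64_t set_idx = (addr >> LINE_BITS) % SETS;
            uint64_t tag     = addr >> LINE_BITS;  /* simplified: full tag */

            for (int w = 0; w < WAYS; w++)
                if (cache[set_idx].ways[w].valid &&
                    cache[set_idx].ways[w].tag == tag)
                    return true;
            return false;
        }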

    Quote Originally Posted by JumpingJack View Post
    He is goading you... you will not find that kind of information, nor does he know it; caching policy and algorithms have never been disclosed to any level of detail that would satisfy a good answer to his question.

    What he is trying to get you to admit is that for sparse memory locality, large cache line fetches pollute the cache -- and that Intel's prefetching sux rocks.
    Yeah, now I see it.

  8. #8
    Xtreme Enthusiast
    Join Date
    May 2008
    Posts
    612
    prefetch:

    Reading one byte will not read one byte from memory; it reads a whole cache line (64 bytes).
    If the CPU guesses what's next, will it load the next cache line (another 64 bytes)? That is going to thrash the cache very fast.
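
    The 64-byte granularity is easy to see from software, for what it's worth: touching one byte out of every 64 still forces a fetch of every line, so it costs nearly as much as reading the whole buffer. A minimal sketch to experiment with (timings, and the 64-byte line size itself, are machine-dependent):

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define N (64 * 1024 * 1024)  /* 64 MB: far larger than any cache */

        int main(void)
        {
            unsigned char *buf = malloc(N);
            if (!buf) return 1;

            /* Touch one byte per 64-byte line. Memory traffic is per
               cache line, not per byte, so this streams the entire
               64 MB through the cache despite reading only 1/64 of it. */
            volatile unsigned char sink = 0;
            clock_t t0 = clock();
            for (size_t i = 0; i < N; i += 64)
                sink += buf[i];
            clock_t t1 = clock();

            printf("one byte per line: %.3f s\n",
                   (double)(t1 - t0) / CLOCKS_PER_SEC);
            free(buf);
            return 0;
        }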

  9. #9
    Xtreme Addict
    Join Date
    Dec 2008
    Location
    Sweden, Linköping
    Posts
    2,034
    Quote Originally Posted by ajaidev View Post
    Even a 3GHz 6-core with a 140W TDP is great in my head. By 32nm desktop parts do you mean the AMD Llano? If AMD does release a K10.5 core at 32nm in 2011 they are out of their minds; Sandy Bridge will totally kill them.
    By 32nm desktop parts I simply mean a shrunk Phenom II... Were Llano K10.5-based, I don't see Sandy Bridge killing it; a lot of tweaks can be done to the architecture. Llano is focused on a market that wants power efficiency, so I'd be careful making such a statement until we know anything about either Sandy Bridge or Llano.

    Llano won't be a performance monster; it is aimed as a cheap product at a much wider market, and some clues point towards features such as Z-RAM, for example... Llano "Fusion" is AMD's path towards the same goal Intel has with its Atom architecture: System on a Chip, small and power-efficient enough to be used in cellphones and "true laptops with great battery life". The first Llano parts, though, will be aimed at the laptop market and possibly cheap desktops as well, since they will consume too much for things such as cellphones.

    Bulldozer will most likely be released around the same time as Sandy Bridge.
    Last edited by Smartidiot89; 09-03-2009 at 10:20 PM.
    SweClockers.com

    CPU: Phenom II X4 955BE
    Clock: 4200MHz 1.4375v
    Memory: Dominator GT 2x2GB 1600MHz 6-6-6-20 1.65v
    Motherboard: ASUS Crosshair IV Formula
    GPU: HD 5770

  10. #10
    Xtreme Mentor
    Join Date
    Mar 2006
    Posts
    2,978
    Quote Originally Posted by Smartidiot89 View Post

    Bulldozer will most likely be released around the same time as Sandy Bridge.
    Now there will be two fun CPUs to dissect; I am looking forward to getting one of each.
    One hundred years from now it won't matter
    what kind of car I drove, what kind of house I lived in,
    how much money I had in the bank, nor what my clothes looked like... but the world may be a little better because I was important in the life of a child.
    -- from "Within My Power" by Forest Witcraft
