
Thread: Enter the Dragon: AMD AM3 6-core desktop arriving in 2010


  1. #1
    Registered User
    Join Date
    Apr 2008
    Posts
    23
    Quote Originally Posted by ajaidev View Post
    Nah, he is right, it's inclusive. Both those architectures are inclusive; Nehalem and its shrink are exclusive cache.
    No it's not. Barcelona/Shanghai's L3 is pseudo-exclusive (if you will): cache lines that are shared or contain code may be duplicated (instead of being evicted on a fetch from L3 they are retained, thus breaking the exclusivity). Nehalem's L3 is inclusive; read Ronak's presentation about it.
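
    To make the distinction concrete, here is a minimal toy sketch of the two fill policies on an L3 hit. This is purely illustrative; the real policy logic has never been disclosed, and every name below is made up:

        #include <stdbool.h>

        /* Toy model of one L3 cache line. All names are hypothetical;
           the real fill policy is far more involved. */
        struct l3_line {
            bool valid;
            bool shared;   /* line is cached by more than one core */
            bool is_code;  /* line holds instructions, not data    */
        };

        /* On an L3 hit the line is forwarded up to the requesting
           core's L2/L1; the question is what happens to the L3 copy. */
        void on_l3_hit(struct l3_line *line)
        {
            /* Strictly exclusive: always invalidate the L3 copy, so a
               line lives in exactly one cache level at a time.

               Pseudo-exclusive (as described for Barcelona/Shanghai):
               keep lines that are shared between cores or that contain
               code, duplicating them in the upper levels and thereby
               breaking strict exclusivity. Evict everything else. */
            if (line->shared || line->is_code)
                return;            /* keep the L3 copy */
            line->valid = false;   /* evict it, preserving exclusivity */
        }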

    Higher associativity means longer snoop time, thus higher latency; there's no such thing as a free lunch in chip design. The prefetcher has merit mostly for desktop workloads, where datasets are small and, with good prefetching, you could in theory avoid fetching from RAM (which is still lol-slow even with IMCs). For large datasets it may be detrimental, since it's more likely to increase cache thrashing. Barcelona/Shanghai also include prefetchers, albeit less apt ones than Intel's.

  2. #2
    Xtreme Mentor
    Join Date
    Jul 2008
    Location
    Shimla, India
    Posts
    2,631
    Quote Originally Posted by Morgoth Bauglir View Post
    No it's not. Barcelona/Shanghai's L3 is pseudo-exclusive (if you will): cache lines that are shared or contain code may be duplicated (instead of being evicted on a fetch from L3 they are retained, thus breaking the exclusivity). Nehalem's L3 is inclusive; read Ronak's presentation about it.

    Higher associativity means longer snoop time, thus higher latency; there's no such thing as a free lunch in chip design. The prefetcher has merit mostly for desktop workloads, where datasets are small and, with good prefetching, you could in theory avoid fetching from RAM (which is still lol-slow even with IMCs). For large datasets it may be detrimental, since it's more likely to increase cache thrashing. Barcelona/Shanghai also include prefetchers, albeit less apt ones than Intel's.
    The reason I said Barcelona/Shanghai's L3 acts like an inclusive cache is that it keeps a duplicate of data likely to be accessed by multiple cores (hence the sharing between all cores via the L3). If it were strictly exclusive, the data would be sent to L1 only, then copied to L2 and later to L3. As you said, it's pseudo-exclusive; that means exceptions, and this is one of them.


    You are right about Nehalem; I forgot about the reworked memory system.

    With prefetching it's all about the size, isn't it? Nehalem has smaller but faster L2s compared to K10/K10.5, which means whatever doesn't fit in the L2 goes to the L3. AMD's L2, even at 16-way, is faster to prefetch from than Nehalem's L3. When working with large data chunks, Nehalem may use the L3 more than the L2; on top of that, the associativity that the L2 grants is no laughing matter. Take a look at Sandy Bridge and you will find the answer to the large-data-chunk problem. To me Sandy Bridge looks like a prefetching monster.
    Coming Soon

  3. #3
    Xtreme Addict
    Join Date
    Aug 2004
    Location
    Austin, TX
    Posts
    1,346
    Quote Originally Posted by gosh View Post
    And that was what I wanted you to explain: how does prefetching get more hits in the cache without slowing down the work that is done?
    Quote Originally Posted by ajaidev View Post
    With prefetching it's all about the size, isn't it? Nehalem has smaller but faster L2s compared to K10/K10.5, which means whatever doesn't fit in the L2 goes to the L3. AMD's L2, even at 16-way, is faster to prefetch from than Nehalem's L3. When working with large data chunks, Nehalem may use the L3 more than the L2; on top of that, the associativity that the L2 grants is no laughing matter. Take a look at Sandy Bridge and you will find the answer to the large-data-chunk problem. To me Sandy Bridge looks like a prefetching monster.
    OK sorry ajaidev, but I think you don't really understand prefetching.

    Prefetching is a method of predicting which data the CPU will request in the future based on its access patterns in the past. Simple prefetching mechanisms are next-line (prefetch the sequential piece of memory after a load/store request), stride-based (determine whether there's a constant stride (e.g. 0, 4, 8, 12, 16, ...) and prefetch based on that stride), and target-based (keep track of branches that cause misses).
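
    To make the stride-based mechanism concrete, here is a minimal sketch of one stride-detection entry. This is a textbook simplification, not any shipping design; real hardware keeps a table of these entries, typically indexed by the address of the load instruction:

        #include <stdint.h>

        /* One entry of a toy stride prefetcher (hypothetical). */
        struct stride_entry {
            uint64_t last_addr;  /* address of the previous access       */
            int64_t  stride;     /* last observed delta between accesses */
            int      confidence; /* how many times the stride repeated   */
        };

        /* Called on each access the entry tracks. Returns the predicted
           next address, or 0 while the pattern is still unconfirmed. */
        uint64_t stride_predict(struct stride_entry *e, uint64_t addr)
        {
            int64_t delta = (int64_t)(addr - e->last_addr);

            if (delta != 0 && delta == e->stride)
                e->confidence++;   /* same stride again: 0, 4, 8, 12, ... */
            else {
                e->stride = delta; /* new pattern: start re-learning */
                e->confidence = 0;
            }
            e->last_addr = addr;

            /* Only issue a prefetch once the stride has repeated enough;
               this confirmation step is what keeps one bad guess from
               polluting the cache. */
            return (e->confidence >= 2) ? addr + e->stride : 0;
        }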

    Generally speaking, prefetch requests are the lowest-priority accesses to memory, meaning they will not "slow down the work that is done". However, there can be cases where a too-aggressive prefetching mechanism evicts cache lines that will be used in the future. For most applications this problem is avoided through application profiling and tuning of the prefetching algorithm.

    As a general rule of thumb, prefetching can improve the performance of most applications regardless of the amount of bandwidth available, even with "low latency" DDR. Remember that accessing "low latency" DDR3 is still a ~50-70ns round trip, which translates to at least 150 cycles at 3GHz. Compare this to even the slowest 30-cycle L3. The fact that prefetching is even done from L3 -> L2 or L2 -> L1 (moving items from lower-level caches into higher-level ones), which saves a "mere" 10 or so cycles, should clue you in on how important it is.
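
    The same latency-hiding idea is even exposed to software. Here is a minimal sketch using GCC's __builtin_prefetch; the 16-element lookahead is an arbitrary, workload-dependent guess, and for a simple sequential walk like this the hardware next-line prefetcher would usually cover it anyway -- explicit prefetches earn their keep on irregular access patterns the hardware cannot predict:

        #include <stddef.h>

        /* Sum an array while asking the hardware to start fetching the
           cache line we'll need a little later (GCC/Clang builtin). */
        long sum_with_prefetch(const long *a, size_t n)
        {
            long total = 0;
            for (size_t i = 0; i < n; i++) {
                if (i + 16 < n)  /* lookahead distance: a tuning guess */
                    __builtin_prefetch(&a[i + 16], 0 /* read */, 3 /* high locality */);
                total += a[i];
            }
            return total;
        }
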
    Last edited by Shadowmage; 09-03-2009 at 11:25 AM.
    oh man

  4. #4
    Xtreme Mentor
    Join Date
    Jul 2008
    Location
    Shimla, India
    Posts
    2,631
    Quote Originally Posted by Shadowmage View Post
    OK sorry ajaidev, but I think you don't really understand prefetching.

    Prefetching is a method of predicting which data the CPU will request in the future based on its access patterns in the past. Simple prefetching mechanisms are next-line (prefetch the sequential piece of memory after a load/store request), stride-based (determine whether there's a constant stride (e.g. 0, 4, 8, 12, 16, ...) and prefetch based on that stride), and target-based (keep track of branches that cause misses).

    Generally speaking, prefetch requests are the lowest-priority accesses to memory, meaning they will not "slow down the work that is done". However, there can be cases where a too-aggressive prefetching mechanism evicts cache lines that will be used in the future. For most applications this problem is avoided through application profiling and tuning of the prefetching algorithm.

    As a general rule of thumb, prefetching can improve the performance of most applications regardless of the amount of bandwidth available, even with "low latency" DDR. Remember that accessing "low latency" DDR3 is still a ~50-70ns round trip, which translates to at least 150 cycles at 3GHz. Compare this to even the slowest 30-cycle L3. The fact that prefetching is even done from L3 -> L2 or L2 -> L1 (moving items from lower-level caches into higher-level ones), which saves a "mere" 10 or so cycles, should clue you in on how important it is.
    Actually, that's what I said: I was talking about the time delay it would take for the Nehalem architecture to prefetch from the L3 ("L3->L2 and then L2->L1") compared to K10's L2 ("L2->L1").

    I never said memory prefetching is better than cache prefetching!!!

    Quote Originally Posted by Smartidiot89 View Post
    2.8-3.0GHz seems reasonable on 45nm within a 125W TDP envelope. Remember that AMD will release a new stepping later this year, so I think 125W will be possible at such "high" clocks.

    There are also rumors of 32nm desktop parts based on K10.5 but I have no clue if those are true or not...

    Even a 3GHz 6-core with a 140W TDP is great in my head. By 32nm desktop parts do you mean the AMD Llano? If AMD does release a K10.5 core at 32nm in 2011 they are out of their minds; Sandy Bridge will totally kill them.
    Last edited by ajaidev; 09-03-2009 at 12:48 PM.
    Coming Soon

  5. #5
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by gosh View Post
    And that was what I wanted you to explain: how does prefetching get more hits in the cache without slowing down the work that is done?
    Sorry, I have no info about the exact implementation of Intel's prefetchers.
    Quote Originally Posted by ajaidev View Post
    Actually, that's what I said: I was talking about the time delay it would take for the Nehalem architecture to prefetch from the L3 ("L3->L2 and then L2->L1") compared to K10's L2 ("L2->L1").
    In Nehalem, data/instructions are prefetched directly into the L1 or L2 cache depending on the detected scenario. In fact, Nehalem implements two prefetchers for the L1 and two prefetchers for the L2, covering different scenarios. Intel also stated that Nehalem's prefetchers are more effective and no longer hurt performance in inconvenient scenarios (as they did in Core 2).
    Last edited by kl0012; 09-03-2009 at 02:22 PM.

  6. #6
    Xtreme Mentor
    Join Date
    Mar 2006
    Posts
    2,978
    Quote Originally Posted by kl0012 View Post
    Sorry, I have no info about the exact implementation of Intel's prefetchers.
    He is goading you... you will not find that kind of information, nor does he know it; caching policy and algorithms have never been disclosed to any level of detail that would satisfy a good answer to his question.

    What he is trying to get you to admit is that for sparse memory locality, large cache line fetches pollute the cache -- and that Intel's prefetching sux rocks.

    In some cases, as you are obviously aware, HW-based prefetching acts as a detriment rather than a benefit. However, his logic -- through generalization -- would then make one wonder why Intel and AMD are so stupid as to even use HW prefetching instead of trusting the OS or the app itself to manage the cache policy... go figure. I am certain AMD and Intel love wasting transistor budget implementing something that is completely useless (of course, I am being facetious).

    The truth of the matter is, HW prefetching over sparse memory locality is much more of a concern in large-memory-footprint applications, such as transactional database software. For us mere mortals with single-socket desktops, HW prefetching almost always provides a benefit, though it would be fun if BIOS writers would give us an option to disable HWP as they do on servers.
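
    As an aside, on Linux the prefetchers can in principle be toggled from software through the msr driver (modprobe msr, run as root). The MSR number and bit layout below (0x1A4, disable bits in bits 0-3) follow Intel's later public disclosure for Nehalem-class parts; treat them as an assumption, verify for your exact CPU, and poke MSRs entirely at your own risk:

        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <unistd.h>

        /* Sketch: disable the four HW prefetchers on CPU 0 by setting
           bits 0-3 of MSR 0x1A4 via Linux's /dev/cpu/N/msr interface. */
        int main(void)
        {
            int fd = open("/dev/cpu/0/msr", O_RDWR);
            if (fd < 0) { perror("open msr"); return 1; }

            uint64_t val;
            if (pread(fd, &val, sizeof val, 0x1a4) != sizeof val)
                { perror("read msr"); return 1; }
            val |= 0xf;  /* 1 = prefetcher disabled, one bit per prefetcher */
            if (pwrite(fd, &val, sizeof val, 0x1a4) != sizeof val)
                { perror("write msr"); return 1; }

            printf("MSR 0x1a4 is now 0x%llx\n", (unsigned long long)val);
            close(fd);
            return 0;
        }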

    In Nehalem, data/instructions are prefetched directly into the L1 or L2 cache depending on the detected scenario. In fact, Nehalem implements two prefetchers for the L1 and two prefetchers for the L2, covering different scenarios. Intel also stated that Nehalem's prefetchers are more effective and no longer hurt performance in inconvenient scenarios (as they did in Core 2).
    A little googling produced some (though generalized) information concerning Nehalem -- http://www.scribd.com/doc/15507330/I...-Architecture- (page 75).

    Nehalem did redo the prefetchers relative to Merom/Penryn. The most interesting statement is that they removed the need to disable HWP in the Nehalem revision, leading me to believe they have implemented algorithms that detect sparse fetching patterns and shut the prefetchers down as needed. Who knows; speculation on my part.

    There are hundreds of patents in the patent database around prefetching and cache policy; maybe someone with some time can do an exhaustive review to see if there are details on how each player implemented their prefetch mechanisms, but what you won't find is any public disclosure. I thought Shadowmage above did a nice job of condensing an explanation of this concept.

    Generally speaking, prefetch requests are the lowest-priority accesses to memory, meaning they will not "slow down the work that is done". However, there can be cases where a too-aggressive prefetching mechanism evicts cache lines that will be used in the future. For most applications this problem is avoided through application profiling and tuning of the prefetching algorithm.
    This is a true statement. Another Google search yields some good papers on the topic:
    http://www.iolanguage.com/Library/Pa.../LECTURE13.pdf
    http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
    Last edited by JumpingJack; 09-03-2009 at 09:11 PM.
    One hundred years from now it won't matter
    what kind of car I drove, what kind of house I lived in,
    how much money I had in the bank, nor what my clothes looked like... but the world may be a little better because I was important in the life of a child.
    -- from "Within My Power" by Forest Witcraft

  7. #7
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by gosh View Post
    prefetch:

    Reading one byte will not read one byte from memory; it reads a whole cache line (64 bytes).
    If the CPU guesses what's next, will it load the next cache line (another 64 bytes)? That is going to thrash the cache very fast.
    You have a very narrow view of the prefetch mechanism. You're missing that prefetchers don't prefetch data all the time, but only when a specific scenario (such as reading sequential data) has been detected; then there is an excellent chance that the prefetched line will be required later on. Indeed, a prefetcher can make mistakes, but after all we don't want to get rid of caches (despite the high latency on a cache miss) or branch predictors (despite the pipeline flush on a wrong prediction) either. Besides, there are many different techniques to avoid unwanted side effects (such as cache thrashing).
    Your view of the cache architecture seems narrow to me too. Higher associativity leads to higher latency, so you cannot conclude that more associativity is always better. There is no clear winner here.
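
    To see where that latency comes from, here is a toy lookup for a set-associative cache. The geometry below (512 KB, 64-byte lines, 16 ways, hence 512 sets) is just an example picked to resemble K10's L2; nothing here is a real implementation:

        #include <stdbool.h>
        #include <stdint.h>

        #define WAYS      16    /* associativity: 16-way, as in K10's L2 */
        #define SETS      512   /* 512 KB / (64 B x 16 ways)             */
        #define LINE_BITS 6     /* 64-byte lines                         */

        struct way { bool valid; uint64_t tag; };
        struct set { struct way ways[WAYS]; };

        /* Whether an address hits. The loop makes the cost visible:
           every additional way is one more tag the hardware must
           compare in parallel (more comparators, wider way muxes),
           which is why raising associativity tends to raise latency. */
        bool lookup(const struct set *cache, uint64_t addr)
        {
            uint64_t set_idx = (addr >> LINE_BITS) % SETS;
            uint64_t tag     = addr >> LINE_BITS;  /* simplified: full tag */

            for (int w = 0; w < WAYS; w++)
                if (cache[set_idx].ways[w].valid &&
                    cache[set_idx].ways[w].tag == tag)
                    return true;
            return false;
        }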

    Quote Originally Posted by JumpingJack View Post
    He is goading you... you will not find that kind of information, nor does he know it; caching policy and algorithms have never been disclosed to any level of detail that would satisfy a good answer to his question.

    What he is trying to get you to admit is that for sparse memory locality, large cache line fetches pollute the cache -- and that Intel's prefetching sux rocks.
    Yeah, now I see it.

  8. #8
    Xtreme Enthusiast
    Join Date
    May 2008
    Posts
    612
    prefetch:

    Reading one byte will not read one byte from memory; it reads a whole cache line (64 bytes).
    If the CPU guesses what's next, will it load the next cache line (another 64 bytes)? That is going to thrash the cache very fast.
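
    The 64-byte granularity is easy to see from software, for what it's worth: touching one byte out of every 64 still forces a fetch of every line, so it costs nearly as much as reading the whole buffer. A minimal sketch to experiment with (timings, and the 64-byte line size itself, are machine-dependent):

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define N (64 * 1024 * 1024)  /* 64 MB: far larger than any cache */

        int main(void)
        {
            unsigned char *buf = malloc(N);
            if (!buf) return 1;

            /* Touch one byte per 64-byte line. Memory traffic is per
               cache line, not per byte, so this streams the entire
               64 MB through the cache despite reading only 1/64 of it. */
            volatile unsigned char sink = 0;
            clock_t t0 = clock();
            for (size_t i = 0; i < N; i += 64)
                sink += buf[i];
            clock_t t1 = clock();

            printf("one byte per line: %.3f s\n",
                   (double)(t1 - t0) / CLOCKS_PER_SEC);
            free(buf);
            return 0;
        }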

  9. #9
    Xtreme Addict
    Join Date
    Dec 2008
    Location
    Sweden, Linköping
    Posts
    2,034
    Quote Originally Posted by ajaidev View Post
    Even a 3GHz 6-core with a 140W TDP is great in my head. By 32nm desktop parts do you mean the AMD Llano? If AMD does release a K10.5 core at 32nm in 2011 they are out of their minds; Sandy Bridge will totally kill them.
    By 32nm desktop parts I simply mean a shrunk Phenom II... Were Llano K10.5-based, I don't see Sandy Bridge killing it; a lot of tweaks can be done to the architecture. Llano is focused on a market that wants power efficiency, so I'd be careful making such a statement until we know anything about either Sandy Bridge or Llano.

    Llano won't be a performance monster; it is aimed as a cheap product at a much wider market, and some clues point towards features such as Z-RAM, for example... Llano "Fusion" is AMD's path towards the same goal Intel has with its Atom architecture: System on a Chip, small and power-efficient enough to be used in cellphones and "true laptops with great battery life". The first Llano parts, though, will be aimed at the laptop market and possibly cheap desktops as well, since they will consume too much for things such as cellphones.

    Bulldozer will most likely be released around the same time as Sandy Bridge.
    Last edited by Smartidiot89; 09-03-2009 at 10:20 PM.
    SweClockers.com

    CPU: Phenom II X4 955BE
    Clock: 4200MHz 1.4375v
    Memory: Dominator GT 2x2GB 1600MHz 6-6-6-20 1.65v
    Motherboard: ASUS Crosshair IV Formula
    GPU: HD 5770

  10. #10
    Xtreme Mentor
    Join Date
    Mar 2006
    Posts
    2,978
    Quote Originally Posted by Smartidiot89 View Post

    Bulldozer will most likely be released around the same time as Sandy Bridge.
    Now there will be two fun CPUs to dissect; I am looking forward to getting one of each.
    One hundred years from now it won't matter
    what kind of car I drove, what kind of house I lived in,
    how much money I had in the bank, nor what my clothes looked like... but the world may be a little better because I was important in the life of a child.
    -- from "Within My Power" by Forest Witcraft
