OK, stepping away from the cache arguments that I get so lost in (I'm not that CPU savvy).
What kind of clocks can the desktop market see? Ignoring high TDP, what kind of stable clocks should be possible? There isn't much Istanbul OCing to go on, but should these clock just as well as a Phenom II with an extra two cores?
Nah, he is right, it's inclusive. Both those architectures are inclusive; Nehalem and its shrink are exclusive cache.
But the sad part is that higher associativity has some side effects. The added size of AMD's L2 slows the process down a bit, but since its associativity is about half, AMD's L2 would win that way.
Last edited by ajaidev; 09-03-2009 at 07:19 AM.
Coming Soon
This inclusive and exclusive thing:
Exclusive won't let other cores copy the data in L3 once a core has it and has loaded it into L2 or straight into L1.
Inclusive means more than one core can copy the L3 data, loading it into L2 or straight into L1.
I'm pretty sure Nehalem was inclusive and Barcelona was exclusive, as its L1 cache isn't 4-way.
Last edited by demonkevy666; 09-03-2009 at 07:44 AM.
No it's not. Barcelona/Shanghai's L3 is pseudo exclusive (if you will), with cachelines that are shared/contain code being possibly duplicated (instead of being evicted on a fetch from L3 they're maintained, thus breaking the exclusiveness). Nehalem's L3 is inclusive, read Ronak's presentation about it.
Higher associativity means longer snoop time, thus higher latency. There's no such thing as a free lunch in chip design. The prefetcher has merit mostly for desktop workloads, where datasets are small and with good prefetching you could, in theory, avoid fetching from RAM (which is still lol-slow even with IMCs). For large datasets it may be detrimental, since it's more likely to increase cache thrashing. Barcelona/Shanghai include prefetchers too, albeit less capable ones compared to Intel's.
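To make the inclusive/exclusive distinction above concrete, here's a toy Python sketch of the two fill policies. This is purely illustrative; real L3 policies (like Barcelona's pseudo-exclusive one that keeps shared lines) are far more involved.

```python
# Toy model of inclusive vs. exclusive last-level cache fill policy.
# Caches are modeled as plain sets of line addresses -- illustration only.

def fill_inclusive(l2, l3, line):
    """On a fetch, the line is installed in L2 and L3 keeps its copy."""
    l2.add(line)
    l3.add(line)

def fill_exclusive(l2, l3, line):
    """On an L3 hit, the line moves up to L2 and is evicted from L3
    (no duplication between levels)."""
    l2.add(line)
    l3.discard(line)

# Inclusive: after the fetch the line lives in both levels.
l2_i, l3_i = set(), {0xA0}
fill_inclusive(l2_i, l3_i, 0xA0)

# Exclusive: the line lives only in L2; the L3 slot is freed.
l2_e, l3_e = set(), {0xA0}
fill_exclusive(l2_e, l3_e, 0xA0)
```

The upside of exclusive is effective capacity (no duplicates); the upside of inclusive is simpler snooping, since the L3 alone can answer whether any core holds a line.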
If you run one single application, then yes, this is not the optimal solution. If you run many applications, you start to get better speed as the load increases. I think AMD has tried to construct the CPU for stable performance rather than peak performance, though if the load changes it can degrade faster.
The reason I said Barcelona/Shanghai's L3 acts as an inclusive cache is that it keeps a duplicate of the data if that data is likely being accessed by multiple cores (thus the sharing between all cores via L3); if it were strictly exclusive, the data would be sent to L1 only, to be copied to L2 and then L3 on eviction. As you said, it's pseudo-exclusive; that means exceptions, and this is one of them.
You are right about Nehalem; I forgot about the reworked memory system.
In prefetching it's all about size, isn't it? Nehalem has small but fast L2s compared to K10/K10.5, which means whatever doesn't fit in L2 goes to L3. AMD's L2, even at 16-way, is faster for prefetching than the L3 in Nehalem. When working with large data chunks, Nehalem may use L3 more than L2. On top of that, the associativity that the L2 grants is no laughing matter. Take a look at Sandy Bridge and you will find the answer to the large-data-chunk problem; to me Sandy Bridge looks like a prefetching monster.
overall, AMD rullez :-)
ROG Power PCs - Intel and AMD
CPUs: i9-7900X, i9-9900K, i7-6950X, i7-5960X, i7-8086K, i7-8700K, 4x i7-7700K, i3-7350K, 2x i7-6700K, i5-6600K, R7-2700X, 4x R5 2600X, R5 2400G, R3 1200, R7-1800X, R7-1700X, 3x AMD FX-9590, 1x AMD FX-9370, 4x AMD FX-8350, 1x AMD FX-8320, 1x AMD FX-8300, 2x AMD FX-6300, 2x AMD FX-4300, 3x AMD FX-8150, 2x AMD FX-8120 125 and 95W, AMD X2 555 BE, AMD X4 965 BE C2 and C3, AMD X4 970 BE, AMD X4 975 BE, AMD X4 980 BE, AMD X6 1090T BE, AMD X6 1100T BE, A10-7870K, Athlon 845, Athlon 860K, AMD A10-7850K, AMD A10-6800K, A8-6600K, 2x AMD A10-5800K, AMD A10-5600K, AMD A8-3850, AMD A8-3870K, 2x AMD A64 3000+, AMD 64+ X2 4600+ EE, Intel i7-980X, Intel i7-2600K, Intel i7-3770K, 2x i7-4770K, Intel i7-3930K. AMD Cinebench R10 challenge | AMD Cinebench R15 thread | Intel Cinebench R15 thread
OK sorry ajaidev, but I think you don't really understand prefetching.
Prefetching is a method to predict which data the CPU will request in the future based on its access patterns in the past. Simple prefetching mechanisms are next-line (prefetch the sequential piece of memory after a load/store request), stride-based (determine if there's a constant stride (e.g. 0, 4, 8, 12, 16, ...) and prefetch based off the stride), and target-based (keep track of branches that cause misses).
Generally speaking, prefetch requests are the lowest priority accesses to memory, meaning that it will not "slow down the work that is done". However, there can be cases where a too-aggressive prefetching mechanism can evict cache lines that will be used in the future. For most applications, this problem is avoided through application profiling and tuning of the prefetching algorithm.
As a general rule of thumb, prefetching can improve the performance of most applications regardless of the amount of bandwidth available and even with "low latency" DDR. Remember that accessing "low latency" DDR3 is still ~50-70ns round trip, which translates to at least 150 cycles at 3GHz. Compare this to even the slowest 30 cycle L3. The fact that prefetching is even done from L3 -> L2 or L2 -> L1 (move items from lower level caches to higher level caches), which saves a "mere" 10 or so cycles, should clue you in on how important it is.
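A stride-based prefetcher of the kind described above can be sketched in a few lines of Python. This is purely illustrative; real hardware designs are far more elaborate (per-PC tables, confidence counters, and so on).

```python
# Minimal stride prefetcher sketch: watch consecutive demand addresses,
# and once the same stride is seen twice in a row, predict the next one.

def stride_prefetch(addresses):
    """Return the list of predicted prefetch addresses for an access stream."""
    predictions = []
    last, stride = None, None
    for addr in addresses:
        if last is not None:
            new_stride = addr - last
            if new_stride == stride:            # stride confirmed -> prefetch ahead
                predictions.append(addr + stride)
            stride = new_stride
        last = addr
    return predictions

# A constant-stride stream (0, 4, 8, 12, 16) trains the detector quickly:
print(stride_prefetch([0, 4, 8, 12, 16]))   # -> [12, 16, 20]
```

An irregular stream like `[0, 7, 3]` never confirms a stride, so nothing is prefetched; that is exactly the "detect a scenario first" behavior discussed later in the thread.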
Last edited by Shadowmage; 09-03-2009 at 11:25 AM.
oh man
enough with the cache; what kind of clocks can we expect? You know, the important details
2.8-3.0GHz seems reasonable on 45nm within a 125W TDP envelope. Remember that AMD will release a new stepping later this year, so I think 125W will be possible at such "high" clocks.
There are also rumors of 32nm desktop parts based on K10.5 but I have no clue if those are true or not...
SweClockers.com
CPU: Phenom II X4 955BE
Clock: 4200MHz 1.4375v
Memory: Dominator GT 2x2GB 1600MHz 6-6-6-20 1.65v
Motherboard: ASUS Crosshair IV Formula
GPU: HD 5770
Actually, that's what I said: I was talking about the time delay it would take for the Nehalem architecture to prefetch from L3 (L3->L2 and then L2->L1) compared to K10's L2 (L2->L1).
I never said memory prefetching is better than cache prefetching !!!
Even a 3GHz 6-core with a 140W TDP is great in my book. By 32nm desktop parts do you mean AMD Llano? If AMD does release a K10.5 core at 32nm in 2011 they are out of their minds; Sandy Bridge will totally kill them.
Last edited by ajaidev; 09-03-2009 at 12:48 PM.
Sorry, I have no info about exact implementation of Intel's prefetchers.
In Nehalem, data/instructions are prefetched directly into the L1 or L2 cache depending on the detected scenario. In fact, Nehalem implements two prefetchers for L1 and two prefetchers for L2 for different scenarios. Intel also stated that Nehalem's prefetchers are more effective and no longer hurt performance in inconvenient scenarios (as they did in Core 2).
Last edited by kl0012; 09-03-2009 at 02:22 PM.
prefetch:
Reading one byte will not read just one byte from memory; it reads a whole cache line (64 bytes).
If the CPU guesses what comes next and loads the next cache line (another 64 bytes), isn't that going to thrash the cache very fast?
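The 64-byte line granularity itself is easy to show: the line a byte belongs to is just its address with the low 6 bits cleared. A quick Python sketch (line size assumed to be 64 bytes, as on the CPUs discussed here):

```python
# Every demand access fetches a whole cache line, not a single byte.
# With 64-byte lines, the line base address has the low 6 bits cleared.
LINE = 64

def line_base(addr):
    """Base address of the cache line containing addr."""
    return addr & ~(LINE - 1)

# Reading the single byte at 0x12345 actually pulls in 0x12340..0x1237F:
print(hex(line_base(0x12345)))   # -> 0x12340
```

So a next-line prefetch brings in bytes the program may never touch; whether that pollutes the cache depends entirely on the access pattern, which is the crux of this argument.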
He is goading you... you will not find that kind of information, nor does he know it; caching policies and algorithms have never been disclosed to any level of detail that would satisfy a good answer to his question.
What he is trying to get you to admit is that for sparse memory locality, large cache-line fetches pollute the cache -- and that Intel's prefetching sux rocks.
In some cases, as you are obviously aware, HW based prefetching acts as a detriment as opposed to a benefit. However, his logic -- through generalization -- would then make one wonder why Intel and AMD are so stupid to even use HW prefetching and not trust the OS or app itself to manage the cache policy... go figure, I am certain AMD and Intel love wasting transistor budget implementing something that is completely useless (of course, I am being facetious).
The truth of the matter is, HW prefetching over sparse memory locality is much more of a concern in large-memory-footprint applications, such as transactional database software. For us mere mortals with single-socket desktops, HW prefetching almost always provides a benefit, though it would be fun if BIOS writers gave us an option to disable HWP as they do in servers.
A little googling produced some (though generalized) information concerning Nehalem -- http://www.scribd.com/doc/15507330/I...-Architecture- (page 75):
"In Nehalem data/instructions are prefetched directly into L1 or L2 cache depending on detected scenario. In fact Nehalem implements 2 prefetchers for L1 and two prefetchers for L2 for different scenarios. Intel also stated that Nehalem's prefetchers are more effective and does not hurt performance any more during inconvenient scenarios (as it was in Core 2)."
Nehalem did redo the prefetchers over Merom/Penryn; the most interesting statement is that they removed the need to disable HWP in the Nehalem revision, leading me to believe that they have implemented algorithms to detect sparse fetching patterns and shut the prefetchers down as needed. Who knows; speculation on my part.
There are hundreds of patents in the patent database around prefetching and cache policy; maybe someone with some time can do an exhaustive review to see if there are details on how each player implemented their prefetch mechanisms. What you won't find is any public disclosure. I thought Shadowmage above did a nice job of condensing an explanation of this concept.
This is a true statement:
"Generally speaking, prefetch requests are the lowest priority accesses to memory, meaning that it will not 'slow down the work that is done'. However, there can be cases where a too-aggressive prefetching mechanism can evict cache lines that will be used in the future. For most applications, this problem is avoided through application profiling and tuning of the prefetching algorithm."
Another google yields some good papers on the topic:
http://www.iolanguage.com/Library/Pa.../LECTURE13.pdf
http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
Last edited by JumpingJack; 09-03-2009 at 09:11 PM.
One hundred years from now It won't matter
What kind of car I drove What kind of house I lived in
How much money I had in the bank Nor what my cloths looked like.... But The world may be a little better Because, I was important In the life of a child.
-- from "Within My Power" by Forest Witcraft
By 32nm desktop parts I mean simply a Phenom II shrink... If Llano were K10.5-based, I don't see Sandy Bridge killing it; a lot of tweaks can be done to the architecture. Llano is focused on a market that wants power efficiency, so I'd be careful making such a statement until we know anything about either Sandy Bridge or Llano.
Llano won't be a performance monster; it is aimed as a cheap product for a much wider market, and some clues point towards features such as Z-RAM, for example... Llano "Fusion" is AMD's path towards the same goal Intel has with its Atom architecture: system-on-a-chip, small, power-efficient enough to be used in cellphones and "true laptops with great battery life". The first Llano parts, though, will be aimed at the laptop market and possibly cheap desktops as well, since they will consume too much for things such as cellphones.
Bulldozer will most likely be released about the same time as Sandy Bridge.
Last edited by Smartidiot89; 09-03-2009 at 10:20 PM.
You have a very narrow view of the prefetch mechanism. You're missing that prefetchers don't prefetch data all the time, but only when a specific scenario (such as reading sequential data) has been detected. Then there is an excellent chance that the prefetched line will be required later on. Indeed, a prefetcher can make mistakes, but after all we don't want to get rid of caches (because of the high latency on a cache miss) or branch predictors (because of the pipeline flush on a wrong prediction) either. Besides that, there are many different techniques to avoid unwanted side effects (such as cache thrashing).
Your view of the cache architecture seems narrow to me too. Higher associativity leads to higher latency, so you cannot conclude that more associativity is always better. There is no clear winner here.
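Both sides of that trade-off can be seen in a toy LRU cache simulation (Python; the geometry below is made up for illustration). Two lines that collide in a direct-mapped cache thrash it, but coexist in a 2-way cache of the same total capacity:

```python
# Toy set-associative LRU cache: count misses for a given geometry.
# 64-byte lines assumed; sets are lists ordered from LRU to MRU.

def simulate(accesses, num_sets, ways):
    """Return the miss count for an access trace of byte addresses."""
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for addr in accesses:
        s = (addr // 64) % num_sets          # set index from the line number
        if addr in sets[s]:
            sets[s].remove(addr)             # hit: refresh LRU position
        else:
            misses += 1
            if len(sets[s]) == ways:
                sets[s].pop(0)               # evict the least recently used line
        sets[s].append(addr)
    return misses

# Alternate between two lines that map to the same set (both caches hold 8 lines):
trace = [0, 64 * 8, 0, 64 * 8, 0, 64 * 8]
print(simulate(trace, num_sets=8, ways=1))   # direct-mapped: every access misses
print(simulate(trace, num_sets=4, ways=2))   # 2-way: only the two cold misses
```

The latency cost of higher associativity (more tags to compare per lookup) doesn't show up in a functional model like this, which is exactly why there's no clear winner on paper.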
Yeh, now I see it.
If you want to read more about prefetching and how it is done on CPUs, here are two good PDFs:
http://www.amd.com/us-en/assets/cont...docs/40546.pdf
http://developer.intel.com/design/pr...als/248966.pdf
Yes, I know there are some (very few) scenarios where hardware prefetch can kick in (AMD has prefetchers too), but it isn't a magic bullet that suddenly makes one CPU faster than another (Intel vs. AMD); you can't get sloppy with the L1 cache, and you need to be careful with the L2 cache.
Exactly why I am sceptical about this "X6": the time period between the last two is too long. It would have been out by ~March 2010 at the latest if it was already in the works.
That's what Xbit is saying here:
"Advanced Micro Devices is preparing a desktop processor with six processing engines, sources familiar with the company's plans revealed. The new central processing units (CPUs) will not be available this year, but are likely to boost performance of AMD's desktop platforms sometime in 2010."
"Sometime" in 2010? C'mon.
Sounds to me like a safety net (1.5 years) for Xbit, so that nobody can call them on it till 2011.
Besides, the 2P C32/C34 parts look much more attractive to me.
"FASN 12/24" anyone?
Time will tell, of course. I hope they're right, but.....