OK, stepping away from the cache arguments that I get so lost in (I'm not that CPU savvy).
What kind of clocks can the desktop market see? Ignoring high TDP, what kind of stable clocks should be possible? There isn't much Istanbul OCing to go on, but should these clock just as well as a Phenom II with an extra two cores?
Nah, he is right, it's inclusive. Both those architectures are inclusive; Nehalem and its shrink are exclusive cache.
But the sad part is that higher associativity has some side effects. The added size of AMD's L2 slows the process down a bit, but since its associativity is about half, AMD's L2 would win that way.
Last edited by ajaidev; 09-03-2009 at 07:19 AM.
Coming Soon
This inclusive and exclusive thing:
Exclusive won't let other cores copy the data in L3 once a core has it and has loaded it into L2 or straight into L1.
Inclusive means more than one core can copy the L3 data, loading it into L2 or straight into L1.
I'm pretty sure Nehalem was inclusive and Barcelona was exclusive, as its L1 cache isn't 4-way.
Last edited by demonkevy666; 09-03-2009 at 07:44 AM.
No it's not. Barcelona/Shanghai's L3 is pseudo exclusive (if you will), with cachelines that are shared/contain code being possibly duplicated (instead of being evicted on a fetch from L3 they're maintained, thus breaking the exclusiveness). Nehalem's L3 is inclusive, read Ronak's presentation about it.
Higher associativity means longer snoop time, thus higher latency. There's no such thing as a free lunch in chip design. The prefetcher has merit mostly for desktop workloads, where datasets are small and with good prefetching you could, in theory, avoid fetching from RAM (which is still lol-slow even with IMCs). For large datasets it may be detrimental, since it's more likely to increase cache thrashing. Barcelona/Shanghai include prefetchers too, albeit less capable ones compared to Intel's.
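To make the inclusive/exclusive distinction above concrete, here's a toy Python sketch of the two fill policies. This is purely illustrative; real L3 policies (like Barcelona's pseudo-exclusive one that keeps shared lines) are far more involved.

```python
# Toy model of inclusive vs. exclusive last-level cache fill policy.
# Caches are modeled as plain sets of line addresses -- illustration only.

def fill_inclusive(l2, l3, line):
    """On a fetch, the line is installed in L2 and L3 keeps its copy."""
    l2.add(line)
    l3.add(line)

def fill_exclusive(l2, l3, line):
    """On an L3 hit, the line moves up to L2 and is evicted from L3
    (no duplication between levels)."""
    l2.add(line)
    l3.discard(line)

# Inclusive: after the fetch the line lives in both levels.
l2_i, l3_i = set(), {0xA0}
fill_inclusive(l2_i, l3_i, 0xA0)

# Exclusive: the line lives only in L2; the L3 slot is freed.
l2_e, l3_e = set(), {0xA0}
fill_exclusive(l2_e, l3_e, 0xA0)
```

The upside of exclusive is effective capacity (no duplicates); the upside of inclusive is simpler snooping, since the L3 alone can answer whether any core holds a line.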
If you run one single application, then yes, this is not the optimal solution. If you run many applications, you start to get better speed as the load increases. I think AMD has tried to construct the CPU for stable performance rather than peak performance, though if the load changes it can degrade faster.
The reason I said Barcelona/Shanghai's L3 acts as an inclusive cache is that it keeps a duplicate of the data if that data is likely being accessed by multiple cores (thus the sharing between all cores via L3); if it were strictly exclusive, the data would be sent to L1 only, to be copied to L2 and then L3 on eviction. As you said, it's pseudo-exclusive; that means exceptions, and this is one of them.
You are right about Nehalem; I forgot about the reworked memory system.
In prefetching it's all about size, isn't it? Nehalem has small but fast L2s compared to K10/K10.5, which means whatever doesn't fit in L2 goes to L3. AMD's L2, even at 16-way, is faster for prefetching than the L3 in Nehalem. When working with large data chunks, Nehalem may use L3 more than L2. On top of that, the associativity that the L2 grants is no laughing matter. Take a look at Sandy Bridge and you will find the answer to the large-data-chunk problem; to me Sandy Bridge looks like a prefetching monster.
overall, AMD rullez :-)
ROG Power PCs - Intel and AMD
CPUs: i9-7900X, i9-9900K, i7-6950X, i7-5960X, i7-8086K, i7-8700K, 4x i7-7700K, i3-7350K, 2x i7-6700K, i5-6600K, R7-2700X, 4x R5 2600X, R5 2400G, R3 1200, R7-1800X, R7-1700X, 3x AMD FX-9590, 1x AMD FX-9370, 4x AMD FX-8350, 1x AMD FX-8320, 1x AMD FX-8300, 2x AMD FX-6300, 2x AMD FX-4300, 3x AMD FX-8150, 2x AMD FX-8120 125 and 95W, AMD X2 555 BE, AMD X4 965 BE C2 and C3, AMD X4 970 BE, AMD X4 975 BE, AMD X4 980 BE, AMD X6 1090T BE, AMD X6 1100T BE, A10-7870K, Athlon 845, Athlon 860K, AMD A10-7850K, AMD A10-6800K, A8-6600K, 2x AMD A10-5800K, AMD A10-5600K, AMD A8-3850, AMD A8-3870K, 2x AMD A64 3000+, AMD 64+ X2 4600+ EE, Intel i7-980X, Intel i7-2600K, Intel i7-3770K, 2x i7-4770K, Intel i7-3930K. AMD Cinebench R10 challenge | AMD Cinebench R15 thread | Intel Cinebench R15 thread
OK sorry ajaidev, but I think you don't really understand prefetching.
Prefetching is a method to predict which data the CPU will request in the future based on its access patterns in the past. Simple prefetching mechanisms are next-line (prefetch the sequential piece of memory after a load/store request), stride-based (determine if there's a constant stride (e.g. 0, 4, 8, 12, 16, ...) and prefetch based off the stride), and target-based (keep track of branches that cause misses).
Generally speaking, prefetch requests are the lowest priority accesses to memory, meaning that it will not "slow down the work that is done". However, there can be cases where a too-aggressive prefetching mechanism can evict cache lines that will be used in the future. For most applications, this problem is avoided through application profiling and tuning of the prefetching algorithm.
As a general rule of thumb, prefetching can improve the performance of most applications regardless of the amount of bandwidth available and even with "low latency" DDR. Remember that accessing "low latency" DDR3 is still ~50-70ns round trip, which translates to at least 150 cycles at 3GHz. Compare this to even the slowest 30 cycle L3. The fact that prefetching is even done from L3 -> L2 or L2 -> L1 (move items from lower level caches to higher level caches), which saves a "mere" 10 or so cycles, should clue you in on how important it is.
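A stride-based prefetcher of the kind described above can be sketched in a few lines of Python. This is purely illustrative; real hardware designs are far more elaborate (per-PC tables, confidence counters, and so on).

```python
# Minimal stride prefetcher sketch: watch consecutive demand addresses,
# and once the same stride is seen twice in a row, predict the next one.

def stride_prefetch(addresses):
    """Return the list of predicted prefetch addresses for an access stream."""
    predictions = []
    last, stride = None, None
    for addr in addresses:
        if last is not None:
            new_stride = addr - last
            if new_stride == stride:            # stride confirmed -> prefetch ahead
                predictions.append(addr + stride)
            stride = new_stride
        last = addr
    return predictions

# A constant-stride stream (0, 4, 8, 12, 16) trains the detector quickly:
print(stride_prefetch([0, 4, 8, 12, 16]))   # -> [12, 16, 20]
```

An irregular stream like `[0, 7, 3]` never confirms a stride, so nothing is prefetched; that is exactly the "detect a scenario first" behavior discussed later in the thread.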
Last edited by Shadowmage; 09-03-2009 at 11:25 AM.
oh man
enough with the cache; what kind of clocks can we expect? You know, the important details
2.8-3.0GHz seems reasonable on 45nm within a 125W TDP envelope. Remember that AMD will release a new stepping later this year, so I think 125W will be possible at such "high" clocks.
There are also rumors of 32nm desktop parts based on K10.5 but I have no clue if those are true or not...
SweClockers.com
CPU: Phenom II X4 955BE
Clock: 4200MHz 1.4375v
Memory: Dominator GT 2x2GB 1600MHz 6-6-6-20 1.65v
Motherboard: ASUS Crosshair IV Formula
GPU: HD 5770
Actually, that's what I said: I was talking about the time delay it would take for the Nehalem architecture to prefetch from L3 (L3->L2 and then L2->L1) compared to K10's L2 (L2->L1).
I never said memory prefetching is better than cache prefetching !!!
Even a 3GHz 6-core with a 140W TDP is great in my book. By 32nm desktop parts do you mean AMD Llano? If AMD does release a K10.5 core at 32nm in 2011 they are out of their minds; Sandy Bridge will totally kill them.
Last edited by ajaidev; 09-03-2009 at 12:48 PM.
Sorry, I have no info about exact implementation of Intel's prefetchers.
In Nehalem, data/instructions are prefetched directly into the L1 or L2 cache depending on the detected scenario. In fact, Nehalem implements two prefetchers for L1 and two prefetchers for L2 for different scenarios. Intel also stated that Nehalem's prefetchers are more effective and no longer hurt performance in inconvenient scenarios (as they did in Core 2).
Last edited by kl0012; 09-03-2009 at 02:22 PM.
prefetch:
Reading one byte will not read just one byte from memory; it reads a whole cache line (64 bytes).
If the CPU guesses what comes next and loads the next cache line (another 64 bytes), isn't that going to thrash the cache very fast?
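The 64-byte line granularity itself is easy to show: the line a byte belongs to is just its address with the low 6 bits cleared. A quick Python sketch (line size assumed to be 64 bytes, as on the CPUs discussed here):

```python
# Every demand access fetches a whole cache line, not a single byte.
# With 64-byte lines, the line base address has the low 6 bits cleared.
LINE = 64

def line_base(addr):
    """Base address of the cache line containing addr."""
    return addr & ~(LINE - 1)

# Reading the single byte at 0x12345 actually pulls in 0x12340..0x1237F:
print(hex(line_base(0x12345)))   # -> 0x12340
```

So a next-line prefetch brings in bytes the program may never touch; whether that pollutes the cache depends entirely on the access pattern, which is the crux of this argument.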
He is goading you... you will not find that kind of information, nor does he know it; caching policies and algorithms have never been disclosed to any level of detail that would satisfy a good answer to his question.
What he is trying to get you to admit is that for sparse memory locality, large cache-line fetches pollute the cache -- and that Intel's prefetching sux rocks.
In some cases, as you are obviously aware, HW based prefetching acts as a detriment as opposed to a benefit. However, his logic -- through generalization -- would then make one wonder why Intel and AMD are so stupid to even use HW prefetching and not trust the OS or app itself to manage the cache policy... go figure, I am certain AMD and Intel love wasting transistor budget implementing something that is completely useless (of course, I am being facetious).
The truth of the matter is, HW prefetching over sparse memory locality is much more of a concern in large-memory-footprint applications, such as transactional database software. For us mere mortals with single-socket desktops, HW prefetching almost always provides a benefit, though it would be fun if BIOS writers gave us an option to disable HWP as they do in servers.
A little googling produced some (though generalized) information concerning Nehalem -- http://www.scribd.com/doc/15507330/I...-Architecture- (page 75):
"In Nehalem data/instructions are prefetched directly into L1 or L2 cache depending on detected scenario. In fact Nehalem implements 2 prefetchers for L1 and two prefetchers for L2 for different scenarios. Intel also stated that Nehalem's prefetchers are more effective and does not hurt performance any more during inconvenient scenarios (as it was in Core 2)."
Nehalem did redo the prefetchers over Merom/Penryn; the most interesting statement is that they removed the need to disable HWP in the Nehalem revision, leading me to believe that they have implemented algorithms to detect sparse fetching patterns and shut the prefetchers down as needed. Who knows; speculation on my part.
There are hundreds of patents in the patent database around prefetching and cache policy; maybe someone with some time can do an exhaustive review to see if there are details on how each player implemented their prefetch mechanisms. What you won't find is any public disclosure. I thought Shadowmage above did a nice job of condensing an explanation of this concept.
This is a true statement:
"Generally speaking, prefetch requests are the lowest priority accesses to memory, meaning that it will not 'slow down the work that is done'. However, there can be cases where a too-aggressive prefetching mechanism can evict cache lines that will be used in the future. For most applications, this problem is avoided through application profiling and tuning of the prefetching algorithm."
Another google yields some good papers on the topic:
http://www.iolanguage.com/Library/Pa.../LECTURE13.pdf
http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
Last edited by JumpingJack; 09-03-2009 at 09:11 PM.
One hundred years from now It won't matter
What kind of car I drove What kind of house I lived in
How much money I had in the bank Nor what my cloths looked like.... But The world may be a little better Because, I was important In the life of a child.
-- from "Within My Power" by Forest Witcraft
By 32nm desktop parts I mean simply a Phenom II shrink... If Llano were K10.5-based, I don't see Sandy Bridge killing it; a lot of tweaks can be done to the architecture. Llano is focused on a market that wants power efficiency, so I'd be careful making such a statement until we know anything about either Sandy Bridge or Llano.
Llano won't be a performance monster; it is aimed as a cheap product for a much wider market, and some clues point towards features such as Z-RAM, for example... Llano "Fusion" is AMD's path towards the same goal Intel has with its Atom architecture: system-on-a-chip, small, power-efficient enough to be used in cellphones and "true laptops with great battery life". The first Llano parts, though, will be aimed at the laptop market and possibly cheap desktops as well, since they will consume too much for things such as cellphones.
Bulldozer will most likely be released about the same time as Sandy Bridge.
Last edited by Smartidiot89; 09-03-2009 at 10:20 PM.
You have a very narrow view of the prefetch mechanism. You're missing that prefetchers don't prefetch data all the time, but only when a specific scenario (such as reading sequential data) has been detected. Then there is an excellent chance that the prefetched line will be required later on. Indeed, a prefetcher can make mistakes, but after all we don't want to get rid of caches (because of the high latency on a cache miss) or branch predictors (because of the pipeline flush on a wrong prediction) either. Besides that, there are many different techniques to avoid unwanted side effects (such as cache thrashing).
Your view of the cache architecture seems narrow to me too. Higher associativity leads to higher latency, so you cannot conclude that more associativity is always better. There is no clear winner here.
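Both sides of that trade-off can be seen in a toy LRU cache simulation (Python; the geometry below is made up for illustration). Two lines that collide in a direct-mapped cache thrash it, but coexist in a 2-way cache of the same total capacity:

```python
# Toy set-associative LRU cache: count misses for a given geometry.
# 64-byte lines assumed; sets are lists ordered from LRU to MRU.

def simulate(accesses, num_sets, ways):
    """Return the miss count for an access trace of byte addresses."""
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for addr in accesses:
        s = (addr // 64) % num_sets          # set index from the line number
        if addr in sets[s]:
            sets[s].remove(addr)             # hit: refresh LRU position
        else:
            misses += 1
            if len(sets[s]) == ways:
                sets[s].pop(0)               # evict the least recently used line
        sets[s].append(addr)
    return misses

# Alternate between two lines that map to the same set (both caches hold 8 lines):
trace = [0, 64 * 8, 0, 64 * 8, 0, 64 * 8]
print(simulate(trace, num_sets=8, ways=1))   # direct-mapped: every access misses
print(simulate(trace, num_sets=4, ways=2))   # 2-way: only the two cold misses
```

The latency cost of higher associativity (more tags to compare per lookup) doesn't show up in a functional model like this, which is exactly why there's no clear winner on paper.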
Yeh, now I see it.
If you want to read more about prefetching and how it is done on CPUs, here are two good PDFs:
http://www.amd.com/us-en/assets/cont...docs/40546.pdf
http://developer.intel.com/design/pr...als/248966.pdf
Yes, I know there are some (very few) scenarios where hardware prefetch can kick in (AMD has prefetchers too), but it isn't a magic bullet that suddenly makes one CPU faster than another (Intel vs. AMD); you can't get sloppy with the L1 cache, and you need to be careful with the L2 cache.
Exactly why I am sceptical about this "X6": the time period between the last two is too long. It would have been out by ~March 2010 at the latest if it was already in the works.
That's what Xbit is saying here:
"Advanced Micro Devices is preparing a desktop processor with six processing engines, sources familiar with the company's plans revealed. The new central processing units (CPUs) will not be available this year, but are likely to boost performance of AMD's desktop platforms sometime in 2010."
"Sometime" in 2010? C'mon.
Sounds to me like a safety net (1.5 years) for Xbit, so that nobody can call them on it till 2011.
Besides, the 2P C32/C34 parts look much more attractive to me.
"FASN 12/24" anyone?
Time will tell, of course. I hope they're right, but.....