Hmmmmm... this is the first time I have heard of cache associated with the northbridge....
You may be thinking of the time when L2 was off-die, in which case it was accessed through the northbridge... via the backside bus, then later on the frontside bus. In fact, the term northbridge is historic: in block diagrams, this chip (with the memory controller) and the L2 sat north of the CPU... :)
Later, the northbridge moved down below the CPU between the CPU and southbridge, but the L2 was still distinct.
Here is an example, but the diagram is from after the northbridge moved down:
http://www.via.com.tw/en/products/ap.../blockmvp4.gif
Here is another example:
http://www.via.com.tw/en/products/ap.../blockmvp3.gif
Of course, I could be wrong -- I am going from memory mostly. Hang tight, I will go look up the 'evolution of the northbridge' paper that shows the history, if I can find it.
EDIT: I knew there was a configuration for off-die L2 cache through the backside bus.... still cannot find the paper, but found the block diagram:
http://www.karbosguide.com/images/_975.gif
Yes Jack, on K7 the L2 was driven from the backside bus :up: (the cache ran at 2/3 or 3/5 of the core clock and was packed on the Slot A module), but earlier than that I'm not sure TBH!
It was my personal experience which led me to believe that L2 cache on motherboards was connected to the northbridge, because CPU performance varied considerably depending on the motherboard you used... (I'm speaking solely about cache-intensive tests).
Besides, if the L2 cache on older motherboards was driven by the CPU backside bus, then how on earth could a very old P60 know about the 2MB cache on my Epox board?
Edit: I found something!
Link
Quote:
M1541 includes the higher CPU bus frequency (up to 100 MHz) interface for all Socket-7 compatible processors, PBSRAM and Memory Cache L2 controller to reduce cost and enhance performance, high performance FPM/EDO/SDRAM DRAM controller, PCI 2.1 compliant bus interface, smart deep buffer design for CPU-to-DRAM, CPU-to-PCI, and PCI-to-DRAM to achieve the best system performance. It also has the highly efficient PCI fair arbiter. M1541 also provides the most flexible 64-bit memory bus interface for the best DRAM upgrade-ability and ECC/Parity design to enhance the system reliability.
a nice picture reference for mobos http://redhill.net.au/b/b-92.html
but it ends at 2002
At Barcelona's presentation it was said that the L3 cache runs at core clock frequency, not at the northbridge's (which runs a little lower, i.e. on a 2GHz Barcelona the northbridge runs at 1.6GHz on a uni-plane board, and at 1.8GHz on a dual-plane board).
But the L3 cache being a part of the northbridge does make sense, given the latency numbers we've seen and all the async stuff...
Let's put it in simple words, as guys speaking techno will lose most other people, and that way no one learns or benefits accurately, which defeats the whole purpose of sharing and explaining. :)
I mentioned the L3 cache many pages back. AMD outlined it at "less than 38 cycles"; 38 cycles is its slowest latency. The L3 is just a victim cache. It is very good for improving the latency of data transfer to the core which needs it.
Usually, the data (retained in cache lines) evicted from the L2 cache has to be thrown back to memory (RAM). If needed again, it has to be re-fetched, and that off-die round trip has far higher latency than an on-die access.
Thus, by including "another" on-die cache, this time to retain L2-evicted data, they can refill the L2 or, mostly, send the data directly to the L1D cache very quickly, and any core can access the data (shared cache) far quicker than when going to memory.
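To put that flow in code form, here is a toy Python model of a victim cache (the sizes are made up and the eviction policy is plain LRU, not K10's real sharing-aware policy): lines evicted from the "L2" drop into the "L3", so a re-reference is a short on-die refill instead of a trip to RAM.
Code:
from collections import OrderedDict

class TinyVictimCache:
    """Lines evicted from 'L2' drop into 'L3' instead of going straight
    back to RAM, so a re-reference is a cheap on-die refill."""
    def __init__(self, l2_lines=4, l3_lines=8):
        self.l2_lines, self.l3_lines = l2_lines, l3_lines
        self.l2 = OrderedDict()            # most-recently-used at the end
        self.l3 = OrderedDict()

    def _fill_l2(self, line):
        if len(self.l2) >= self.l2_lines:
            victim, _ = self.l2.popitem(last=False)   # evict the LRU L2 line
            if len(self.l3) >= self.l3_lines:
                self.l3.popitem(last=False)           # L3 spills back to RAM
            self.l3[victim] = True                    # ...into the victim L3
        self.l2[line] = True

    def access(self, line):
        if line in self.l2:
            self.l2.move_to_end(line)
            return "L2 hit"
        if line in self.l3:
            del self.l3[line]              # victim hit: short on-die refill
            self._fill_l2(line)
            return "L3 (victim) hit"
        self._fill_l2(line)                # full-latency trip to memory
        return "fetched from RAM"

c = TinyVictimCache()
for addr in (0, 1, 2, 3, 4, 0):            # line 0 gets evicted by line 4,
    print(addr, c.access(addr))            # then comes back from L3, not RAM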
The inclusion of this benefits AMD's K10 much like the large L2 does for Core 2/Penryn, and that's why it will be increased proportionally with time as die size decreases in K10.
IIRC the L3 populates itself from the L1D cache, the L2 cache and also memory, based on prediction algorithms too.
K8 is not K10, so don't use its architectural features to undermine or imply those of K10 - just because one doesn't know any accurate information or any better.
L2 and L3 cache use 64B lines. The L3 cache controller is variable and flexible, supporting up to 8MB.
Also "Size and associativity of the AMD Family 10h processor L2 cache is implementation dependent. See the appropriate BIOS and Kernel Developer’s Guide for details."
The L3 cache is not inclusive, nor orthodoxly exclusive, but tweaked - a cache line can be fetched into the L1D and still be retained in L3 for core sharing, based on sharing history which is tracked.
L3 doesn't evict using the same LRU algorithm the K8 used for L2, but evicts based on the LRU unshared cache line.
Northbridge and core frequency together somewhat determine the L3 cache's relative latency.
The L3 cache and IMC (northbridge) run at speeds and voltages independent of all cores.
"Furthermore, the cache features bandwidth-adaptive policies that optimize latency when requested bandwidth is low, but allows scaling to higher aggregate L3 bandwidth when required (such as in a multi-core environment)." http://www.amd.com/us-en/assets/cont...docs/40546.pdf
"L3 cache latency will evidently be higher than L2 cache latency. However, AMD materials suggest that the latency will vary adaptively depending on the workload. If the workload isn’t too heavy, latency will improve, and under heavy workload the bandwidth will rise. We still have to check what really stands behind this." http://www.xbitlabs.com/articles/cpu...amd-k10_8.html
You already know about HT and its implementational advantages - exactly why Intel is developing CSI for high bandwidth, and some of their CSI design team left over this: http://www.realworldtech.com/page.cf...05-05-2006#361
HT link bandwidth depends on the controller implemented, and increasing it makes for a really power-hungry system, which is why processor architects will hold off on doing so except in limited very high-end parts and on future shrunk nodes.
In all honesty, Barcelona front end and out-of-order engine is a very complex architecture.
Each clock cycle fetches 32B of data from the L1 instruction cache into the predecode/pick buffer - K8 does 16B - Core 2 does 16B.
The data travels on a bi-directional bus which is 256-bit wide - K8 has it at 128-bit - Core 2 is at 128-bit.
The predecode and pick buffer may be increased to 48B.
Direct branch prediction is improved > the global history register tracks the last 12 entries - the K8 one tracked 8.
New indirect branch prediction with 512 entries.
The FIFO stack holding the return addresses of function calls is 24 entries - K8 had 12.
Core 2 uses a single execution cluster with a unified scheduler and reservation stations across multiple ports - K10, like earlier Athlons, has split integer and floating-point clusters with distributed schedulers and reservation stations.
IDIV instructions now have variable latency rather than the fixed iteration count of K8. K10's 32-bit divide latency is roughly 10 cycles lower than K8's.
The third ALU now handles LZCNT/POPCNT.
The ALUs are optimized more for power efficiency than peak performance ATM.
Out-of-order memory access prevents the operation stalls seen in K8 (especially with loads).
In K10, each core has 8 prefetchers which fetch data into the L1D cache, whereas in K8 data was prefetched into the L2 cache.
That's some of its better features over the K8 (let alone the DRAM controllers), which make it by no means equal. From what I can see, the Barcelona quintessence is the Budapest core at higher clock speeds.
Now to the argument that efficiency (a) can't be, and can be, correlated with frequency (b) in a format we don't expect - all based on what Gary said. Gary simply commented on what he saw. Processors have an upper limit whereafter more frequency = little performance gain, a mid optimum band where more frequency = an inclining performance gradient, and a lower band whereby frequency = sub-optimal performance.
I will provide an example from an older CPU I have, with some SPI mod 1.5XS 1M tests I ran 1-3 months or so back (P4 Celeron D, D0 stepping, S478, Malaysia week 21 of '04). Memory is synchronous, 512MB.
Clock (MHz) - Time (sec) - Scaling
1802MHz = 116.547sec = 100%
1997MHz = 104.110sec = 112%
2104MHz = 97.480sec = 119.6%
2205MHz = 73.455sec = 158.67%
2302MHz = 88.307sec = 132%
2402MHz = 85.009sec = 137%
2504MHz = 80.486sec = 144.8%
2604MHz = 77.161sec = 151%
2701MHz = 73.541sec = 158.47%
2798MHz = 70.431sec = 165.5%
2999MHz = 65.044sec = 179%
3201MHz = 60.396sec = 193%
3301MHz = 60.036sec = 194%
3360MHz = 48.460sec = 240.5%
3503MHz = 47.849sec = 243.6%
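For reference, the scaling column is simply each run normalized against the slowest (1802MHz) time; a quick Python sketch that reproduces it from the times above:
Code:
# Scaling = slowest time / time, as a percentage (SuperPi reports
# time-to-complete, lower being better).
times = {1802: 116.547, 1997: 104.110, 2104: 97.480, 2205: 73.455,
         2302: 88.307, 2402: 85.009, 2504: 80.486, 2604: 77.161,
         2701: 73.541, 2798: 70.431, 2999: 65.044, 3201: 60.396,
         3301: 60.036, 3360: 48.460, 3503: 47.849}

base = times[1802]                       # slowest run = 100%
for mhz in sorted(times):
    t = times[mhz]
    print(f"{mhz}MHz = {t:.3f}sec = {base / t * 100:.1f}%")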
See what happened from 1800-2300 and from 3300-3360? Now let's place K10 in there assuming relative scaling like I witnessed with my P4 (not compared to P4 though, but K8). What would happen as a result?
Base comparison: (P4 1800) K8 1900 = 100%
(P4 2300) K10 1900 = 132%
(P4 2400) K10 2000 = 137%
(P4 2500) K10 2100 = 145%
(P4 2600) K10 2200 = 151%
(P4 2700) K10 2300 = 158%
(P4 2800) K10 2400 = 166%
(P4 2900) K10 2500 = 171%
(P4 3000) K10 2600 = 179%
(P4 3100) K10 2700 = 186%
(P4 3200) K10 2800 = 193%
(P4 3300) K10 2900 = 194%
(P4 3400) K10 3000 = 240% **PEAK MOST STABLE PERFORMANCE**
(P4 3500) K10 3100 = 243%
(P4 3600) K10 3200 = 244%
(P4 3700) K10 3300 = NA
This scenario I've witnessed, and it is based on real-life processor performance, so it's equally possible. With AMD and others touting 3GHz Phenom again and again, this is loosely what I would expect.
Here's one Phenom MB BTW: http://img170.imageshack.us/img170/2...63c0cf7kk7.jpg
:hehe: nice figures....but the equating of k10 to real figures is....purely hypothetical.
how is the k8=100% (@1900) and k10=132% (@1900) derived? - does this assume a 32% gain for k10 over k8 baseline?
and does it also assume identical scaling for k10 versus an old p4?
:D It can get far worse believe me. :)
One of the points I wanted to show is how it clearly and perilously casts doubts on this http://img.coolaler.com.tw/images/zm...mlwemyzmzk.jpg compared to this http://img527.imageshack.us/img527/6582/sp47cj7.jpg
Bro, that was 1x 512MB RAM, not even dual channel, and an old CD. A P4 would get at least 10 seconds lower, beating the Barcelona quad core time they showed.
Yes, the K8 1900 was the baseline compared to a K10 1900 - as an assumption of better clock-for-clock performance. It could be far lower, obviously, but this is a vague explanatory comparison of what Gary stated rather than a "prediction".
Quote:
how is the k8=100% (@1900) and k10=132% (@1900) derived? - does this assume a 32% gain for k10 over k8 baseline?
What I was saying is, Gary of Anandtech could be basing his statements on similar performance scaling he saw with the K10, as I did with the Celeron chip.
Yes. It's not impossible, is what I'm saying. Look at how the numbers fluctuate with the Celeron, and where. Pure technical math cannot account for this, so we won't be able to explain it, but we'll experience it. It's possible that K10 at lower clocks does not scale as well as at some higher clocks. I've just shown you, in one application, how my old Celeron did it, which means it's entirely possible.
Quote:
and does it also assume identical scaling for k10 versus an old p4?
...but it is a real scaling example...ie p4.
He is using it to argue non-linear scaling, but he misinterprets the data...
First, there are two typos, or at least I think they are typos. The first is at 2205 MHz: he says 73.455, but I think he meant 93.455, which is in line with what would be expected. The other is 3301 or 3201: it looks like he ran the same run twice at the same frequency but recorded a different speed. Here is a plot of his SP1M time vs frequency:
http://img470.imageshack.us/img470/2...ata1zf7.th.jpg
If you correct his typo (heck, sometimes my 9's look like 7's), then it behaves more as expected. For example:
http://img374.imageshack.us/img374/4...ata2qw1.th.jpg
The typos or inconsistent data points are not what really matters. What does matter is that he normalizes to the slowest time and converts to a percentage... he makes the most common mistake one makes when analyzing time-to-complete rather than rate... SP1M is measured in the time it takes to complete the task, but processor speed is measured in frequency... he cannot calculate scaling factors when the units of one dimension are the inverse of the other... he should have taken 1/time and plotted it against frequency to check linearity; when you do that, it becomes completely linear:
http://img470.imageshack.us/img470/4...ata3fu9.th.jpg
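Here is a quick Python sketch of that check if you want to run it yourself: convert each time to a rate (1/time) and least-squares fit the rate against frequency. The data is KTE's table with the 2205MHz entry corrected to 93.455 as suggested, and the duplicated 3301/3201 run dropped.
Code:
data = [(1802, 116.547), (1997, 104.110), (2104, 97.480), (2205, 93.455),
        (2302, 88.307), (2402, 85.009), (2504, 80.486), (2604, 77.161),
        (2701, 73.541), (2798, 70.431), (2999, 65.044), (3201, 60.396),
        (3360, 48.460), (3503, 47.849)]

freq = [f for f, _ in data]
rate = [1.0 / t for _, t in data]              # rate domain: higher is better

n = len(data)
mf, mr = sum(freq) / n, sum(rate) / n
slope = (sum((f - mf) * (r - mr) for f, r in zip(freq, rate))
         / sum((f - mf) ** 2 for f in freq))
icept = mr - slope * mf

ss_res = sum((r - (slope * f + icept)) ** 2 for f, r in zip(freq, rate))
ss_tot = sum((r - mr) ** 2 for r in rate)
print(f"R^2 = {1 - ss_res / ss_tot:.4f}")      # the closer to 1, the more
                                               # linear the clock scaling
for f, r in zip(freq, rate):                   # residuals expose leftover
    print(f, f"{r - (slope * f + icept):+.6f}")  # outliers (e.g. 3360MHz)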
In short, his data shows nothing but a handful of mistakes, and that SP1M does indeed scale linearly with clock speed.
Jack
http://img470.imageshack.us/img470/4914/ktedata3fu9.jpg
jumpingjack's scaling graph.
the non-linear data does look a bit fishy.
Is there a published scientific paper on that?
Quote:
he cannot calculate scaling factors when the units of one dimension is the inverse of the other.... he should have taken 1/time and plotted against fequency to check linearity, when you do that, it becomes completely linear:
That's reminiscent of people pricing things at $24.99 instead of $25 to make people see the smaller number. It's not hard to make your numbers look good when graphing; the same thing goes for finding percentages.
As for normalizing to a slower time, we have to use stock as the baseline, else you are going on prediction only. Using percentages wasn't needed, but using either percentage increase over stock or performance product factors would have had the same outcome, just maybe more confusing numbers for some.
No, this is Jr. High math... Just plot it... :) ... SuperPi measured in time is an inverse function of frequency, as it should be, because frequency is 1/time... this is not rocket science... heck, run the experiment yourself. Here is an X6800, stretching out the range of time and frequency so it is easier to see the functionality:
First the raw data:
http://img373.imageshack.us/img373/9...ing1bi5.th.jpg
Next, SP in time domain vs Frequency:
http://img470.imageshack.us/img470/9...ing2dq8.th.jpg
Finally, in the correct domain (i.e. reciprocal time, taking 1/time, which makes it a linear function of frequency) for SP 1, 2, 4 and 8M:
http://img470.imageshack.us/img470/3...ing3hj8.th.jpg
http://img470.imageshack.us/img470/9...caling2dq8.jpg
does this not indicate that the performance increase is non-linear and that (as i thought) scaling gives diminishing returns...?
i.e. once you reach the shallower part of the curve.
i.e. performance increase starts off sharp then plateaus :) - such that higher and higher clocks give less and less performance increase delta... for a given core at the speeds plotted on the graph.
it is a parabolic curve (i think)
and it is a long time since i did school mathematics.
the scaling is NOT linear - extrapolating beyond 3500 (speed) in this example yields negligible performance returns; indeed the difference between 2500 and 3500 is s.f.a.
Not exactly; that is the point of how a processor understands time... a processor only understands a clock tick: a signal that goes high then low. The total time for that is irrelevant. You and I perceive time, not clock ticks, so in your words this is correct... a point of diminishing returns.
Frequency is cycles per second, or cycles/second. As frequency increases, the number of ticks within one second increases. However, calculating Pi in SuperPi only depends on X number of clock ticks performing X number of instructions. As you increase frequency, you increase the number of clock ticks per unit time, but you are observing unit time... so the observed time it takes to complete the task (lower is better) is inversely proportional to frequency, i.e. f(x) = 1/x. Plot a simple 1/x: it approaches zero asymptotically. A parabola is a function of f(x) = x^2, which is different.
Units of a quantity are important, and mathematically they are treated like any variable or number. SuperPi, calculated and reported in time, is not directly linear in frequency, because frequency is in 1/time... to make one a direct function of the other, simply convert SuperPi from the time domain to the frequency domain and plot... voila, linear. This is the same as what you will read when people discuss how to calculate % improvement for benches that are 'slower is better': slower-is-better benchmarks always approach a 'point of diminishing returns' when plotted in time... go check it out - find any review of a series of processors varying as a function of frequency on the same core where the reported bench is in units of time, plot time vs frequency, and it will always be inversely proportional... Very simple.
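A tiny worked example of that unit trap, with made-up numbers: the same 25% clock increase looks like only a 20% gain if you difference the times instead of inverting them first.
Code:
t_slow, t_fast = 50.0, 40.0                # seconds; same core, 1.25x clock

naive = (t_slow - t_fast) / t_slow * 100   # 20% - understates the gain
speedup = (1 / t_fast) / (1 / t_slow)      # == t_slow / t_fast
print(f"naive time delta:    {naive:.0f}%")
print(f"rate-domain speedup: {(speedup - 1) * 100:.0f}%")   # 25%, matching
                                           # the 25% clock increase exactly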
KTE's data is also 'parabolic', to use your word (actually not parabolic - that is the function of a quadratic equation); he just chose 1M, on short time scales and over a smaller frequency range, so he was on a 'flatter' area of the curve... if he did 8M and repeated the same runs, assuming he makes no mistakes, he would get what you see in the X6800 data above, as 8M stretches out the time enough to see the inverse proportionality.
Now, how does this relate to K10... well, not much. We first have to assume that the one data point is correct (i.e. ~39-second SP1M at 2.0 GHz). What people are arguing is that K10 'turns on' after the 2.4-2.6 GHz range such that it scales 'better'... this is an odd way of arguing it, because the digital logic of a CPU is just that: it only knows a clock tick; it does not care how long that tick is when all the transistors flip on and off to give the computational result for that tick... simply speeding up the ticks does not change the logical arrangement of bits and the functional blocks that create the logic to actuate those bits...
From this data point, again assuming it is true, and in the absence of external bottlenecks (such as memory, if that is even important), SuperPi should scale at best linearly with frequency... so, within a few % +/- due to noise (background processes, etc), SP1M for K10 would scale as such:
2.0 GHz == 39 seconds
2.2 GHz == 36.4 seconds
2.4 GHz == 34.2 seconds
2.6 GHz == 32.3 seconds
2.8 GHz == 30.7 seconds
3.0 GHz == 29.3 seconds
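As a sketch, the times in that table match an affine model t(f) = a + b/f with a fixed ~10-second component, rather than pure 1/f scaling from the single 39-second point (that fit is an inference from the table itself, not an official figure):
Code:
# Inferred fit: t(f) = a + b/f, f in GHz. This reproduces the table above
# to the listed precision; pure 1/f scaling from 39 s at 2.0 GHz is shown
# alongside for comparison.
a, b = 10.0, 58.0
for f in (2.0, 2.2, 2.4, 2.6, 2.8, 3.0):
    print(f"{f:.1f} GHz == {a + b / f:.1f} s"
          f"  (pure 1/f would give {39 * 2.0 / f:.1f} s)")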
But this is a gross extrapolation based on one data point. I personally have a hard time believing K10 will give this kind of SuperPi performance - it is barely better than a K8...
Jack
A question.
If you were to slow a processor down, lower than its intended clockspeed, would it be possible that there comes a point at which performance suddenly takes a bigger hit than is expected by the decreased clockspeed?
Is it possible that there are mechanisms in a processor that only contribute above a minimum clockspeed, and work against performance below a certain clockspeed?
what cpu has ever done that?
Quote:
what people are arguing is that K10 'turns on' after 2.4-2.6 GHz range such that it scales 'better'....
not that i'd be complaining if somehow that was built into the cpu (not likely)
multithreaded superpi anyone??
or a chip design that executes many more instructions per cycle or somesuch.
That's definitely interesting. I guess I'm so used to seeing "linear" scaling, that I didn't give it much thought. It will definitely be cool to see otherwise with K10.
Just a guess here concerning SuperPi: is it possible that we'll see non-linear scores due to the L3 cache having decreased latency as the core clock scales higher, or is the latency always the same no matter what the core is clocked at?
The point Jack is trying to make is that SuperPi is LOOPED CODE, and the IPC improvements of K10 are the same no matter what frequency the CPU is running at.
And that's the problem with looped code: it will only give you the performance of a very specific scenario and nothing more.
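To put that in the simplest terms, here is a minimal model (the workload and IPC numbers are purely illustrative): looped code has a fixed instruction count and a fixed IPC for a given core, so runtime moves only with the clock.
Code:
def runtime_s(instructions, ipc, ghz):
    # time = work / rate; IPC is a property of the core + code, not the clock
    return instructions / (ipc * ghz * 1e9)

N, IPC = 2.0e11, 1.7                       # hypothetical workload and core
for ghz in (2.0, 2.5, 3.0):
    print(f"{ghz} GHz -> {runtime_s(N, IPC, ghz):.1f} s")   # falls as 1/f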