Nahhhh, we would get bored.
Yeah, and I cannot believe I was so ignorant not to think this through. The L3 cache is shared, but each core is throttled independently. The overall intrinsic latency of the L3 will be like any cache's: it is fixed by the size and the quality of the process technology, as well as the speed paths set at design.
However, since each core will throttle depending on load, the clock for each core can be different from the L3's... necessitating an asynchronous bus such that each core can still access the data....
Now, simple asynchronous communications will always have variable latency (as a function of the ratio or divider), as one agent will need to wait on the other at some point. For example, say you have a 6:5 divider; let's call the agents A and B, so it is 6:5 A:B. To make this easy, let's say there is a 1-bit line, so in 5 clock ticks it will send 5 bits to agent B, but agent A has put 6 clock ticks into the queue; one cycle will be left hanging until the next revolution around.... temporally this would make no difference, but agent A is only as fast as agent B.....
But you also have to add into the mix the physical latency of the circuit to do this work.... it is a trade off, one that AMD obviously believes is better in the long run... so long as L3 'observed' latency is much less than that to main memory, there is a benefit.
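The 6:5 divider example above can be sketched in a few lines. This is a toy model with an ideal FIFO between the two clock domains (the function name and structure are mine, purely illustrative): items produced in domain A can only leave in domain B's consume slots, so with a 6:5 ratio the wait grows every revolution until A throttles down to B's pace.

```python
import math
from fractions import Fraction

def crossing_latencies(fa, fb, n):
    """Wait time for each of n items produced in clock domain A (fa ticks/s)
    and consumed in clock domain B (fb ticks/s) through an ideal FIFO.
    Exact rational arithmetic so the cyclic pattern stays visible."""
    ta, tb = Fraction(1, fa), Fraction(1, fb)
    waits = []
    next_slot = Fraction(0)              # B's next free consume slot
    for i in range(n):
        arrive = i * ta                  # item i leaves domain A
        # earliest B slot at or after arrival, and after the previous item
        slot = max(next_slot, math.ceil(arrive / tb) * tb)
        waits.append(slot - arrive)
        next_slot = slot + tb
    return waits

# 6:5 A:B -- A outruns B, so each revolution leaves one transfer hanging
# and the wait keeps growing until A is throttled to B's pace.
uneven = crossing_latencies(6, 5, 7)
# 1:1 -- perfectly matched clocks, no waiting at all
even = crossing_latencies(5, 5, 7)
```

With matched clocks every item transfers immediately; with the 6:5 divider the per-item wait climbs by 1/30 s each tick, which is the "one cycle left hanging" effect in FIFO form.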
Yes Jack, this is a simplified reason why it occurs.
Quote:
However, since each core will throttle depending on load, the clock for each core will be different from the L3's... necessitating an asynchronous bus such that each core can still access the data....
Now, simple asynchronous communications will always have variable latency, as one agent will need to wait on the other at some point. For example, say you have a 6:5 divider; let's call the agents A and B, so it is 6:5 A:B. To make this easy, let's say there is a 1-bit line, so in 5 clock ticks it will send 5 bits to agent B, but agent A has put 6 clock ticks into the queue; one cycle will be left hanging until the next revolution around.... temporally this would make no difference, but agent A is only as fast as agent B.....
I thought you saw Kanter's article long ago (it has been online since mid-May, I think)
This is not entirely true...
I read somewhere a long time ago that the L3 in K10 acts more like a memory layer. In other words, it is clocked by the IMC independently of all 4 cores, and on a diagram I would put it after the crossbar...
That's why L3 latency can vary from the core's point of view (the cache's own latency is probably constant). It is similar to how DDR2-800 latency (again, from the CPU's point of view) is different compared to DDR2-667 (same timings, of course :) ).
Edit: JumpingJack you typing too fast :) I barely read page 16 and typed my response and here surprise! another page with new info making my post partially obsolete :)
I am not sure I understand if you understand what I am trying to say :) ...
A shared resource clocked at one speed, serving 4 other resources clocked at different speeds, will necessitate asynchronous communications... there is no other way... thus AMD must provide functionality to account for floating clocks between 4 cores and one memory pool, the L3.... just adding circuits to do this work will incur latency...
Add on top of that: 1:1 divider latency < 3:2 divider latency < 2:1 divider latency... hence the 'observed' latency from any core is variable...... at least if you read Kanter's article, this is what the FIFO buffers do... he did not mention the x-bar.
There is research ongoing to work on achieving both low BW and low latency asynchronous networking, but there has always been this fundamental trade-off:
http://www.ee.technion.ac.il/courses...OC-async05.pdf
Quote:
Previously published NoCs which provide GS are ÆTHEREAL [18][9] and NOSTRUM [14]. Both are synchronous and employ variants of time division multiplexing (TDM) for providing per connection bandwidth (BW) guarantees. TDM has the drawback of the connection latency being inversely proportional to the BW, thus connections with low BW and low latency requirements, e.g. interrupts, are not supported.
Not quite the paper I would use, but it is a recently written one that summarizes the issue at hand, and one I could quote as a source so you don't have to take my word for it .... i.e. connection latency is hard to get very low in networks where a global clock is not real.... here he discusses time-division multiplexing, a type of clock dividing.
Edit: Found another paper which is much more detailed, and has some info on the FIFO implementation over a global clock:
http://www.collectionscanada.ca/obj/...11/MQ34126.pdf
Quote:
Simulation results for the FIFO and the two versions of the adder are given in Table 1. The optimized adder has 2-input C-elements while the other adder is using 4-input C-elements. The operations/second indicate the number of logic evaluations done per second in each basic cell. Cycle time is the fastest time at which the pipeline can send out successive data values. Latency is the time it takes for data to go from the input of the circuit until it is finally ready at the output. Pipelined systems work on the principle of reducing the cycle time at the cost of increased latency. The next section examines how an enhancement to the system can reduce the latency even further.
(see page 73). This is an old paper, but he shows an 18 ns latency for a straight-up FIFO buffer. This is a large number, and not to be considered true or accurate wrt K10.
Jack
I understand what you're trying to say; that's why I put the edit.
As I said, I read a long time ago (probably on RWT, but not coming from DK) that the L3 will operate in a similar way to normal memory and it will be possible to clock it independently from the cores.
If I'm following your understanding correctly, you're saying that the L3 will be clocked from the highest-frequency core in the CPU (2GHz K10 --> 2GHz L3), which in my opinion is not the case.
Of course asynchronous clocking will add latency, but it might be a good trade-off compared to the gains in power/flexibility. (Besides, look at the L3 latency numbers; they are high for a CPU cache, so clearly we have lots of logic circuitry in between.)
Well, in the end we will find out shortly :)
Edit: I'm just thinking, why would AMD release different Phenom models with differently clocked HTT buses (from official roadmaps)?? The answer could be that together with the increased HTT speed the L3 cache (and IMC) is also clocked higher, and that gives some tangible performance improvements.
;) I don't know how the L3 will be clocked; it will however need one clock, and as Informal and others push the detail envelope, I am beginning to understand some of the L3 details that I had otherwise not really considered.
Your edit could be correct too....
AMD has had quite a bit of experience getting the best clock/latency performance out of differently clocked agents; the IMC is a good example, as are the HT links, all of which time on clocks different from the core but put data into the core....
It is interesting but irrelevant; performance will be what it performs at overall.... and we are hoping it is better than the showing that started this thread.
Here are my results from my Opteron 2218. If you guys want me to run any test on my quad to compare to the Phenom, let me know.
http://i43.photobucket.com/albums/e3...NEBENCHR10.jpg
No problem ... when I get into detailed discussions like this, I tend to be verbose ... being a public forum, a number of people read what we write and, because it is a forum, I post a lot of references and quotes... don't take that as an affront to your knowledge base .... what I do try to do is provide ample detail so others, who may not completely follow, gain some level of understanding... (it also helps me learn more as I go along)
Jack
sorry about that guys
[QUOTE=bobjr;2406885]<censored>[/QUOTE]
:) This is a good way to earn a ban. he... you edited it :)
[QUOTE=JumpingJack;2406887][QUOTE=bobjr;2406885]<censored>[/QUOTE][/QUOTE]Yea, it is, and I'm probably one of the easier-going Mods here.
Quote:
:) This is a good way to earn a ban. he... you edited it :)
I think he and I will have a little talk..:D
:up:
@leoftw
How is your system configured memory-wise? Do you have DIMMs plugged in for both CPU sockets?
Can you run SuperPi? It is not x64 optimized so scores will be very comparable with your system. Same goes for CPU-Z cache latency test.
Thanks for your effort! :up:
EDIT: I just noticed an over 4x speedup in the multi-CPU test! Why is that?? Have you done the 1-CPU test at lower clocks???
Excellent stuff leoftw. That's exactly what we need for a compo.
Any chance you run the other single-threaded benchmarks we saw??
Super pi 1M
CPUmark99
informal
http://cbid.at.tut.by/work/L3Assoc.gif
Quote:
Go laugh on the floor at yourself.
http://www.techarp.com/showarticle.a...tno=424&pgno=2
This cache is 32-way set associative and is based on a non-inclusive victim cache architecture.
This is from BIOS and Kernel Developer's Guide for AMD Family 10h Processors documentation.
This is not surprising.... the associativity of a cache relates to the number of cache lines (or memory blocks) assigned to each set. The associativity will increase with the size of the cache when the number of sets is held fixed, and since AMD will ultimately raise or lower the L3 size, the set associativity must change.
For example, AMD has a 2 MB 32-way associative cache; a 4 MB one would be 64-way associative, and a 6 MB one would be 96-way associative. Since they allow an associativity of 16 in their BIOS guide, it appears AMD may at some point be willing to release a 1 MB L3 cache chip (perhaps; just because it is there does not mean there are plans).
Intel's associativity for Wolfdale will be 24-way for 6 MB but 12-way for 3 MB. Intel has not changed their caching for Wolfdale over Conroe other than raw size: their 2 MB Allendale is 8-way associative, while the 4 MB Conroe is 16-way associative.
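The arithmetic in the post can be sketched directly. This assumes, as the post does, that the set count and line size stay fixed while the cache size varies; the 64-byte line comes from AMD's docs quoted earlier, and the fixed set count is derived here from the known 2 MB / 32-way point (an assumption for illustration):

```python
# If the number of sets and the line size stay fixed, associativity
# (ways) must grow in proportion to cache size.

LINE_BYTES = 64
SETS = (2 * 1024 * 1024 // LINE_BYTES) // 32   # 1024 sets from 2MB/32-way

def ways(cache_bytes, line_bytes=LINE_BYTES, num_sets=SETS):
    """Associativity implied by a cache size with sets and line size fixed."""
    return (cache_bytes // line_bytes) // num_sets

MB = 1024 * 1024
# 1 MB -> 16-way, 2 MB -> 32-way, 4 MB -> 64-way, 6 MB -> 96-way
```

This reproduces all the numbers in the post, including the 16-way entry that suggests a possible 1 MB L3 part.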
Jack
I wonder, will the L3 cache in K10 act also as a snoop filter in multiprocessor systems?
JumpingJack, a shared L3 cache is a configurable part of the Northbridge; a given Northbridge may not include the L3 cache at all.
You may be thinking of the time when L2 was off die, in which case it was loaded off the northbridge... through the backside bus then later on the frontside bus. In fact, the term northbridge is historic, in that in diagrams, this chip (with the memory controller) and the L2 were north of the CPU... :)
Later, the northbridge moved down below the CPU between the CPU and southbridge, but the L2 was still distinct.
Here is an example, though the diagram is from after the northbridge moved:
http://www.via.com.tw/en/products/ap.../blockmvp4.gif
Here is another example:
http://www.via.com.tw/en/products/ap.../blockmvp3.gif
Of course, I could be wrong -- I am going from memory mostly. Hang tight, I will go look up the 'evolution of the northbridge' paper that shows the history, if I can find it.
EDIT: I knew there was a configuration for off-die L2 cache through the backside bus.... still cannot find the paper, but found the block diagram:
http://www.karbosguide.com/images/_975.gif
Yes Jack, on K7 the L2 was driven from the BackSide Bus :up: (the cache ran at 2/3 or 3/5 of the core clock and was packed on the Slot A module), but earlier I'm not sure TBH!
It was my personal experience which led me to believe that L2 cache on motherboards was connected to the Northbridge, because CPU performance varied considerably depending on the motherboard you used... (I'm speaking solely about cache-intensive tests).
Besides, if the L2 cache on older motherboards was driven by the CPU backside bus, then how on earth could a very old P60 know about the 2MB cache on my Epox board??
Edit: I found something!
Link
Quote:
M1541 includes the higher CPU bus frequency (up to 100 MHz) interface for all Socket-7 compatible processors, PBSRAM and Memory Cache L2 controller to reduce cost and enhance performance, high performance FPM/EDO/SDRAM DRAM controller, PCI 2.1 compliant bus interface, smart deep buffer design for CPU-to-DRAM, CPU-to-PCI, and PCI-to-DRAM to achieve the best system performance. It also has the highly efficient PCI fair arbiter. M1541 also provides the most flexible 64-bit memory bus interface for the best DRAM upgrade-ability and ECC/Parity design to enhance the system reliability.
a nice picture reference for mobos http://redhill.net.au/b/b-92.html
but it ends at 2002
At Barcelona's presentation it was said that the L3 cache runs at the core clock frequency, not at the northbridge's (which runs a little lower, i.e. on a 2GHz Barcelona the northbridge runs at 1.6GHz on a uniplane board, and at 1.8GHz on a dual-plane board).
But the L3 cache being a part of the northbridge does make sense, given the latency numbers we've seen and all the async stuff...
Let's put it in simple words, as guys speaking techno will lose most other people, and that way no one learns or benefits accurately, which defeats the whole purpose of sharing and explaining. :)
I mentioned the L3 cache many pages back. AMD outlined it at "less than 38 cycles"; 38 cycles is its slowest latency. The L3 cache is just a victim cache. It is very good for improving the latency of data transfer to the core which needs it.
Usually, the data (retained in cache lines) evicted from L2 cache, has to be thrown back to memory (RAM). If needed again, it has to be re-fetched. This distance->latency is very large in comparison to on-die distance->latency.
Thus, by including "another" on-die cache but this time to retain L2 evicted datum in, they can refill the L2 or mostly, send it directly to L1D cache very quickly and any core can access the data (shared cache) far quicker than when going to memory.
The inclusion of this benefits AMD K10 very much similar to how it does for Core 2/Penryn with the large L2 cache, and that's why it will be increased proportionally with time as die size decreases in K10.
IIRC L3 populates itself from L1D cache, L2 cache and also memory based on prediction algorithms too.
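The victim-cache idea described above can be sketched as a minimal class. This is the generic textbook mechanism only, with plain LRU; per the points below, K10's real policy is sharing-aware and more elaborate, and the class name and sizes here are mine:

```python
from collections import OrderedDict

class VictimCache:
    """Minimal generic victim cache: it is filled only by lines evicted
    from the cache above it (here, 'L2'), and a hit migrates the line
    back up instead of paying the trip to RAM."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()           # addr -> data, LRU order

    def insert_evicted(self, addr, data):
        """A line thrown out of L2 lands here instead of going to RAM."""
        self.lines[addr] = data
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)   # oldest victim falls to RAM

    def lookup(self, addr):
        """On an L2 miss, check here before the long trip to memory."""
        return self.lines.pop(addr, None)    # on a hit the line moves back up
```

A refill from here costs an on-die trip instead of a DRAM round trip, which is the whole latency argument for adding the extra level.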
K8 is not K10, so don't use its architectural features to undermine or make implications about the K10 - just because one doesn't know any accurate information or any better.
L2 and L3 cache use 64B lines. L3 cache controller is variable and flexible to support 8MB.
Also "Size and associativity of the AMD Family 10h processor L2 cache is implementation dependent. See the appropriate BIOS and Kernel Developer’s Guide for details."
The L3 cache is not inclusive, nor orthodoxly exclusive, but tweaked - a cache line can be fetched into the L1D and still be retained in L3 for core sharing, based on sharing history which is tracked.
L3 doesn't evict using the same LRU algorithm the K8 used for L2, but evicts lines based on the least-recently-used unshared cache line.
Northbridge and core frequency somewhat determines L3 cache relative latency.
L3 cache and IMC (northbridge) runs at independent speeds and voltages from all cores.
"Furthermore, the cache features bandwidth-adaptive policies that optimize latency when requested bandwidth is low, but allows scaling to higher aggregate L3 bandwidth when required (such as in a multi-core environment)." http://www.amd.com/us-en/assets/cont...docs/40546.pdf
"L3 cache latency will evidently be higher than L2 cache latency. However, AMD materials suggest that the latency will vary adaptively depending on the workload. If the workload isn’t too heavy, latency will improve, and under heavy workload the bandwidth will rise. We still have to check what really stands behind this." http://www.xbitlabs.com/articles/cpu...amd-k10_8.html
You already know about HT and its implementational advantages - exactly why Intel is developing CSI for high bandwidth, and some of their CSI design team left over this: http://www.realworldtech.com/page.cf...05-05-2006#361
HT link bandwidth is dependent on the controller implemented, and increasing it becomes a real power-hungry affair, which is why processor architects will hold off doing so except in limited very high-end parts and at future shrunk nodes.
In all honesty, Barcelona front end and out-of-order engine is a very complex architecture.
Each clock cycle fetches 32B of data from the L1 instruction cache into the predecode/pick buffer - K8 does 16B - Core 2 does 16B.
The data travels on a bi-directional bus which is 256-bit wide - K8 has it at 128-bit - Core 2 is at 128-bit.
The predecode and pick buffer may be increased to 48B.
Direct branch prediction is improved > the global history register tracks the last 12 entries - K8 one tracked 8.
New indirect branch prediction of 512-entries.
The FIFO stack including the return addresses of function calls 3, 11 and 15 is 24 entries - K8 pushed 12 out.
Core 2 uses a single execution cluster with a unified scheduler and reservation stations across multiple ports, like in Athlons - K10 has a split integer and floating point cluster with distributed scheduler and reservation stations.
IDIV instructions are variable latency and not a fixed iteration as in K8 previously. This K10 32 bit divide latency is roughly 10 cycles faster than in K8.
Third ALU is now for LZCOUNT/POPCOUNT.
ALUs are most optimized for power efficiency rather than peak performance ATM.
Out-of-order memory access prevents operation stalls seen in K8 (especially with load).
In K10, each core has 8 prefetchers which fetch data into the L1D cache whereas in K8 they were prefetched to the L2 cache.
Those are some of its better features over the K8 (let alone the DRAM controllers), which make it by no means equal. According to what I see, the Barcelona quintessence is the Budapest core at higher clock speeds.
Now to the argument of whether efficiency (a) can or cannot be correlated with frequency (b) in a format we don't expect - all based on what Gary said. Gary commented on what he saw, simply. Processors have an upper limit whereafter more frequency = little performance gain, a mid optimum band where more frequency = an inclining performance gradient, and a lower band where the frequencies = sub-optimal performance.
I will provide you an example of an older CPU I have with some SPI mod 1.5XS 1M tests I ran 1-3 months or so back (P4 Celeron D DO S478, Malaysia week 21 of '04). Memory is synchronous 512MB.
Clock (MHz) - Time (sec) - Scaling
1802MHz = 116.547sec = 100%
1997MHz = 104.110sec = 112%
2104MHz = 97.480sec = 119.6%
2205MHz = 73.455sec = 158.67%
2302MHz = 88.307sec = 132%
2402MHz = 85.009sec = 137%
2504MHz = 80.486sec = 144.8%
2604MHz = 77.161sec = 151%
2701MHz = 73.541sec = 158.47%
2798MHz = 70.431sec = 165.5%
2999MHz = 65.044sec = 179%
3201MHz = 60.396sec = 193%
3301MHz = 60.036sec = 194%
3360MHz = 48.460sec = 240.5%
3503MHz = 47.849sec = 243.6%
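The scaling column in the table above can be reproduced as the baseline (1802 MHz) time divided by each measured time, expressed as a percentage - i.e. it is a speed ratio rather than a time ratio (my reading of the numbers, sketched here with a couple of spot checks):

```python
# KTE's scaling column = baseline time / measured time * 100.

BASELINE_TIME = 116.547   # seconds at 1802 MHz

def scaling_pct(time_sec):
    """Percentage speedup relative to the 1802 MHz baseline run."""
    return 100.0 * BASELINE_TIME / time_sec

# 104.110 s -> ~112%, 97.480 s -> ~119.6%, 47.849 s -> ~243.6%
```

Spot-checking against the table: 104.110 s gives ~112% and 47.849 s gives ~243.6%, matching the listed values.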
See what happened from 1800-2300 and from 3300-3360? Now let's place K10 in there assuming relative scaling like I witnessed with my P4 (not compared to P4 though, but K8). What would happen as a result?
Base comparison: (P4 1800) K8 1900 = 100%
(P4 2300) K10 1900 = 132%
(P4 2400) K10 2000 = 137%
(P4 2500) K10 2100 = 145%
(P4 2600) K10 2200 = 151%
(P4 2700) K10 2300 = 158%
(P4 2800) K10 2400 = 166%
(P4 2900) K10 2500 = 171%
(P4 3000) K10 2600 = 179%
(P4 3100) K10 2700 = 186%
(P4 3200) K10 2800 = 193%
(P4 3300) K10 2900 = 194%
(P4 3400) K10 3000 = 240% **PEAK MOST STABLE PERFORMANCE**
(P4 3500) K10 3100 = 243%
(P4 3600) K10 3200 = 244%
(P4 3700) K10 3300 = NA
This scenario I've witnessed, and it is based on real-life processor performance, so it's equally possible. With AMD and others touting 3GHz Phenom again and again, this is loosely what I would expect.
Here's one Phenom MB BTW: http://img170.imageshack.us/img170/2...63c0cf7kk7.jpg
:hehe: nice figures....but the equating of k10 to real figures is....purely hypothetical.
how is the k8=100% (@1900) and k10=132% (@1900) derived? - does this assume a 32% gain for k10 over k8 baseline?
and does it also assume identical scaling for k10 versus an old p4?
:D It can get far worse believe me. :)
One of the points I wanted to show is how it clearly and perilously casts doubts on this http://img.coolaler.com.tw/images/zm...mlwemyzmzk.jpg compared to this http://img527.imageshack.us/img527/6582/sp47cj7.jpg
Bro, that was 1x 512MB RAM not even dual channel and an old CD. A P4 will get 10 seconds lower at least, beating the Barcelona quad core time they showed.
Yes, the K8 1900 was the baseline compared to a K10 1900 - as an assumption of better clock-per-clock performance. It could be far lower, obviously, but this is a vague explanatory comparison of what Gary stated rather than a "prediction".
Quote:
how is the k8=100% (@1900) and k10=132% (@1900) derived? - does this assume a 32% gain for k10 over k8 baseline?
What I was saying is, Gary of Anandtech could be basing his statements on similar performance scaling he saw with the K10, as I did with the Celeron chip.
Yes. It's not impossible is what I'm saying. Look at how the numbers fluctuate with the Celeron, and where. Pure technical math cannot account for this, so we won't be able to explain it, but we'll experience it. It's possible that K10 at lower clocks does not scale as well as at some higher clocks. I've just shown you in one application how my old Celeron did it, which means it's entirely possible.
Quote:
and does it also assume identical scaling for k10 versus an old p4?
...but it is a real scaling example...ie p4.
He is using it to argue non-linear scaling, but he misinterprets the data..
First, there are two typos, or at least I think they are typos. The first is at 2205 MHz: he says 73.455, but I think he meant 93.455, which is in line with what would be expected. The other is 3301 or 3201; it looks like he ran the same run twice at the same frequency but recorded a different speed. Here is a plot of his SP1M time vs frequency:
http://img470.imageshack.us/img470/2...ata1zf7.th.jpg
If you correct his typo (heck, sometimes my 9's look like 7's), then it behaves more as expected. For example:
http://img374.imageshack.us/img374/4...ata2qw1.th.jpg
The typos or inconsistent data points are not what really matters. What does matter is that he normalizes to the slowest time and converts to a percentage... he makes the most common mistake one makes when analyzing time-to-complete rather than rate ... SP1M is measured in the time it takes to complete the task, but processor speed is measured in frequency ... he cannot calculate scaling factors when the units of one dimension are the inverse of the other.... he should have taken 1/time and plotted it against frequency to check linearity; when you do that, it becomes completely linear:
http://img470.imageshack.us/img470/4...ata3fu9.th.jpg
In short, the data shows nothing but a handful of mistakes and that indeed SP1M scales linearly with clock speed.
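Jack's procedure can be sketched numerically: convert the SP1M times to rates (1/time) and check how linear the rate is in frequency. The data is KTE's table with the two suspect points (2205 and 3301 MHz) left out, and the least-squares correlation is computed by hand so nothing beyond the standard library is needed:

```python
# Check linearity of rate (1/time) vs frequency for KTE's SP1M runs.
data = [(1802, 116.547), (1997, 104.110), (2104, 97.480),
        (2302, 88.307), (2402, 85.009), (2504, 80.486),
        (2604, 77.161), (2701, 73.541), (2798, 70.431),
        (2999, 65.044), (3201, 60.396)]

freqs = [f for f, _ in data]
rates = [1.0 / t for _, t in data]   # runs per second

# coefficient of determination (r^2) of a linear fit, computed by hand
n = len(data)
mf = sum(freqs) / n
mr = sum(rates) / n
cov = sum((f - mf) * (r - mr) for f, r in zip(freqs, rates))
var_f = sum((f - mf) ** 2 for f in freqs)
var_r = sum((r - mr) ** 2 for r in rates)
r_squared = cov * cov / (var_f * var_r)
# r_squared comes out essentially 1: rate is linear in clock speed
```

The fit comes out with r-squared above 0.99, which is the "completely linear" claim in quantitative form; plotting raw time against frequency instead would show the 1/x curve.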
Jack
http://img470.imageshack.us/img470/4914/ktedata3fu9.jpg
jumpingjack's scaling graph.
the non-linear data does look a bit fishy.
Is there a published scientific paper on that?
Quote:
he cannot calculate scaling factors when the units of one dimension are the inverse of the other.... he should have taken 1/time and plotted it against frequency to check linearity; when you do that, it becomes completely linear:
That's reminiscent of people pricing things at $24.99 instead of $25 to make people see the smaller number. It's not hard to make your numbers look good when graphing; the same goes for finding percentages.
As to normalizing to a slower time: we have to use stock as the baseline, else you are going on prediction only. Using percentages wasn't needed, but using either percentages of increase over stock or performance-product factors would have had the same outcome, just maybe more confusing numbers for some.
No, this is Jr. High math... Just plot it... :) .... Super Pi measured in time is an inverse function of frequency, as it should be, because frequency is 1/time.... this is not rocket science... heck, run the experiment yourself... here is an X6800 stretching out the range of time and frequency so it is easier to see the functional form:
First the raw data:
http://img373.imageshack.us/img373/9...ing1bi5.th.jpg
Next, SP in time domain vs Frequency:
http://img470.imageshack.us/img470/9...ing2dq8.th.jpg
Finally, in the reciprocal-time domain (i.e. taking 1/time, which makes it a linear function of frequency) for SP 1, 2, 4, and 8M:
http://img470.imageshack.us/img470/3...ing3hj8.th.jpg
http://img470.imageshack.us/img470/9...caling2dq8.jpg
Does this not indicate that the performance increase is non-linear and that (as I thought) scaling gives diminishing returns...?
i.e. once you reach the shallower part of the curve.
i.e. performance increase starts off sharp then plateaus. :) - such that higher and higher clocks give less and less performance increase delta.... for a given core at the speeds plotted on the graph.
it is a parabolic curve (I think)
and it is a long time since I did school mathematics.
the scaling is NOT linear. - extrapolating beyond 3500 (speed) in this example yields negligible performance returns; indeed the difference between 2500 and 3500 is s. f. a.
Not exactly; that is the point of how a processor understands time..... a processor only understands a clock tick: a signal that goes high then low. The total time for that is irrelevant. You and I perceive time, not clock ticks, so in your words this is correct ... a point of diminishing returns.
Frequency is cycles per second. As frequency increases, the number of ticks within one second increases. However, calculating Pi in Super PI only depends on X number of clock ticks performing X number of instructions. As you increase frequency, you increase the number of clock ticks per unit time, but you are observing unit time.... so the observed time it takes to complete a task (lower is better) is inversely proportional to frequency, i.e. F(x) = 1/x. Plot 1/x and it approaches zero asymptotically; a parabola is a function of the form f(x) = x^2, which is different.
Units of a quantity are important, and mathematically they are treated like any variable or number. Super PI reported in time is not directly linear in frequency because frequency is in 1/time... to make one a direct function of the other, simply convert Super PI from the time domain to the frequency domain and plot... voila, linear. This is the same as what you will read when people discuss how to calculate % improvement for benches where 'slower is better'; slower-is-better benchmarks always approach a 'point of diminishing returns'.... go check it out: find any review of a series of processors varying as a function of frequency for the same core where the reported bench is in units of time, and plot time vs frequency; it will always be inversely proportional.... Very simple.
KTE's data is also 'parabolic', to use your word (actually not parabolic; that is the shape of a quadratic function). He just chose 1M, on short time scales and over a smaller frequency range, so he was on a 'flatter' area of the curve... if he ran 8M and repeated the same data, assuming he makes no mistakes, he would get what you see in the X6800 data above, as 8M stretches out the time enough to see the inverse proportionality.
Now, how does this relate to K10... well, not much. We first have to assume that the one data point is correct (i.e. ~39-second SP1M at 2.0 GHz). What people are arguing is that K10 'turns on' above the 2.4-2.6 GHz range such that it scales 'better'.... this is an odd way of arguing it, because the digital logic of a CPU is just that: it only knows a clock tick; it does not care how long that clock tick is when all the transistors flip on and off to give the computational result for that tick... simply speeding up the ticks does not change the logical arrangement of bits and the functional blocks that actuate those bits.....
From this data point, again assuming it is true, and in the absence of external bottlenecks (such as memory, if that is even important), Super Pi should scale at best linearly with frequency ... so, within a few % +/- due to noise (background processes, etc.), SP1M for K10 would scale as such:
2.0 Ghz == 39 seconds
2.2 Ghz == 36.4 seconds
2.4 GHz == 34.2 seconds
2.6 GHz == 32.3 seconds
2.8 Ghz == 30.7 seconds
3.0 Ghz == 29.3 seconds
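For what it's worth, the projected times listed above are not a pure t ~ 1/f scaling (that would give ~26 s at 3.0 GHz). They are, however, consistent with a simple two-part model - a clock-scaled portion plus a fixed overhead, t(f) = a/f + b. This is my reading of the numbers, not something stated in the post; a and b are fitted here from the 2.0 GHz and 3.0 GHz endpoints:

```python
# Two-part model for the projected SP1M times: t(f) = a/f + b,
# where a/f scales with the clock and b is fixed (e.g. memory-bound)
# overhead. Fit from the 2.0 GHz (39 s) and 3.0 GHz (29.3 s) endpoints.
a = (39.0 - 29.3) / (1 / 2.0 - 1 / 3.0)   # ~58.2 GHz*s
b = 39.0 - a / 2.0                         # ~9.9 s fixed overhead

def projected_sp1m(f_ghz):
    """Projected SP1M time in seconds at a given clock, under this model."""
    return a / f_ghz + b
```

This reproduces the intermediate rows (2.2 through 2.8 GHz) to within about 0.05 s, so the table is self-consistent under that assumption.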
But this is a gross estimate based on one data point. I personally have a hard time believing K10 will give this kind of Super Pi performance; it is barely better than a K8....
Jack
A question.
If you were to slow a processor down, lower than its intended clockspeed, would it be possible that there comes a point at which performance suddenly takes a bigger hit than is expected by the decreased clockspeed?
Is it possible that there are mechanisms in a processor that only contribute above a minimum clockspeed, and work against performance below a certain clockspeed?
What CPU has ever done that?
Quote:
what people are arguing is that K10 'turns on' after 2.4-2.6 GHz range such that it scales 'better'....
not that I'd be complaining if somehow that was built into the CPU (not likely)
multithreaded superpi anyone??
or a chip design that executes many more instructions per cycle or somesuch.
That's definitely interesting. I guess I'm so used to seeing "linear" scaling, that I didn't give it much thought. It will definitely be cool to see otherwise with K10.
Just a guess here concerning SuperPI: is it possible that we'll see non-linear scores due to the L3 cache having decreased latency as the core clock scales higher, or is the latency always the same no matter what the core is clocked at?
The point Jack is trying to make is that, SuperPI, is LOOPED CODE and the IPC improvements of K10 are the same no matter what frequency the CPU is running at.
And that's that problem with looped code; it will only give you the performance of a very specific scenario and nothing more.
Nice analysis going on, although it's becoming very much an offshoot of what I stated and did. Scaling wasn't shown in the true mathematical sense; little time to do it. This was data I already had for some time now, copied over. All that was meant to be shown is the % change in SPI 1M times as frequency rises above the 1800MHz time, which is projected as 100%, or stock. Call it 0% if you like, and anything on top is the gained percentage.
K10 comes in where people state affirmatively that something they do not own, nor have seen perform as a released product, cannot react in a certain way, for whatever musings. I've seen erratic jumps in processor performance before, and I'm showing one of them right here.
It's not to "argue" anything but a viable possibility you cannot reject on any grounds yet. You have no evidence to. Or we'd like to see it. :)
No typo; I clearly state these are experimental results. I predicted linear scaling, as most people would, but the investigation says otherwise - something I cannot explain, but I experience.
Conditions are exactly the same for all but >3500 (unstable); in fact I'm sure they were run one after another on the same day, with minimum services/processes running in the background. One thing I clearly stated I didn't and couldn't do is keep the memory asynchronous. RAM timings/divider were kept the same. But the rest I'll show you, and I've repeated it time and time again from around 4 months back. All of the data stands final. ;)
I'll try and find what I still can, one second. Hmm... there's data missing here. I only seem to have a few of the run results saved but have more info in the text file saved instead. Anyway, the interesting ones are all still there, the rest of the missing +100MHz sampling points are progressive and linear as you would expect and I'll try finding them on the other drives. Here ya go.
1800MHz = http://img212.imageshack.us/img212/5...00spi1mlx8.jpg
2000MHz = http://img406.imageshack.us/img406/8...00spi1mus3.jpg
2100MHz = http://img510.imageshack.us/img510/6...00spi1mgc7.jpg
2200MHz = http://img205.imageshack.us/img205/1...00spi1mwn9.jpg
2300MHz = http://img250.imageshack.us/img250/8...00spi1mqk4.jpg
2500MHz = http://img508.imageshack.us/img508/7...00spi1mhm8.jpg
2600MHz = http://img209.imageshack.us/img209/3...00spi1mlb0.jpg
2800MHz = http://img510.imageshack.us/img510/2...00spi1map4.jpg
3000MHz = http://img515.imageshack.us/img515/1...00spi1mod1.jpg
3200MHz = http://img265.imageshack.us/img265/4...00spi1mnr5.jpg
3300MHz = http://img250.imageshack.us/img250/8...00spi1mzk4.jpg
3360MHz (3400 is the same) = http://img210.imageshack.us/img210/4379/spi48qg8.jpg
3500MHz = http://img250.imageshack.us/img250/2747/sp47dm3.jpg
3500MHz is quickest; after that, there's hardly any change if the processor runs it (memory bottleneck). If I had a similar memory and system now, I'd repeat them again just to refresh, but I don't, and I didn't know this was coming so I could prepare - I just did it for my own personal investigation back then.
BTW, science doesn't equal "we expect this, and only this can be true." Broken logic is to expect linear scaling and, when something else occurs, start the conspiracies. Science broadens your horizons to accept observational findings, like the new colossal area devoid of matter found in space, even devoid of dark energy, which was NEVER predicted nor expected at those sizes, and changes the acceptance of many beliefs and ideas held by physicists beforehand.
Scientific experiment = controlled conditions + variable factor + experiment + observation + repetition + results
I've just given you results of some of my findings, enough for a genuinely interested person to see what's happening for themselves.
Conspiracy theories and conjectures will get one nowhere. All you have to do is ask for evidence before insinuating. :)Quote:
The other is 3301 or 3201; it looks like he ran the same run twice at the same frequency but recorded a different speed. Here is a plot of his SPi 1M time vs frequency
That's what you want to believe Jack, not what the evidence shows. I'm sorry, but you haven't proved how my finding ties in with your belief.Quote:
In short, the data shows nothing but a handful of mistakes, and that SPi 1M does indeed scale linearly with clock speed.
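If SuperPi 1M truly scales linearly with clock speed, then time is proportional to 1/frequency, and a quick residual check flags the points worth re-running. A minimal sketch, using made-up numbers rather than the thread's data:

```python
# Linear scaling with clock means t(f) ~= t_ref * f_ref / f.
# Points far off that prediction are candidate outliers to re-test.
# All numbers below are illustrative, not the thread's actual results.
f_ref, t_ref = 1800, 60.0  # hypothetical reference run (MHz, seconds)

measured = {2000: 54.1, 2400: 45.0, 3000: 36.2, 3300: 30.0}  # MHz -> s

for f, t in sorted(measured.items()):
    expected = t_ref * f_ref / f
    dev = (t - expected) / expected * 100  # % deviation from linear scaling
    flag = "  <-- outlier?" if abs(dev) > 5 else ""
    print(f"{f} MHz: measured {t:.1f}s, expected {expected:.1f}s ({dev:+.1f}%){flag}")
```

With these placeholder numbers, only the 3300 MHz point would be flagged, since it runs well ahead of the inverse-frequency trend.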
I was just looking at the forums at coolaler and saw this. This is K10 in SuperPi, B0 stepping I think. And from the picture it does Pi in 32 seconds at only 1800MHz. By the way, I'm new here.
Here is the link: http://forum.coolaler.com/showthread...161127&page=30.
I would say the opposite, but this is only my opinion....
For me it will be like running SuperPi with DDR2-667 4-4-4 and DDR2-800 4-4-4 on the same CPU.
With faster K10 cores you will get a differently clocked northbridge, which clocks the L3 cache. It will also be possible to change the cache frequency...
Don't take it as certain, this is just speculation :) .
....people didn't get the chance to notice my post there, so I will repost it here :) :
Quote:
AMD Phenom X4 can do 3GHz and above
http://my.ocworkbench.com/bbs/showth...200#post420200Quote:
Under normal air cooling, the AMD Phenom X4 can go beyond the 3GHz mark by overclocking. Although this can be done, there are some stability issues at such high speeds.
Currently, there aren't any options to turn off one or two of the cores. Running it in single channel memory helps to stabilise it.
Bluetooth is one of those guys who is credible, and he got one of the first GA RD790 mobos to test. So this is very good news coming from them!
Note he says air cooling is used and it's an X4, so all of the cores must work at the same clock, since something is wrong (BIOS?) and they can't use separate PLLs to clock the cores individually.
That's very good news....
With platform stability, they have a bit of time to work out issues, so I'm looking ahead to the Phenom launch :) .
Bear in mind all this is on pre-production silicon ;)
BTW K10 is designed to hit 4GHz......
This 'image' was created by someone who was claiming that the Coolaler K10 data was faked, by producing a 'see how easily I can fake something' post in this thread:
http://forums.vr-zone.com/showthread.php?t=181755
I can't read Chinese, but likely someone referenced this in discussions similar to the ones we are having here.
Yeah, that "SPi score" of 31 is a bad fake. Best we ignore that and concentrate on the OCWorkbench news.
What could have made them unable to individually clock the separate cores in the Phenom ES? Something in the ES chip itself, or a BIOS bug? Very good news about the beta silicon, mobos and BIOSes hitting >3GHz OCs with air cooling. This speaks a lot about the potential of the core, and at the same time it amazes me what AMD managed to do in such a short time from the first reports of clock problems (speedpath issues in the chip). They conquered a whole GHz in a couple of respins (source: DailyTech).
Still, we can expect a 125W spec for the first retail 3GHz version of the X4 (whenever they are out).
edit:
Latest info from AnandTech staff:
Quote:
Originally Posted by G.Key@AT
26 seconds in SuperPI 1m for one
that's a lot :)
coming from 39 secs to 13 secs would be great though :D
However, you did try to calculate a scaling factor, and you drew conclusions from a data set in which 4 of the 14 data points fall away from the expected trend. There are two possibilities: either those points represent something real, or you made a mistake (no harm admitting a mistake). It is poor analysis to see such weird behavior and publish it as 'this is the way it is' without vetting those anomalous data points through repetition and attention to detail, and not expect to get challenged on those anomalies.
But is that due to the CPU, or the person using the CPU? Don't take offense at this - I can show you anomalous results, of the very nature you go into below, where the conditions under which the tests were conducted were not fully understood.Quote:
K10 comes in where people state affirmatively that something they do not own, nor have seen perform as a released product, cannot react in a certain way, for whatever musings. I've seen erratic jumps in processor performance before, and I'm showing one of them right here.
This is the failure of your approach: when you do see such anomalous behavior, there is always a cause and effect. I will show you an example,Quote:
No typo, I clearly state these are experimental results. I predicted linear scaling, as most people would. But investigation says otherwise - something I cannot explain but do experience.
http://img339.imageshack.us/img339/9...usesp1mjv1.jpg
54.328 seconds
http://img504.imageshack.us/img504/2...usesp1mcu1.jpg
58.766 seconds
Looking at this one data point... man, what the heck... some variability... but wait... the difference is that in the first case I started the run and left it alone; in the second I started the run but moved my mouse around.... ahhhh, so is there something wrong with the CPU, is it not reproducible, or was there something wrong with the way I ran the bench?.... hmmmmmm.
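The gap between those two runs is easy to quantify, and it is far larger than SuperPi's normal run-to-run noise, which is the point: the test conditions, not the CPU, explain it.

```python
# Quantifying the run-to-run spread in the two screenshots above:
# the 'mouse moved' run is roughly 8% slower than the undisturbed one.
undisturbed = 54.328  # seconds, run started and left alone
disturbed = 58.766    # seconds, mouse moved around during the run

slowdown = (disturbed / undisturbed - 1) * 100
print(f"Slowdown from background activity: {slowdown:.1f}%")
```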
This is fine, it stands final, but it is still flawed.Quote:
Conditions are exactly the same for all but >3500MHz (unstable); in fact I'm sure they were run one after another on the same day, with minimum services/processes running in the background. One thing I clearly stated I didn't and couldn't do is keep the memory asynchronous. RAM timings/divider are kept the same. But the rest I'll show you, and I've repeated it time and time again from around 4 months back. All of the data stands final. ;)
Thanks for the screenshots; so my guess that you typo'ed was incorrect, no big deal... it was one plausible explanation for the outlier data, so now we should look for other explanations, perhaps more experiments.Quote:
1800MHz = http://img212.imageshack.us/img212/5...00spi1mlx8.jpg
2000MHz = http://img406.imageshack.us/img406/8...00spi1mus3.jpg
2100MHz = http://img510.imageshack.us/img510/6...00spi1mgc7.jpg
2200MHz = http://img205.imageshack.us/img205/1...00spi1mwn9.jpg
2300MHz = http://img250.imageshack.us/img250/8...00spi1mqk4.jpg
2500MHz = http://img508.imageshack.us/img508/7...00spi1mhm8.jpg
2600MHz = http://img209.imageshack.us/img209/3...00spi1mlb0.jpg
2800MHz = http://img510.imageshack.us/img510/2...00spi1map4.jpg
3000MHz = http://img515.imageshack.us/img515/1...00spi1mod1.jpg
3200MHz = http://img265.imageshack.us/img265/4...00spi1mnr5.jpg
3300MHz = http://img250.imageshack.us/img250/8...00spi1mzk4.jpg
3360MHz (3400 is the same) = http://img210.imageshack.us/img210/4379/spi48qg8.jpg
3500MHz = http://img250.imageshack.us/img250/2747/sp47dm3.jpg
You need to be careful making this conclusion, because your data set is inconsistent to begin with....Quote:
3500MHz is quickest; after that, there's hardly any change if the processor runs it (memory bottleneck). If I had a similar memory and system now, I'd repeat them again just to refresh, but I don't, and I didn't know this was coming so I could prepare - I just did it for my own personal investigation back then.
So you don't believe in science? Yes, science does broaden horizons, within the context of performing science correctly; here you did not.Quote:
BTW, science doesn't equal "we expect this, and only this can be true." Broken logic is to expect linear scaling and, when something else occurs, start the conspiracies. Science broadens your horizons to accept observational findings, like the new colossal area devoid of matter found in space, even devoid of dark energy, which was NEVER predicted nor expected at those sizes, and changes the acceptance of many beliefs and ideas held by physicists beforehand.
You missed quite a bit.... you failed to revise your hypothesis or repeat your experiment. I would really like to see you repeat this data set again, but I don't expect you to expend the time, so I will do it myself - I have a Prescott I will rebuild.Quote:
Scientific experiment = controlled conditions + variable factor + hypothesize + experiment + observation + repetition + results + revise + repeat + conclude
I am not trying to prove anything; I am challenging your data set, where 4 of your 14 points fall outside the expected trend..... in short, you should do a better job of rationalizing the outlying data before drawing conclusions and publishing results. Some may look at this and be... meh.. ok. But there could be someone who looks at this and thinks 'man, this is counter to all the existing data, there must be something wrong' -- which is what I did.Quote:
That's what you want to believe Jack, not what the evidence shows. I'm sorry, but you haven't proved how my finding ties in with your belief.
Clock cycle latency remains the same..... though you understand the concept, I will state it anyway with an example:
12 cycles of L2 latency at 2 GHz gives 6 ns latency in time.
12 cycles of L2 latency at 3 GHz gives 4 ns latency in time.
The total time to propagate a signal through the chip has a hard lower limit, hence as you decrease the clock period at a fixed cycle latency -- the wall will be hit and no more clocks for you :)
It seems a lot of people have a hard time understanding the digital tick of a clock, the time period for that tick, and how that translates into scaling. :)
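The cycle-to-time conversion in the L2 example above can be sketched directly: the cycle count is fixed by the design, so the wall-clock latency shrinks as the clock rises.

```python
# Converting a fixed cycle-count latency into wall-clock time:
# at 1 GHz, one cycle takes exactly 1 ns, so ns = cycles / GHz.
def latency_ns(cycles, freq_ghz):
    """Latency in nanoseconds for a fixed cycle count at a given clock."""
    return cycles / freq_ghz

print(latency_ns(12, 2.0))  # 12 cycles of L2 latency at 2 GHz -> 6.0 ns
print(latency_ns(12, 3.0))  # 12 cycles of L2 latency at 3 GHz -> 4.0 ns
```

This is also why the observed (time-domain) latency improves with clock speed even though the cycle latency never changes.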
For the record, a 2 GHz Conroe gets 26 seconds...
http://img70.imageshack.us/img70/391...0102000mr1.jpg
No, the dude said he saw a 26s decrease (from whatever the score was before).. Not that it scored 26s (it could be more or less for all we know..)
But I doubt the original score of that whatever stepping (of the B01 revision) was worse than rev F K8 in SPi 1M. What does a 2GHz K8 with DDR2-667 CAS5 get in SPi 1M? Around 42s, right? If the ES was somewhat worse than that, then the "better" score could very well be very good. And the dude said they saw week-to-week improvements with both new chips and new boards. So who knows what floats around there in the wild.
Link for that quote:
http://forums.anandtech.com/messagev...VIEWTMP=Linear
Hmm, is English your first language? (not bashing, just asking :) )
Since you can read it for yourself one more time (pay attention to the bold and blue):
I "painted" the important parts in blue to make it clearer :)Quote:
there is a significant difference in performance in all areas (26 seconds in SuperPI 1m for one)
LOL...
This is NOT a significant difference ...
This is a HUGE difference, I can say ....:rofl:
PHENOM @ 3GHZ+ on AIR :clap:
Didn't I say not to worry about AMD?
This is gonna be insane :up:
I fully trust Gary Key. Long-time bencher who has always been credible. Sounds like K10 will be a very good chip...
Ply
I don't know if this is new info or old, so take a look
http://www.hardwaresecrets.com/article/480
I read this statement as SuperPi 1M being done in 26 seconds. After re-reading it, I am still not sure if Gary is speaking about a difference of 26 seconds or not.Quote:
there is a significant difference in performance in all areas (26 seconds in SuperPI 1m for one)
The statement can be 'read' both ways; only Gary can tell us exactly what he meant, and I am pretty sure it may have been worded the way it was so as not to circumvent any 'agreements'.
LOL
:)
No, not necessarily; the improvement from the "B00 chip from May to a B02" is probably due to some important parts being turned off in the B00 (maybe L3 or even L2) - the fact that *some parts were turned off* was already stated somewhere - thus it tells us nothing about the retail Barcelona.
And most agree that Coolaler's results show a B01 that is going to launch on Sept. 10; however, we don't know whether the results are fake.
But don't forget that Gary also said this about the results:
6 more days until we find out :)Quote:
Actually, based on the last chip we had, those numbers are in alignment with some of the results we noticed and others as well. I think they will be better than that at release and especially on a consumer board with Vista 64-bit
13s in SPi at 2GHz.. yeah right =)
Some people need a reality check.
Who said it was 13s??? It is said the decrease from B00 to B02 was that.. From what we know (nothing, to be exact :) ), B00 could have scored in a range of 40 to 50s in Pi 1M :)
There's no way the SPi time is 13s @ 2GHz. If that were true, AMD would already have shown it. The best we can hope for is that the 26s @ 2GHz is true.
I think you've got one comin' in 6 days, buddy :up:
Why would AMD show it? They've never released any bench scores early in the past.... they're confident in themselves and their abilities and don't need to lure in buyers with leaked benchmarks ahead of time.
Only show what you're good at, nothing else.
That's how you get money for your company..
(and fool a few fanboys)
Smart ass is right ... AMD is a bit fishy ..... they show SpecFP ... but not SuperPi? wtf ...
There's no doubt that Conroe is superb @ SuperPi.
I just hope that the new K10 can at least come close. Somebody has to keep em all honest.
BRUNO
Bad point ... why the hell would anyone use SpecFP? XD
(*poke* why not int as well ...)
GG, AMD Fanboy ...
Why not put a video of every type of benchmark, AMD?
(self-proclaimed good point .... actually, it has to be, because everyone wants all the benchmarks of K10)