K10 Scores starting to surface

Printable View

Show 100 post(s) from this thread on one page

09-01-2007, 01:37 PM
JumpingJack

Quote:

Originally Posted by onewingedangel

Why can't everyone just wait the week or two left? Then a lot of arguments can be settled...

Nahhhh, we would get bored.
09-01-2007, 01:37 PM
BeardyMan

Quote:

Originally Posted by informal

I think you quoted the wrong man,since i did gave the link for that patent...

I know, you may slap me for that :D
09-01-2007, 01:43 PM
JumpingJack

Quote:

Originally Posted by mstp2009

Thank you Jack.

So absolutely latency - the transfer from FIFO buffer to L3 is constant, it would just be the fill time of the FIFO buffer that is variable b/c it has async communication with each of the 4 cores (unless they are at full speed, one would assume).

The observed latency would be:

Core to FIFO Buf latency + FIFO Buf to L3 latency

Correct?

Yeah, and I cannot believe I was so ignorant not to think this through. The L3 cache is shared, but each core is throttled independently. The overall intrinsic latency of the L3 will be like any cache, it is fixed by the size and quality of the process technology as well as the speed paths set at design.

However, since each core will throttle depending on load, the clock for each core can be different than L3... necessitating an asynchronous bus such that each core can still access the data....

Now, simple asynchronous communications will always have variable latency (as a function of the ratio or divider) as one agent will need to wait on the other at some point. Example, say you have a divider 6:5 let's call that agent a and b, so it is 6:5 A:B, to make this easy lets say 1 bit line so in 5 clock tickes it will send 5 bits for agent B, but agent A has put 6 clock ticks into the queue, one cycle will be left hanging until the next revolution around.... temporally this whould make no difference3, but agent A is only as fast as agent B.....

But you also have to add into the mix the physical latency of the circuit to do this work.... it is a trade off, one that AMD obviously believes is better in the long run... so long as L3 'observed' latency is much less than that to main memory, there is a benefit.
09-01-2007, 01:49 PM
informal

Quote:

However, since each core will throttle depending on load, the clock for each core will be different than L3... necessitating an asynchronous bus such that each core can still access the data....

Now, simply asynchronous communications will always have variable latency as one agent will need to wait on the other at some point. Example, say you have a divider 6:5 let's call that agent a and b, so it is 6:5 A:B, to make this easy lets say 1 bit line so in 5 clock tickes it will send 5 bits for agent B, but agent A has put 6 clock ticks into the queue, one cycle will be left hanging until the next revolution around.... temporally this whould make no difference3, but agent A is only as fast as agent B.....

Yes Jack,this is a simplified reason why it occurs.
I thought you saw Kanter's article long ago(since it is online since the middle of May i think)
09-01-2007, 01:51 PM
JumpingJack

Quote:

Originally Posted by informal

Yes Jack,this is a simplified reason why it occurs.
I thought you saw Kanter's article long ago(since it is online since the middle of May i think)

I did, and I read it to... it just did not register enough to turn on lights as this topic has evolved.
09-01-2007, 02:05 PM
Lightman

Quote:

Originally Posted by JumpingJack

Informal, becareful and read Kanter's note carefully, this is beginning to focus up .... it was a good link... but you may be misinterpreting what Kanter is saying....

I am, myself, trying to understand at a detail that makes sense, this is much more complicated than what we are assuming....

Each core (that is execution core and the dedicated cache) will be clocked independently, a major power saving feature of K10. So in order to share a cache at L3 level, it will need to send data asynchronously to differently clocked cores... wow, this is complicated.... so what AMD has done (per Kanter) is build a 'translator', or a FIFO buffer to send data to and from the L3 -- this is not the same as dynamically adjusting L3 clock or latency, what it is doing is dynamically adjusting a clock divider to synch L3 with variable speed cores, now this variable L3 latency makes much much more sense.

Any asynchronous communication will incur extra latency (over a simple 1:1) simply as a result of clock mismatch ... this is a given ... (this is why C2D shows a dip in performance in DDR2-533 to DDR2-667 to DDR2-800 as dividers beyond 1:1 introduce extra latency).

So with this understanding, the observed latency (which is actually the important part) will be variable, not because L3 cache has variable latency but because it has to be sychronized through the FIFO buffers to cores of variable clocks.

Damn should have paid more attention to Kanter's article too....

Guys I am learning a lot here... thanks.

Jack

This is not entirely true...

I read somewhere long time ago that L3 in K10 is acting more like memory layer. In other words it is clocked by IMC independently from all 4 cores and on diagram I would put it after CrossBar...
That's why L3 latency can vary from core point of view (cache latency itself is probably constant). It is similar to how DDR2-800 latency (again from CPU point of view) is different compared to DDR2-667 (same timings of course:) ).

Edit: JumpingJack you typing too fast :) I barely read page 16 and typed my response and here surprise! another page with new info making my post partially obsolete :)
09-01-2007, 03:21 PM
JumpingJack

Quote:

Originally Posted by Lightman

This is not entirely true...

I read somewhere long time ago that L3 in K10 is acting more like memory layer. In other words it is clocked by IMC independently from all 4 cores and on diagram I would put it after CrossBar...
That's why L3 latency can vary from core point of view (cache latency itself is probably constant). It is similar to how DDR2-800 latency (again from CPU point of view) is different compared to DDR2-667 (same timings of course:) ).

Edit: JumpingJack you typing too fast :) I barely read page 16 and typed my response and here surprise! another page with new info making my post partially obsolete :)

I am not sure I undestand if you understand what I am trying to say :) ...

A shared resource clocked at one speed to 4 other resources clocked at different speeds will necessitate asyncronous communications... there is no other way... thus AMD must provide functionality to account for floating clocks between 4 cores to one memory pool, L3.... just adding circuits to do this work will incur latency...

Add on top of that, 1:1 divide latency < 3:2 divider latency < 2:1 divider latnecy... hence the 'observed' latency from any core is variable...... at least if you read Kanter's article this is what the FIFO buffers do... he did not mention the x-bar.

There is research ongoing to work on achieving both low BW and low latency asynchronous networking, but there has always been this fundamental trade-off:

Quote:

Previously published NoCs which provide GS are ÆTHEREAL [18][9] and NOSTRUM [14]. Both are synchronous and employ variants of time division multiplexing (TDM) for providing per connection bandwidth (BW) guarantees. TDM has the drawback of the connection latency being inversely proportional to the BW, thus connections with low BW and low latency requirements, e.g. interrupts, are not supported.

http://www.ee.technion.ac.il/courses...OC-async05.pdf

Not quite the paper I would use, but the one I could find recently written that summarized the issue at hand that I could quote as a source and not have you take my word for it .... i.e. connection latency is hard to get very low in networks where a globalized clock is not real.... here he discusses time division multiplexing, a type of clock dividing.

Edit: Found another paper which is much more detailed, and has some info on the FIFO implementation over a global clock:

Quote:

Simulation results for the FIFO and the two versions of the adder are given in Table 1. The
optimized adder has 2-input c-elernents while the other adder is using 4-input C-elements.
The operations/second indicate the number of logic evaluations done pcr second in each
basic cell. Cycle time is the fastest time at which the pipeline cm send out successive data
values. Latency is the time it takes for data to go from the input of the circuit until it is
finally ready at the output. Pipelined systems work on the principle of reducing the cycle
time at the cost of increased latency. The next section examines how an enhancement to
the system cm reduce the latency even further.

http://www.collectionscanada.ca/obj/...11/MQ34126.pdf
(see page 73). This is an old paper, but he is showing 18 ns latency for a straight up FIFO buffer. This is a large number, and not to be considered true or accurate wrt K10.

Jack
09-01-2007, 03:30 PM
Lightman

Quote:

Originally Posted by JumpingJack

I am not sure I undestand if you understand what I am trying to say :) ...

A shared resource clocked at one speed to 4 other resources clocked at different speeds will necessitate asyncronous communications... there is no other way... thus AMD must provide functionality to account for floating clocks between 4 cores to one memory pool, L3.... just adding circuits to do this work will incur latency...

Add on top of that, 1:1 divide latency < 3:2 divider latency < 2:1 divider latnecy... hence the 'observed' latency from any core is variable...... at least if you read Kanter's article this is what the FIFO buffers do... he did not mention the x-bar.

Jack

I understand what your trying to say that's why I put edit.
As I said I read long time ago (probably on RWT but not coming from DK) that L3 will be operating in similar way to normal memory and will be possible to clock it independently form cores.
If I'm following your understanding correctly, your saying that L3 will be clocked from highest frequency core in CPU (2GHz K10-->2GHz L3) which in my opinion is not the case.

Of course asynchronous clocking will add latency but it might be a good trade off compared to gains in power/flexibility. (besides look at L3 latency numbers, they are high for a CPU cache so clearly we have lots of logic circuity in between)

Well, in the end we will find out shortly :)

Edit: I'm just thinking why would AMD release different Phenom models with differently clocked HTT bus (from official roadmaps)?? The answer can be that together with increased HTT speed L3 cache is also clocked higher (and IMC) and that gives some tangible performance improvements.
09-01-2007, 03:58 PM
JumpingJack

Quote:

Originally Posted by Lightman

I understand what your trying to say that's why I put edit.
As I said I read long time ago (probably on RWT but not coming from DK) that L3 will be operating in similar way to normal memory and will be possible to clock it independently form cores.
If I'm following your understanding correctly, your saying that L3 will be clocked from highest frequency core in CPU (2GHz K10-->2GHz L3) which in my opinion is not the case.

Of course asynchronous clocking will add latency but it might be a good trade off compared to gains in power/flexibility. (besides look at L3 latency numbers, they are high for a CPU cache so clearly we have lots of logic circuity in between)

Well, in the end we will find out shortly :)

Edit: I'm just thinking why would AMD release different Phenom models with differently clocked HTT bus (from official roadmaps)?? The answer can be that together with increased HTT speed L3 cache is also clocked higher (and IMC) and that gives some tangible performance improvements.

;) I don't know how the L3 will be clocked it will however need one clock and as Informal and others push the detail envelop, I am beginning to understand some of the L3 details that I had otherwise not really considered.

Your edit could be correct too....

AMD has had quite a bit of experience getting the best clock/latency performance out of different clocked agents, the IMC is a good example as it the HT links all of which time on clocks different than the core but put data into the core....

It is both interesting but irrelevant, performance will be what it performs at overall.... and we are hoping it is better than the showing that started this thread.
09-01-2007, 04:04 PM
leoftw

here are my results from my opteron 2218 , if you guys want me to run any test on my quad to compare to the phenom let me know.

http://i43.photobucket.com/albums/e3...NEBENCHR10.jpg
09-01-2007, 04:26 PM
JumpingJack

Quote:

Originally Posted by leoftw

here are my results from my opteron 2218 , if you guys want me to run any test on my quad to compare to the phenom let me know.

Do you have a way to run 64-bit?
09-01-2007, 04:29 PM
JumpingJack

Quote:

Originally Posted by Lightman

I understand what your trying to say that's why I put edit.
As I said I read long time ago (probably on RWT but not coming from DK) that L3 will be operating in similar way to normal memory and will be possible to clock it independently form cores.
If I'm following your understanding correctly, your saying that L3 will be clocked from highest frequency core in CPU (2GHz K10-->2GHz L3) which in my opinion is not the case.

Of course asynchronous clocking will add latency but it might be a good trade off compared to gains in power/flexibility. (besides look at L3 latency numbers, they are high for a CPU cache so clearly we have lots of logic circuity in between)

Well, in the end we will find out shortly :)

Edit: I'm just thinking why would AMD release different Phenom models with differently clocked HTT bus (from official roadmaps)?? The answer can be that together with increased HTT speed L3 cache is also clocked higher (and IMC) and that gives some tangible performance improvements.

No problem ... when I get into detailed discussions like this, I tend to be verbose ... being a public forum, a number of people read what we write and, because it is a forum, I post a lot of references and quotes... don't take that as an afront to your knowledge base .... what I do try to do is provide ample detail so others, who may not completely follow, gain some level of understanding... (it also helps me learn more as I go along)

Jack
09-01-2007, 05:15 PM
leoftw

Quote:

Originally Posted by JumpingJack

Do you have a way to run 64-bit?

I have to install 64bit for that , sorry . I'm having enough problems with these damn 32bit vista drivers :P
09-01-2007, 06:02 PM
JumpingJack

Quote:

Originally Posted by leoftw

I have to install 64bit for that , sorry . I'm having enough problems with these damn 32bit vista drivers :P

No biggy....

From 32 to 64 bit comparisions, for version 9.5 anyway, the K8 arch can improve Cinbench 10-15% in my recollection, so comparing 64 to 64 is a better feel for the K10 v K8 ....
09-01-2007, 06:12 PM
leoftw

sorry about that guys
09-01-2007, 10:02 PM
JumpingJack

[QUOTE=bobjr;2406885]<censored>[QUOTE]

:) This is a good way to earn a ban. he... you edited it :)
09-01-2007, 10:06 PM
Movieman

[QUOTE=JumpingJack;2406887][QUOTE=bobjr;2406885]<censored>

Quote:

:) This is a good way to earn a ban. he... you edited it :)

Yea, it is, and I'm probably one of the easier going Mods here.
I think he and I will have a little talk..:D
09-01-2007, 11:31 PM
Lightman

Quote:

Originally Posted by JumpingJack

No problem ... when I get into detailed discussions like this, I tend to be verbose ... being a public forum, a number of people read what we write and, because it is a forum, I post a lot of references and quotes... don't take that as an afront to your knowledge base .... what I do try to do is provide ample detail so others, who may not completely follow, gain some level of understanding... (it also helps me learn more as I go along)

Jack

:up:

@leoftw
How your system is configured memory wise. Do you have DIMMs plugged for both CPU sockets?

Can you run SuperPi? It is not x64 optimized so scores will be very comparable with your system. Same goes for CPU-Z cache latency test.
Thanks for your effort! :up:

EDIT: I just noticed over 4x speedup in muliCPU test! Why is that?? have you done 1-CPU test at lower clocks???
09-02-2007, 12:56 AM
mAJORD

Excellent stuff leoftw. That's exactly what we need for a compo.

any chance you run the other isngle thread benchmarks we saw??

Super pi 1M
CPUmark99
09-02-2007, 04:44 AM
doompc

Quote:

Originally Posted by Lightman

EDIT: I just noticed over 4x speedup in muliCPU test! Why is that?? have you done 1-CPU test at lower clocks???

Probably since 1 core at full load is little total cpu load (25%) Cool 'n Quiet keeps all cores at lower clocks...

leoftw, try turning Cool 'n Quiet off.
09-02-2007, 05:23 AM
VVJ

informal

Quote:

Go laugh on the floor at yourself.
http://www.techarp.com/showarticle.a...tno=424&pgno=2
This cache is 32-way set associative and is based on a non-inclusive victim cache architecture.

http://cbid.at.tut.by/work/L3Assoc.gif
This is from BIOS and Kernel Developer's Guide for AMD Family 10h Processors documentation.
09-02-2007, 08:32 AM
JumpingJack

Quote:

Originally Posted by VVJ

informal

http://cbid.at.tut.by/work/L3Assoc.gif
This is from BIOS and Kernel Developer's Guide for AMD Family 10h Processors documentation.

This is not surprising.... the associativity of cache relates to the number of cache lines (or memory blocks) fixed to that set. The associativity will increase with size of the cache as each set is fixed with respect to the number of blocks allocated to them, and since AMD will ultimately up or lower L3 cache the number set associativity must change.

For example, AMD as a 2 meg-32 way associativity, a 4 meg would be 64 set associative, a 6 meg would be 96 way associative. Since they are allowing an associaitivty of 16 in their BIOS guide, then it appears AMD is at some point may be willing to release a 1 meg L3 cache chip (perhaps, because it is there does not mean there are plans).

Intel's associativity for wolfdale will be 24 way associative for 6 meg but 12 way associaitve for 3 meg. Intel has not changed their caching for Wolfdale over conroe other than raw size because their 2 Meg Allendale is 8 way associative, while Conroe 4 Meg is 16 way associative.

Jack
09-02-2007, 09:18 AM
fellix_bg

I wonder, will the L3 cache in K10 act also as a snoop filter in multiprocessor systems?
09-02-2007, 09:57 AM
VVJ

JumpingJack, a shared L3 cache is a configurable part of the Northbridge. It may not include the L3 cache as well.
09-02-2007, 10:29 AM
Lightman

Quote:

Originally Posted by VVJ

informal

http://cbid.at.tut.by/work/L3Assoc.gif
This is from BIOS and Kernel Developer's Guide for AMD Family 10h Processors documentation.

K10 design can support up to 8MB L3 cache then...
Interesting :) !
09-02-2007, 11:55 AM
JumpingJack

Quote:

Originally Posted by VVJ

JumpingJack, a shared L3 cache is a configurable part of the Northbridge. It may not include the L3 cache as well.

Hmmmmm... this is the first time I have heard of cache associated with the northbridge....
09-02-2007, 12:49 PM
Lightman

Quote:

Originally Posted by JumpingJack

Hmmmmm... this is the first time I have heard of cache associated with the northbridge....

Socket 7 and earlier motherboards had caches L2 (or L3 if you had K6-III) associated to northbridge and these caches was clocked by FSB :) . That was not that long ago :p: ....
09-02-2007, 12:59 PM
JumpingJack

Quote:

Originally Posted by Lightman

Socket 7 and earlier motherboards had caches L2 (or L3 if you had K6-III) associated to northbridge and these caches was clocked by FSB :) . That was not that long ago :p: ....

You may be thinking of the time when L2 was off die, in which case it was loaded off the northbridge... through the backside bus then later on the frontside bus. In fact, the term northbridge is historic, in that in diagrams, this chip (with the memory controller) and the L2 were north of the CPU... :)

Later, the northbridge moved down below the CPU between the CPU and southbridge, but the L2 was still distinct.

Here is an example, but the diagram is after the north
http://www.via.com.tw/en/products/ap.../blockmvp4.gif

Here is another example:
http://www.via.com.tw/en/products/ap.../blockmvp3.gif

Of course, I could be wrong -- I am going from memory mostly. Hang tight, I will go look up the 'evolution of the northbridge' paper that shows the history, if I can find it.

EDIT: I knew there was a configuration for off-die L2 cache through the backside bus.... still cannot find the paper, but found the block diagram:
http://www.karbosguide.com/images/_975.gif
09-02-2007, 01:43 PM
Lightman

Yes Jack, on K7 L2 was driven from BackSide Bus :up: (cache was 2/3 or 3/5 of the core clock and packed on Slot A module), but earlier I'm not sure TBH!

It was my personal experience which lead me to believe that L2 cache on motherboards was connected to Northbridge because performance of CPU varied considerably from motherboard you used... (I'm speaking solely about cache intensive tests).
Besides if L2 cache on older motherboards was driven by CPU backside bus then how on earth very old P60 could known about 2MB cache on my Epox board??

Edit: I found something!

Quote:

M1541 includes the higher CPU bus frequency (up to 100 MHz) interface for all Socket-7 compatible processors, PBSRAM and Memory Cache L2 controller to reduce cost and enhance performance, high performance FPM/EDO/SDRAM DRAM controller, PCI 2.1 compliant bus interface, smart deep buffer design for CPU-to-DRAM, CPU-to-PCI, and PCI-to-DRAM to achieve the best system performance. It also has the highly efficient PCI fair arbiter. M1541 also provides the most flexible 64-bit memory bus interface for the best DRAM upgrade-ability and ECC/Parity design to enhance the system reliability.

Link
09-02-2007, 02:11 PM
JumpingJack

Quote:

Originally Posted by Lightman

Yes Jack, on K7 L2 was driven from BackSide Bus :up: (cache was 2/3 or 3/5 of the core clock and packed on Slot A module), but earlier I'm not sure TBH!

It was my personal experience with lead me to believe that L2 cache on motherboards was connected to Northbridge because performance of CPU varied considerably from motherboard you used... (I'm speaking solely about cache intensive tests).
Besides if L2 cache on older motherboards was driven by CPU backside bus then how on earth very old P60 could known about 2MB cache on my Epox board??

Edit: I found something!

Link

Nice.
09-02-2007, 02:20 PM
nn_step

a nice picture reference for mobos http://redhill.net.au/b/b-92.html
but it ends at 2002
09-02-2007, 03:23 PM
doompc

At Barcelona's presentation it was said that the L3 cache runs at Core clock frequency, not at northbridges' (wich runs a little lower. i.e: on 2GHz Barcelona northbridge runs at 1.6GHz on a uniplane board, and at 1.8GHz on a dual plane board)

But the L3 cache beeing a part of the northbridge does make sense, by the latency numbers we've seen and all the async stuff...
09-02-2007, 05:33 PM
KTE

Let's put it in simple words as guys speaking techno will lose most other people and that way no one learns or benefits accurately, which is the whole purpose of sharing and explaining. :)

I mentioned about the L3 cache many pages back. AMD outlined it at "less than 38 cycles". 38 cycles is it's slowest latency. L3 cache is just a victim cache. It is very good for improving latency of data transfer to the core which needs it.

Usually, the data (retained in cache lines) evicted from L2 cache, has to be thrown back to memory (RAM). If needed again, it has to be re-fetched. This distance->latency is very large in comparison to on-die distance->latency.

Thus, by including "another" on-die cache but this time to retain L2 evicted datum in, they can refill the L2 or mostly, send it directly to L1D cache very quickly and any core can access the data (shared cache) far quicker than when going to memory.

The inclusion of this benefits AMD K10 very much similar to how it does for Core 2/Penryn with the large L2 cache, and that's why it will be increased proportionally with time as die size decreases in K10.

IIRC L3 populates itself from L1D cache, L2 cache and also memory based on prediction algorithms too.

K8 is not K10, so don't be using it's architectural features to undermine or imply those to the K10 - just because one doesn't know any accurate information or any better.

L2 and L3 cache use 64B lines. L3 cache controller is variable and flexible to support 8MB.

Also "Size and associativity of the AMD Family 10h processor L2 cache is implementation dependent. See the appropriate BIOS and Kernel Developer’s Guide for details."

L3 cache in not inclusive nor orthodoxly exclusive but tweaked - cache line can be fetched into the L1D and still be retained in L3 for core sharing based on sharing history which is tracked.

L3 doesn't evict using the same LRU algorithm the K8 used for L2, but evicts lines based on the LRU unshared cache line.

Northbridge and core frequency somewhat determines L3 cache relative latency.

L3 cache and IMC (northbridge) runs at independent speeds and voltages from all cores.

"Furthermore, the cache features bandwidth-adaptive policies that optimize latency when requested bandwidth is low, but allows scaling to higher aggregate L3 bandwidth when required (such as in a multi-core environment)." http://www.amd.com/us-en/assets/cont...docs/40546.pdf

"L3 cache latency will evidently be higher than L2 cache latency. However, AMD materials suggest that the latency will vary adaptively depending on the workload. If the workload isn’t too heavy, latency will improve, and under heavy workload the bandwidth will rise. We still have to check what really stands behind this." http://www.xbitlabs.com/articles/cpu...amd-k10_8.html

You already know about HT and it's implementational advantages, exactly as why Intel is developing CSI for high bandwidth, and some of their CSI design team left over this: http://www.realworldtech.com/page.cf...05-05-2006#361

HT link bandwidth is dependent on the controller implemented and increasing this becomes a real power hungry system, which is why processor architects will withhold to do so but in limited very high-end and future shrunk node levels.

In all honesty, Barcelona front end and out-of-order engine is a very complex architecture.

Each clock cycle fetches 32B of data from the L1 instruction cache into the predecode/pick buffer - K8 does 16B - Core 2 does 16B.

The data travels on a bi-directional bus which is 256-bit wide - K8 has it at 128-bit - Core 2 is at 128-bit.

The predecode and pick buffer maybe increased to 48B.

Direct branch prediction is improved > the global history register tracks the last 12 entries - K8 one tracked 8.

New indirect branch prediction of 512-entries.

The FIFO stack including the return addresses of function calls 3, 11 and 15 is 24 entries - K8 pushed 12 out.

Core 2 uses a single execution cluster with a unified scheduler and reservation stations across multiple ports, like in Athlons - K10 has a split integer and floating point cluster with distributed scheduler and reservation stations.

IDIV instructions are variable latency and not a fixed iteration as in K8 previously. This K10 32 bit divide latency is roughly 10 cycles faster than in K8.

Third ALU is now for LZCOUNT/POPCOUNT.

ALUs are most optimized for power efficiency rather than peak performance ATM.

Out-of-order memory access prevents operation stalls seen in K8 (especially with load).

In K10, each core has 8 prefetchers which fetch data into the L1D cache whereas in K8 they were prefetched to the L2 cache.

That's some of it's better features over the K8 (let alone the DRAM controllers) which make it by no way equal. According to what I see, Barcelona quintessence is the Budapest core at higher clock speeds.

Now to the argument of efficiency (a) can't be and can be correlated with frequency (b) in a format we don't expect - all based on what Gary said. Gary commented on what he saw, simply. Processors have an upper limit whereafter more frequency=little performance gain, a mid optimum band where more frequency=inclining gradient performance, a lower band whereby the frequencies=sub-optimal performance.

I will provide you an example of an older CPU I have with some SPI mod 1.5XS 1M tests I ran 1-3 months or so back (P4 Celeron D DO S478, Malaysia week 21 of '04). Memory is synchronous 512MB.

Clock (MHz) - Time (sec) - Scaling
1802MHz = 116.547sec = 100%
1997MHz = 104.110sec = 112%
2104MHz = 97.480sec = 119.6%
2205MHz = 73.455sec = 158.67%
2302MHz = 88.307sec = 132%
2402MHz = 85.009sec = 137%
2504MHz = 80.486sec = 144.8%
2604MHz = 77.161sec = 151%
2701MHz = 73.541sec = 158.47%
2798MHz = 70.431sec = 165.5%
2999MHz = 65.044sec = 179%
3201MHz = 60.396sec = 193%
3301MHz = 60.036sec = 194%
3360MHz = 48.460sec = 240.5%
3503MHz = 47.849sec = 243.6%

See what happened from 1800-2300 and from 3300-3360? Now let's place K10 in there assuming relative scaling like I witnessed with my P4 (not compared to P4 though, but K8). What would happen as a result?

Base comparison: (P4 1800) K8 1900 = 100%

(P4 2300) K10 1900 = 132%
(P4 2400) K10 2000 = 137%
(P4 2500) K10 2100 = 145%
(P4 2600) K10 2200 = 151%
(P4 2700) K10 2300 = 158%
(P4 2800) K10 2400 = 166%
(P4 2900) K10 2500 = 171%
(P4 3000) K10 2600 = 179%
(P4 3100) K10 2700 = 186%
(P4 3200) K10 2800 = 193%
(P4 3300) K10 2900 = 194%
(P4 3400) K10 3000 = 240% **PEAK MOST STABLE PERFORMANCE**
(P4 3500) K10 3100 = 243%
(P4 3600) K10 3200 = 244%
(P4 3700) K10 3300 = NA

This scenario I've witnessed and is based on real life processor performance, so it's equally possible. By AMD and others touting 3GHz Phenom, again and again, I would expect this is loosely what we are to expect.

Here's one Phenom MB BTW: http://img170.imageshack.us/img170/2...63c0cf7kk7.jpg
09-02-2007, 05:39 PM
adamsleath

:hehe: nice figures....but the equating of k10 to real figures is....purely hypothetical.

how is the k8=100% (@1900) and k10=132% (@1900) derived? - does this assume a 32% gain for k10 over k8 baseline?

and does it also assume identical scaling for k10 versus an old p4?
09-02-2007, 06:29 PM
KTE

Quote:

Originally Posted by adamsleath

:hehe: nice figures....but the equating of k10 to real figures is....purely hypothetical.

:D It can get far worse believe me. :)

One of the points I wanted to show is how it clearly and perilously casts doubts on this http://img.coolaler.com.tw/images/zm...mlwemyzmzk.jpg compared to this http://img527.imageshack.us/img527/6582/sp47cj7.jpg

Bro, that was 1x 512MB RAM not even dual channel and an old CD. A P4 will get 10 seconds lower at least, beating the Barcelona quad core time they showed.

Quote:

how is the k8=100% (@1900) and k10=132% (@1900) derived? - does this assume a 32% gain for k10 over k8 baseline?

Yes, the K8 1900 was baseline compared to a K10 1900 - as an assumption for better clock per clock performance. It could be far lower, obviously, but this is a vague explanatory comparison of what Gary stated rather than a "prediction".

What I was saying is, Gary of Anandtech could be basing his statements on similar performance scaling he saw with the K10, as I did with the Celeron chip.

Quote:

and does it also assume identical scaling for k10 versus an old p4?

Yes. It's not impossible is what I'm saying. Look at how the numbers fluctuate with the Celeron and where. Pure technical math cannot account for this, so we won't be able to explain it, but we'll experience it. It's possible that K10 at lower clocks does not scale as well as some higher clocks. I've just shown you in one application how my old Celeron did it, which means it's entirely possible.
09-02-2007, 07:11 PM
JumpingJack

Quote:

Originally Posted by adamsleath

:hehe: nice figures....but the equating of k10 to real figures is....purely hypothetical.

how is the k8=100% (@1900) and k10=132% (@1900) derived? - does this assume a 32% gain for k10 over k8 baseline?

and does it also assume identical scaling for k10 versus an old p4?

Don't believe his data, there are two typo's and he misinterpreted it.
09-02-2007, 07:44 PM
adamsleath

...but it is a real scaling example...ie p4.
09-02-2007, 08:22 PM
JumpingJack

Quote:

Originally Posted by adamsleath

...but it is a real scaling example...ie p4.

He is using to argue non-linear scaling, but he misinterprets the data..

First, there are two typos, or atleast I think they are typos, the first is at 2205 Mhz, he says 73.455 but I think he meant 93.455, this is in line with what would be expected. The other is 3301 or 3201, I looks like he ran the same run twice at the same frequency but recorded a different speed. Here is a plot of his SP1M time vs frequency

http://img470.imageshack.us/img470/2...ata1zf7.th.jpg

If you correct his typo (heck sometimes my 9's look like's 7), then it behaves more as expected. For example:
http://img374.imageshack.us/img374/4...ata2qw1.th.jpg

The typo's or inconsistent data points are not what really matters, what does matter is that he normalizes to the slowest time, and converts to a percentage... he makes the most common mistake when one analyzes time to complete rather than rate ... SP1M is measured in time it takes to complete the task, but processor speed is measured in frequency ... he cannot calculate scaling factors when the units of one dimension is the inverse of the other.... he should have taken 1/time and plotted against fequency to check linearity, when you do that, it becomes completely linear:
http://img470.imageshack.us/img470/4...ata3fu9.th.jpg

In short, he data shows nothing but a handful of mistakes and that indeed SP1M scales linearly with clock speed.

Jack
09-02-2007, 08:25 PM
adamsleath

http://img470.imageshack.us/img470/4914/ktedata3fu9.jpg
jumpingjack's scaling graph.
the non-linear data does look a bit fishy.
09-02-2007, 08:43 PM
JumpingJack

Quote:

Originally Posted by adamsleath

the non-linear data does look a bit fishy.

Yeah, every test I have ever performed on SP1M shows linear scaling with CPU clock... though I have not tested on a P4, I have tested on a K8 and C2D.
09-02-2007, 08:57 PM
STEvil

Quote:

he cannot calculate scaling factors when the units of one dimension is the inverse of the other.... he should have taken 1/time and plotted against fequency to check linearity, when you do that, it becomes completely linear:

Is there a published scientific paper on that?

Thats reminiscent of people pricing things at $24.99 instead of $25 to make people see the smaller number. Its not hard to make your numbers look good when graphing, same things goes for finding percentages.

As to normalizing to a slower time we have to use stock as the baseline else you are going on prediction only. Using percentages wasnt needed but using either percentages of increase over stock or performance product factors would have had the same outcome, just maybe more confusing numbers for some.
09-02-2007, 09:01 PM
JumpingJack

Quote:

Originally Posted by STEvil

Is there a published scientific paper on that?

Thats reminiscent of people pricing things at $24.99 instead of $25 to make people see the smaller number. Its not hard to make your numbers look good when graphing, same things goes for finding percentages.

As to normalizing to a slower time we have to use stock as the baseline else you are going on prediction only. Using percentages wasnt needed but using either percentages of increase over stock or performance product factors would have had the same outcome, just maybe more confusing numbers for some.

No, this is Jr. High Math... Just plot it... :) .... Super Pi measured in time is an inverse function of frequency as it should, because frequency is 1/time.... this is not rocket science... heck run the experiment yourself... here is an X6800 stretching out the range of time and frequency so it is easier to see the functionality:

First the raw data:
http://img373.imageshack.us/img373/9...ing1bi5.th.jpg

Next, SP in time domain vs Frequency:
http://img470.imageshack.us/img470/9...ing2dq8.th.jpg

Finally, in the correct time domain (i.e. reciprocal time by taking 1/time, which makes it a linear function of frequency) for SP1,2,4,8M
http://img470.imageshack.us/img470/3...ing3hj8.th.jpg
09-02-2007, 10:00 PM
adamsleath

http://img470.imageshack.us/img470/9...caling2dq8.jpg

does not this indicate that performance increase is non linear and that (as i thought) scaling gives diminishing returns...?
ie once you reach the shallower part of the curve.
ie performance increase starts off sharp then plataeus. :) - such that higher and higher clocks give less and less performance increase delta.... for a given core at speeds plotted on the graph.

it is a parabolic curve (i think)

and it is a long time since i did school mathematics.

the scaling is NOT linear. - extrapolating beyond 3500(speed) in this example yields negligible performance returns; indeed the difference between 2500 and 3500 is s. f. a.
09-02-2007, 10:58 PM
JumpingJack

Quote:

Originally Posted by adamsleath

does not this indicate that performance increase is non linear and that (as i thought) scaling gives diminishing returns...?
ie once you reach the shallower part of the curve.
ie performance increase starts off sharp then plataeus. :) - such that higher and higher clocks give less and less performance increase delta.... for a given core at speeds plotted on the graph.

it is a parabolic curve (i think)

and it is a long time since i did school mathematics.

the scaling is NOT linear. - extrapolating beyond 3500(speed) in this example yields negligible performance returns; indeed the difference between 2500 and 3500 is s. f. a.

Not exactly, that is the point of how a processor understands time..... a processor only understands a clock tick.. A signal that goes high then low. The total time for that is irrelevant. You and I perceive time not clock ticks, so in your words this is correct ... point of diminishing returns.

Frequency is cycles per second, or cycles/second. As frequency increases the number of ticks increases within one second. However, calculating Pi in super PI only depends X number of clock ticks that performs x number of instructions. As you increase frequency, you increase the number of clock ticks per unit time, but you are observing unit time.... so the functionality of the observed time it takes to complete a task (lower is better) is inversely proportional to frequency. I.e. F(x) = 1/x, so plot a simple plot of 1/x, it approaches zero asymptotically, a paraboly is afunction of f(x) = x^2 this is different.

Units of a quanity are important, and mathematically, they are treated like any variable or number. Super PI, Pi calculated and reported in time is not directly linear to Frequency because frequency is in 1/Time... to make one a direct finction of the other simply convert Super PI from time domain to frequncy domain and plot... wolla linear. This is the same as you will read around when people discuss how to calcuate % improvement for benches that are 'slower is better', slower is better benchmarks always approach a 'point of diminishing returns'.... go check it out, find any review of a series of processors varying as a function of frequency for the same core where the reported bench is in units of time... and plot time vs frequency, it will always be inversely proportional.... Very simple.

KTE's data is also 'paraboloic' to use your word -- (actually not parabolic, that is a function of a quadratic equation), he just chose 1M on short time scales and over a smaller frequecy range in that he was on a 'flatter' area of the curve... if he did 8M and repeated the same data, assuming he makes no mistakes... he would get what you see in the X6800 data above as the 8M stretches out time to see the inverse proportionality.

Now, how does this relate to K10... well not much, we first have to assume that the one data point is correct (i.e. ~39 second SP1M at 2.0 GHz), what people are arguing is that K10 'turns on' after 2.4-2.6 GHz range such that it scales 'better'.... this is odd way of arguing it, because the digital logic of a CPU is just that, it only knows a clock tick it does not care how long that clock tick is when all the transistors flip on and off to give the computational result for that tick... simply speeding up the ticks does not change the logistical arrangement of bits and the functional blocks that create the logic to actuate those bits.....

From this data point, again assuming it is true, in the absense of extrenal bottlenecks (such as the memory, if that is even important), Super Pi should scale at best linear to frequency ... so, within a few %+/- due to noise (background processes, etc), SP1M for K10 would scale as such:

2.0 Ghz == 39 seconds
2.2 Ghz == 36.4 seconds
2.4 GHz == 34.2 seconds
2.6 GHz == 32.3 seconds
2.8 Ghz == 30.7 seconds
3.0 Ghz == 29.3 seconds

But this is gross based on one data point, I personally have a hard time believing K10 will give this kind of super pi performance, it is barely better than a K8....

Jack
09-03-2007, 12:33 AM
Jakko

A question.
If you were to slow a processor down, lower than its intended clockspeed, would it be possible that there comes a point at which performance suddenly takes a bigger hit than is expected by the decreased clockspeed?
Is it possible that there are mechanisms in a processor that only contribute above a minimum clockspeed, and work against performance below a certain clockspeed?
09-03-2007, 12:45 AM
leoftw

Quote:

Originally Posted by Lightman

:up:

@leoftw
How your system is configured memory wise. Do you have DIMMs plugged for both CPU sockets?

Can you run SuperPi? It is not x64 optimized so scores will be very comparable with your system. Same goes for CPU-Z cache latency test.
Thanks for your effort! :up:

EDIT: I just noticed over 4x speedup in muliCPU test! Why is that?? have you done 1-CPU test at lower clocks???

Hey guys my system is 100% OEM , nothing is overclocked or I have no special settings . Yes I do have dimms plugged in for both CPU sockets , each processor supposedly gets 2gigs from what I've read in the bios .
09-03-2007, 12:49 AM
adamsleath

Quote:

what people are arguing is that K10 'turns on' after 2.4-2.6 GHz range such that it scales 'better'....

what cpu has ever done that?
not that id be complaining if somehow that was built in to the cpu (not likely)

multithreaded superpi anyone??

or a chip design that executes many more instructions per cycle or somesuch.
09-03-2007, 02:41 AM
freeloader

Quote:

Originally Posted by KTE

:D It can get far worse believe me. :)

One of the points I wanted to show is how it clearly and perilously casts doubts on this http://img.coolaler.com.tw/images/zm...mlwemyzmzk.jpg compared to this http://img527.imageshack.us/img527/6582/sp47cj7.jpg

Bro, that was 1x 512MB RAM not even dual channel and an old CD. A P4 will get 10 seconds lower at least, beating the Barcelona quad core time they showed.
Yes, the K8 1900 was baseline compared to a K10 1900 - as an assumption for better clock per clock performance. It could be far lower, obviously, but this is a vague explanatory comparison of what Gary stated rather than a "prediction".

What I was saying is, Gary of Anandtech could be basing his statements on similar performance scaling he saw with the K10, as I did with the Celeron chip.
Yes. It's not impossible is what I'm saying. Look at how the numbers fluctuate with the Celeron and where. Pure technical math cannot account for this, so we won't be able to explain it, but we'll experience it. It's possible that K10 at lower clocks does not scale as well as some higher clocks. I've just shown you in one application how my old Celeron did it, which means it's entirely possible.

That's definitely interesting. I guess I'm so used to seeing "linear" scaling, that I didn't give it much thought. It will definitely be cool to see otherwise with K10.

Just a guess here concering SuperPI. Is it possible that we'll see non linear scores due to the L3 cache having decreased latency as the core clock scales higher or is the latency always the same, no matter what the core is clocked at?
09-03-2007, 03:35 AM
CoW]8(0)

The point Jack is trying to make is that, SuperPI, is LOOPED CODE and the IPC improvements of K10 are the same no matter what frequency the CPU is running at.

And that's that problem with looped code; it will only give you the performance of a very specific scenario and nothing more.
09-03-2007, 03:46 AM
CoW]8(0)

Quote:

Originally Posted by freeloader

That's definitely interesting. I guess I'm so used to seeing "linear" scaling, that I didn't give it much thought. It will definitely be cool to see otherwise with K10.

Just a guess here concering SuperPI. Is it possible that we'll see non linear scores due to the L3 cache having decreased latency as the core clock scales higher or is the latency always the same, no matter what the core is clocked at?

You're thinking in the time domain again. The number of clock cycles for code to execute will always be the same regardless of how long those clock cycles are.

So to answer your question, no. It will not affect the scaling.
09-03-2007, 08:01 AM
KTE

Nice analysis going on although it's becoming very offshoot to what I stated and did. Scaling wasn't shown in the true mathematical sense, little time to do it. This was data I already had for some time now, copied it over. All that was meant to be shown is the % change in SPI 1M times as frequency rises above the 1800MHz time, which is projected as 100%, or stock. Call is 0% if you like and anything on-top is the gained percentage.

K10 comes in where people state affirmatively that something they do not own nor have seen perform from a released product, cannot react in a certain way, for whatever musings. I've seen erratic jumps before in processor performance, and I'm showing one of them right here.

It''s not to "argue" anything but a viable possibility you cannot on any grounds reject yet. You have no evidence to. Or we'd like to see it. :)

Quote:

Originally Posted by JumpingJack

He is using to argue non-linear scaling, but he misinterprets the data..

First, there are two typos, or atleast I think they are typos, the first is at 2205 Mhz, he says 73.455 but I think he meant 93.455, this is in line with what would be expected.

No typo, I clearly state it's experimental results. I predicted linear scaling, as most people would. But investigation speaks otherwise - something I cannot explain but I experience.
Conditions are exactly the same for all but >3500 (unstable), in fact I'm sure they were ran one after another on the same day, with minimum services/processes running in the background. One thing I clearly stated I didn't and couldn't do, is keep the memory asynchronous. RAM timings/divider is kept the same. But the rest I'll show it you and I've repeated it time and time again from around 4 months back. All of the data stands final. ;)

I'll try and find what I still can, one second. Hmm... there's data missing here. I only seem to have a few of the run results saved but have more info in the text file saved instead. Anyway, the interesting ones are all still there, the rest of the missing +100MHz sampling points are progressive and linear as you would expect and I'll try finding them on the other drives. Here ya go.

1800MHz = http://img212.imageshack.us/img212/5...00spi1mlx8.jpg
2000MHz = http://img406.imageshack.us/img406/8...00spi1mus3.jpg
2100MHz = http://img510.imageshack.us/img510/6...00spi1mgc7.jpg
2200MHz = http://img205.imageshack.us/img205/1...00spi1mwn9.jpg
2300MHz = http://img250.imageshack.us/img250/8...00spi1mqk4.jpg
2500MHz = http://img508.imageshack.us/img508/7...00spi1mhm8.jpg
2600MHz = http://img209.imageshack.us/img209/3...00spi1mlb0.jpg
2800MHz = http://img510.imageshack.us/img510/2...00spi1map4.jpg
3000MHz = http://img515.imageshack.us/img515/1...00spi1mod1.jpg
3200MHz = http://img265.imageshack.us/img265/4...00spi1mnr5.jpg
3300MHz = http://img250.imageshack.us/img250/8...00spi1mzk4.jpg
3360MHz (3400 is the same) = http://img210.imageshack.us/img210/4379/spi48qg8.jpg
3500MHz = http://img250.imageshack.us/img250/2747/sp47dm3.jpg

3500MHz is quickest = after that, there's hardly any change if the processor runs it (memory bottleneck). If I had similar memory and system now, I'd repeat them now again just to refresh, but I don't and I didn't know this was coming to prepare but just did it for my own personal investigating back then.

BTW, science doesn't equal "we expect this and this all that can be true." Broken logic is to expect linear scaling and when something other occurs you start the conspiracies. Science broadens your horizons to accept observational finding, like the new colossal area devoid of matter found in space, even devoid of dark energy, which was NEVER predicted nor expected at those sizes and changes acceptance of many beliefs and idea's held by physicists beforehand.

Scientific experiment = controlled conditions + variable factor + experiment + observation + repetition + results

I've just given you results of some of my findings, enough for a genuinely interested person to see what's happening for themselves.

Quote:

The other is 3301 or 3201, I looks like he ran the same run twice at the same frequency but recorded a different speed. Here is a plot of his SP1M time vs frequency

Conspiracy theories and conjectures will get one no where. All you have to do is ask for evidence before the insinuating. :)

Quote:

In short, he data shows nothing but a handful of mistakes and that indeed SP1M scales linearly with clock speed.

That's what you want to believe Jack, not what the evidence shows. I'm sorry but you haven't proved how my finding ties in with your belief.
09-03-2007, 08:08 AM
gbjorn

Hi there

I was just looking at the forums at coolaler and saw this. This is K10 in Pi B0 stepping I think. And from the picture it does Pi in 32 seconds at only 1800Mhz. By the way, I'm new here.
09-03-2007, 08:09 AM
gbjorn

sorry , forgot the link

Here is the Link http://forum.coolaler.com/showthread...161127&page=30.
09-03-2007, 08:39 AM
freeloader

Quote:

Originally Posted by CoW]8(0)

You're thinking in the time domain again. The number of clock cycles for code to execute will always be the same regardless of how long those clock cycles are.

So to answer your question, no. It will not affect the scaling.

So basically you're telling me that a 2ghz Barcelona would have the same L3 cache latency (any cache latency for that fact) as a 3ghz Barcelona? Thanks for helping out with this stuff, it's interesting.
09-03-2007, 08:51 AM
Sampsa

Quote:

Originally Posted by gbjorn

I was just looking at the forums at coolaler and saw this. This is K10 in Pi B0 stepping I think. And from the picture it does Pi in 32 seconds at only 1800Mhz. By the way, I'm new here.

Quote:

Originally Posted by gbjorn

Here is the Link http://forum.coolaler.com/showthread...161127&page=30.

CPU-Z-window has different theme than SuperPi window? Yeah right, nice photoshop.
09-03-2007, 08:54 AM
BeardyMan

Quote:

Originally Posted by Sampsa

CPU-Z-window has different theme than SuperPi window? Yeah right, nice photoshop.

i thought superpi theme was customizble :D
09-03-2007, 10:25 AM
Lightman

Quote:

Originally Posted by freeloader

So basically you're telling me that a 2ghz Barcelona would have the same L3 cache latency (any cache latency for that fact) as a 3ghz Barcelona? Thanks for helping out with this stuff, it's interesting.

I would say opposite, but this is only my opinion....
For me it will be like doing SuperPi with DDR2 667 4-4-4 and DDR2 800 4-4-4 on same CPU.
With faster K10 cores you will get differently clocked northbridge which clocks cache L3. It also will be possible to change cache frequency...

Don't take it as a certain, this is just speculation :) .
09-03-2007, 10:38 AM
informal

Sice due to heavy trolling and OT by Donnie27 and others in news "30K" thread ...

....people didn't get the chance to notice my post there,i will repost it here :) :

Quote:

AMD Phenom X4 can do 3GHz and above

Quote:

By normal air cool, the AMD Phenom X4 can go beyond the 3GHz mark by overclocking. Although this can be done, there are some stability issues at such high speed.

Currently, there aren't any options to turn off one or two of the cores. Running it in single channel memory helps to stabilise it.

http://my.ocworkbench.com/bbs/showth...200#post420200

Bluetooth is one of those guys who is credible and who got the one of the 1st GA RD790 mobos to test.So this is very good news coming from them!

Note he says air cooling is used and it's a X4 ,so all of the cores must work at the same clock since something is wrong(bios?) and they can't use separate PLLs to clock the cores individually.
09-03-2007, 10:56 AM
Lightman

That's very good news....
With platform stability they have a bit of time to work out issues so I'm looking ahead to Phenom launch :) .
Bear in mind all this is on pre-production silicon ;)

BTW K10 is designed to hit 4GHz......
09-03-2007, 10:58 AM
JumpingJack

Quote:

Originally Posted by gbjorn

Here is the Link http://forum.coolaler.com/showthread...161127&page=30.

This 'image' was created by someone who was claiming that the Coolaler K10 data was faked, by producing a 'see how easy I can fake something post' on this thread:
http://forums.vr-zone.com/showthread.php?t=181755

I can't read Chinese, but likely someone referenced to this is similar discussions we are having here.
09-03-2007, 11:05 AM
informal

Yeah that "SPI score" of 31 is a bad fake.Best we ignore that and concentrate on the OCworkbench news.
What could have made them unable to individually clock the separate cores in Phenom ES?Something in the ES chip itself or the bios bug?Very good news about the beta silicon,mobos and bioses hitting >3GHZ OCs with air cooling.This speaks a lot about the potential of the core and at the same time amazes me what AMD managed to do in such a short time from the first reports of clock problems(speedpath issues in the chip).They conquered the whole GHz in a couple of respins(source :dailytech).
Still we can expect 125W spec for first retail 3Ghz version of X4(whenever they are out)

edit:

Latest info from AnandTechStaff:

Quote:

Originally Posted by G.Key@AT

Throughout the entire prototype and pre-production (as stated in my last message) process, certain features on the CPU, in the BIOS, or on the chipsets have been turned off/on, latencies have changed, etc, etc. This is a normal part of the engineering process as the design is fleshed out and finalized. It does not represent final silicon capabilities and performance.

As I said earlier, I used a poor example as it was not meant to be taken literally spec for spec when comparing engines and CPUs. Regardless of the example, the point was that the platform performance improved significantly as the core speeds improved and this included performance per watt among other indicators. There is a myriad of reasons as to why this occured but considering the early silicon, BIOS, and chipset designs, we could only speculate as to why and I tried to present a few reasons that we honed in on.

If you compare a B00 chip from May to a B02 today, there is a significant difference in performance in all areas (26 seconds in SuperPI 1m for one) and my comments represent observations of what has occurred over this time period. We have final silicon now and results will be posted in the near future. My observations today are different than they were two weeks ago and as the platform matures they will change again.

Once we see the HT 3.0 capable chipsets and Phenom cores mature then we will have an even better indication of the performance of this core design in the consumer market but for now the initial release is Barcelona in the enterprise market.
09-03-2007, 11:49 AM
BeardyMan

26 seconds in SuperPI 1m for one

that's allot :)
coming from 39secs to 13secs would be great though :D
09-03-2007, 11:56 AM
JumpingJack

Quote:

Originally Posted by KTE

Nice analysis going on although it's becoming very offshoot to what I stated and did. Scaling wasn't shown in the true mathematical sense, little time to do it. This was data I already had for some time now, copied it over.

However, you did try to calculate a scaling factor and concluded on a data set that has 4 of the 14 data points fall away from the expected trend. There are two possibilities, either those points repesent something real or you made a mistake (no harm admitting a mistake). It is poor analysis to see such weird behavior and publish it as 'this is the way it is', without vetting those anomolous data points through repetition and attention to some detail and not expect to get challenged on those anomolies.

Quote:

K10 comes in where people state affirmatively that something they do not own nor have seen perform from a released product, cannot react in a certain way, for whatever musings. I've seen erratic jumps before in processor performance, and I'm showing one of them right here.

But is that due to the CPU or the person using the CPU? Do take offense to this, I can show you anomolous results by the very nature you go into below, where the conditions were not fully understood by which the tests were conducted.

Quote:

No typo, I clearly state it's experimental results. I predicted linear scaling, as most people would. But investigation speaks otherwise - something I cannot explain but I experience.

This is the failure of your approach, when you do see such anomolous behavior, there is always the cause and effect. I will show you an example,
http://img339.imageshack.us/img339/9...usesp1mjv1.jpg
54.328 seconds
http://img504.imageshack.us/img504/2...usesp1mcu1.jpg
58.766 seconds

Looking at this one data point, man, what the heck.... some variability... but, wait... the difference is in the first case I started the run and left alone, in the second I started the run but moved my mouse around.... ahhhh, so is there something wrong with CPU, not reproducible or was there something wrong with the way I ran the bench?.... hmmmmmm.

Quote:

Conditions are exactly the same for all but >3500 (unstable), in fact I'm sure they were ran one after another on the same day, with minimum services/processes running in the background. One thing I clearly stated I didn't and couldn't do, is keep the memory asynchronous. RAM timings/divider is kept the same. But the rest I'll show it you and I've repeated it time and time again from around 4 months back. All of the data stands final. ;)

This fine, it stand final but still flawed.

Quote:

1800MHz = http://img212.imageshack.us/img212/5...00spi1mlx8.jpg
2000MHz = http://img406.imageshack.us/img406/8...00spi1mus3.jpg
2100MHz = http://img510.imageshack.us/img510/6...00spi1mgc7.jpg
2200MHz = http://img205.imageshack.us/img205/1...00spi1mwn9.jpg
2300MHz = http://img250.imageshack.us/img250/8...00spi1mqk4.jpg
2500MHz = http://img508.imageshack.us/img508/7...00spi1mhm8.jpg
2600MHz = http://img209.imageshack.us/img209/3...00spi1mlb0.jpg
2800MHz = http://img510.imageshack.us/img510/2...00spi1map4.jpg
3000MHz = http://img515.imageshack.us/img515/1...00spi1mod1.jpg
3200MHz = http://img265.imageshack.us/img265/4...00spi1mnr5.jpg
3300MHz = http://img250.imageshack.us/img250/8...00spi1mzk4.jpg
3360MHz (3400 is the same) = http://img210.imageshack.us/img210/4379/spi48qg8.jpg
3500MHz = http://img250.imageshack.us/img250/2747/sp47dm3.jpg

Thanks for the screen shots, so my guess that you typo'ed was incorrect, no big deal ... it was one plausible explanation for the outlier data, so now we should look for other explanations, perhaps more experiments.

Quote:

3500MHz is quickest = after that, there's hardly any change if the processor runs it (memory bottleneck). If I had similar memory and system now, I'd repeat them now again just to refresh, but I don't and I didn't know this was coming to prepare but just did it for my own personal investigating back then.

You need to be careful making this conclusion because your data set is inconsistent to begin with....

Quote:

BTW, science doesn't equal "we expect this and this all that can be true." Broken logic is to expect linear scaling and when something other occurs you start the conspiracies. Science broadens your horizons to accept observational finding, like the new colossal area devoid of matter found in space, even devoid of dark energy, which was NEVER predicted nor expected at those sizes and changes acceptance of many beliefs and idea's held by physicists beforehand.

So you don't believe in science? Yes, science does broaden horizons, within the context of performing science correctly, here you did not.

Quote:

Scientific experiment = controlled conditions + variable factor + Hypothesize+ experiment + observation + repetition + results + revise + repeat + conclude

You missed quite abit.... you failed to revise your hypothesis or repeat your experiment, I would really like to see you repeat this data set again, but I don't expect you to expend the time, I will do it myself, I have a Prescott I will rebuild.

Quote:

That's what you want to believe Jack, not what the evidence shows. I'm sorry but you haven't proved how my finding ties in with your belief.

I am not trying to prove anything, I am challenging your data set where 4 of your 14 points are outside expected..... in short, you should do a better job of rationalizing the outlying data before drawing conclusions and publishing results, some may look at this and be ... meh.. ok. But there could be someone who looks at this and thinks ' man, this is counter to all the existing data, there must be something wrong' -- which is what I did.
09-03-2007, 12:02 PM
CoW]8(0)

Quote:

Originally Posted by freeloader

So basically you're telling me that a 2ghz Barcelona would have the same L3 cache latency (any cache latency for that fact) as a 3ghz Barcelona? Thanks for helping out with this stuff, it's interesting.

No, I'm saying it won't affect the scaling. Why is there a latency decrease when you increase frequency.

As you increase frequency, time decreases (latency). That's why there's a latency decrease and that's why I believe it will scale the same.
09-03-2007, 12:04 PM
gallag

Quote:

Originally Posted by informal

Best we ignore that and concentrate on the OCworkbench news.

lol Yeah lets ignore the bad and concentrate on the good lol:ROTF:
09-03-2007, 12:07 PM
JumpingJack

Quote:

Originally Posted by CoW]8(0)

No, I'm saying it won't affect the scaling. Why is there a latency decrease when you increase frequency.

As you increase frequency, time decreases (latency). That's why there's a latency decrease and that's why I believe it will scale the same.

Clock cycle latency remains the same..... though you understand the concept I will state it anyway an example:

12 cycles of L2 latency at 2 GHz gives 6 ns latency in time.
12 cycles of L2 latency at 3 GHz gives 4 ns latency in time.

The total time to propogate a signal through the chip has a ceiling, hence as you decrase the clock period at fixed cycle latency -- the wall will be hit and no more clocks for you :)

It seems a lot of people have a hard time understanding the digitial tick of a clock and the time period for that tick and how that translates into scaling. :)
09-03-2007, 12:16 PM
freeloader

Quote:

Originally Posted by CoW]8(0)

No, I'm saying it won't affect the scaling. Why is there a latency decrease when you increase frequency.

As you increase frequency, time decreases (latency). That's why there's a latency decrease and that's why I believe it will scale the same.

OK, now I fully understand. Thanks! :D
09-03-2007, 12:18 PM
JumpingJack

Quote:

Originally Posted by BeardyMan

26 seconds in SuperPI 1m for one

that's allot :)
coming from 39secs to 13secs would be great though :D

For the record, a 2 GHz Conroe gets 26 seconds...

http://img70.imageshack.us/img70/391...0102000mr1.jpg
09-03-2007, 12:24 PM
informal

No the dude said he saw 26s decrease(from whatever the score was before)..Not that it scored 26s(it could be more or less for what we know..)
09-03-2007, 12:43 PM
JumpingJack

Quote:

Originally Posted by informal

No the dude said he saw 26s decrease(from whatever the score was before)..Not that it scored 26s(it could be more or less for what we know..)

Ooops...
09-03-2007, 01:03 PM
Dagalidis

Quote:

Originally Posted by informal

Originally Posted by G.Key@AT
Throughout the entire prototype and pre-production (as stated in my last message) process, certain features on the CPU, in the BIOS, or on the chipsets have been turned off/on, latencies have changed, etc, etc. This is a normal part of the engineering process as the design is fleshed out and finalized. It does not represent final silicon capabilities and performance.

As I said earlier, I used a poor example as it was not meant to be taken literally spec for spec when comparing engines and CPUs. Regardless of the example, the point was that the platform performance improved significantly as the core speeds improved and this included performance per watt among other indicators. There is a myriad of reasons as to why this occured but considering the early silicon, BIOS, and chipset designs, we could only speculate as to why and I tried to present a few reasons that we honed in on.

If you compare a B00 chip from May to a B02 today, there is a significant difference in performance in all areas (26 seconds in SuperPI 1m for one) and my comments represent observations of what has occurred over this time period. We have final silicon now and results will be posted in the near future. My observations today are different than they were two weeks ago and as the platform matures they will change again.

Once we see the HT 3.0 capable chipsets and Phenom cores mature then we will have an even better indication of the performance of this core design in the consumer market but for now the initial release is Barcelona in the enterprise market.

Link please ?
09-03-2007, 01:17 PM
informal

But i doubt the original score of that whatever stepping(of B01 revision) was worse than revF K8 in SPi 1m.What does 2Ghz K8 with DDR2-667 cas5 get in Spi 1m?Around 42s right?If the ES was somewhat worse than that than the "better" score could very well be very good.And the dude said they saw week-to-week improvements with both new chips and new boards.So who knows what floats around there in the wild.

Link for that quote:
http://forums.anandtech.com/messagev...VIEWTMP=Linear
09-03-2007, 01:48 PM
Motiv

Quote:

Originally Posted by informal

No the dude said he saw 26s decrease(from whatever the score was before)..Not that it scored 26s(it could be more or less for what we know..)

The way it reads, he is stating the production chip does it in 26 Seconds.
09-03-2007, 01:57 PM
informal

Quote:

Originally Posted by Motiv

The way it reads, he is stating the production chip does it in 26 Seconds.

Hmm is English your first language?(not bashing just asking :) )

Since you can read it for yourself one more time(pay attention to bold and blue) :

Quote:

there is a significant difference in performance in all areas (26 seconds in SuperPI 1m for one)

I "painted" in blue the important parts to make it more clear :)
09-03-2007, 02:07 PM
Dagalidis

LOL...
This is NOT significant difference ...
This is HUGE difference i can say ....:rofl:
09-03-2007, 02:25 PM
Motiv

Quote:

Originally Posted by informal

Hmm is English your first language?(not bashing just asking :) )

Since you can read it for yourself one more time(pay attention to bold and blue) :

I "painted" in blue the important parts to make it more clear :)

yes it is my first Language.

It reads that there is a difference in the two cpus, in that the newer revision runs Pi at 26 seconds, not a 26 second difference in time.

That's how I read it.
09-03-2007, 02:46 PM
bodomax

Quote:

Originally Posted by Motiv

yes it is my first Language.

It reads that there is a difference in the two cpus, in that the newer revision runs Pi at 26 seconds, not a 26 second difference in time.

That's how I read it.

Well, english is my third language, and I interpreted it as a difference in time = 26s less, also considering that they cannot disclose any benchmark numbers yet.

Anyway, it looks quite promising, I think we all need a strong AMD and a strong intel.
09-03-2007, 03:10 PM
SOLDNER-MOFO64

PHENOM @ 3GHZ+ on AIR :clap:

Didn't I say not to worry about AMD?

This is gonna be insane :up:
09-03-2007, 03:15 PM
m411b

Quote:

Originally Posted by SOLDNER-MOFO64

PHENOM @ 3GHZ+ on AIR :clap:

Didn't I say not to worry about AMD?

This is gonna be insane :up:

Thats great Dude, but if Phenom isn't pulling the same or better bench scores then Intel is right NOW, then AMD better just pack it up and prepare for evolution.
09-03-2007, 05:16 PM
i found nemo

Quote:

Originally Posted by [XC] M411b

Thats great Dude, but if Phenom isn't pulling the same or better bench scores then Intel is right NOW, then AMD better just pack it up and prepare for evolution.

if .... but until then amd ftw!!!! lol
09-03-2007, 05:37 PM
Plywood99

I fully trust Gary Key. Long time bencher who has always been credible. Sounds like K10 will be very good chip...

Ply
09-03-2007, 11:50 PM
bokis

I don´t now if this is new info or old so take a look
http://www.hardwaresecrets.com/article/480
09-04-2007, 03:02 AM
mongoled

Quote:

Originally Posted by informal

Hmm is English your first language?(not bashing just asking :) )

Since you can read it for yourself one more time(pay attention to bold and blue) :

I "painted" in blue the important parts to make it more clear :)

Quote:

there is a significant difference in performance in all areas (26 seconds in SuperPI 1m for one)

I read this statement as SuperPI 1m being done in 26 seconds. After re-reading it I am still not sure if Gary is speaking abt the a difference of 26 seconds or not.

The statement can be 'read' in both ways, only Gary can tell us exactly what he meant and I am pretty sure this may have been worded they way it was as to not circumvent any 'agreements'

LOL

:)
09-04-2007, 03:26 AM
Jacky

Quote:

Originally Posted by Plywood99

I fully trust Gary Key. Long time bencher who has always been credible. Sounds like K10 will be very good chip...

Ply

no not necessarily, the improvement from "B00 chip from May to a B02" is probably due to some important parts being turned of in the B00 (maybe L3 or even L2) - the fact that *some parts were turned off* was already stated somewhere - thus it tells us nothing about the retail barcelona.
and most agree that coolaler's results show a B01 that is going to launch on sept. 10, however we don't know if the results aren't fake.
but don't forget that gary also said this about the results:

Quote:

Actually, based on the last chip we had, those numbers are in alignment with some of the results we noticed and others as well. I think they will be better than that at release and especially on a consumer board with Vista 64-bit

6 more days until we find out :)
09-04-2007, 08:44 AM
Ubermann

13s at Spi at 2ghz..yeah right =)
Some people need a reality check.
09-04-2007, 09:03 AM
informal

Who said it was 12s??? It is said the decrease from B00 to B02 was that..From what we know(nothing to be exact :) ),B00 could have score in a range of 40 to 50s in Pi 1m :)
09-04-2007, 09:22 AM
awdrifter

There's no way that the SPi time is 13s @ 2ghz. If that's true, AMD would already showed it. The best we can hope for is that the 26s @ 2ghz is true.
09-04-2007, 09:52 AM
SOLDNER-MOFO64

Quote:

Originally Posted by Ubermann

13s at Spi at 2ghz..yeah right =)
Some people need a reality check.

I think you got one comin' in 6days buddy :up:

Quote:

Originally Posted by awdrifter

There's no way that the SPi time is 13s @ 2ghz. If that's true, AMD would already showed it. The best we can hope for is that the 26s @ 2ghz is true.

Why would AMD show it? They've never released any bench-scores early in the past....they're confident in themselves and their abilities and don't need to lure in buyers with leaked benchmarks ahead of time.
09-04-2007, 10:32 AM
Ubermann

Only show what your good at, nothing else.
Thats how you get money to your company..

(and fool a few fanboys)
09-04-2007, 10:37 AM
ScythedBlade

Smart ass is right ... AMD is a bit fishy ..... it shows specFP ... but not superPi? wtf ...
09-04-2007, 11:06 AM
hawkeyefan

Quote:

Originally Posted by ScythedBlade

Smart ass is right ... AMD is a bit fishy ..... it shows specFP ... but not superPi? wtf ...

Why would a company whose success depends on the server market use SuperPi to lure customers? Nobody gives a crap about SuperPi except overclockers.
09-04-2007, 11:07 AM
dave_graham

Quote:

Originally Posted by hawkeyefan

Why would a company whose success depends on the server market use SuperPi to lure customers? Nobody gives a crap about SuperPi except overclockers.

good point.

cheers,

dave
09-04-2007, 12:10 PM
akaBruno

THere's no doubt that Conroe is Superb @ SuperPi.

I just hope that the new K10 can at least come close. Somebody has to keep em all honest.

BRUNO
09-04-2007, 12:12 PM
ScythedBlade

Bad point ... why the hell would anyone use specFP? XD

(*poke* why not int as well ...)

GG, AMD Fanboy ...

Why not put a video of every type of benchmark, AMD?

(self proclaims good point .... actually, it has to be, because everyone is wanting all the benchmarks of K10)
09-04-2007, 03:01 PM
SOLDNER-MOFO64

Quote:

Originally Posted by ScythedBlade

Smart ass is right ... AMD is a bit fishy ..... it shows specFP ... but not superPi? wtf ...

When you run Super Pi (in the version you most use) C2D's score waaaaay faster times but that ain't the only Pi proggy around.

Sys Tools Pi Bench gives a much closer result. SuperPi itself is an INTEL app IMO.

Show 100 post(s) from this thread on one page

All times are GMT -8. The time now is 01:09 AM.

XtremeSystems