Nahhhh, we would get bored.
Yeah, and I cannot believe I was so ignorant not to think this through. The L3 cache is shared, but each core is throttled independently. The overall intrinsic latency of the L3 will, like any cache, be fixed by the size and quality of the process technology as well as the speed paths set at design.
However, since each core will throttle depending on load, the clock for each core can be different from the L3's... necessitating an asynchronous bus so that each core can still access the data....
Now, simple asynchronous communications will always have variable latency (as a function of the ratio or divider), as one agent will need to wait on the other at some point. Example: say you have a 6:5 divider; call the agents A and B, so it is 6:5 A:B. To make this easy, let's say a 1-bit line, so in 5 clock ticks it will send 5 bits for agent B, but agent A has put 6 clock ticks into the queue; one cycle will be left hanging until the next revolution around.... temporally this would make no difference, but agent A is only as fast as agent B.....
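The 6:5 divider example can be sketched as a tiny event simulation. This is a hypothetical toy model (not AMD's actual implementation): agent A enqueues one bit per A tick, agent B drains one bit per B tick, and after one full revolution of the common period one bit is left hanging.

```python
from fractions import Fraction

def leftover_after_revolution(a_ticks, b_ticks):
    """One 'revolution' is the common period of the two clocks.
    A enqueues 1 bit per A tick; B dequeues 1 bit per B tick (if any)."""
    events = sorted(
        [(Fraction(i + 1, a_ticks), "A") for i in range(a_ticks)] +
        [(Fraction(j + 1, b_ticks), "B") for j in range(b_ticks)]
    )
    queue = 0
    for _, agent in events:
        if agent == "A":
            queue += 1      # producer tick: a bit goes into the queue
        elif queue > 0:
            queue -= 1      # consumer tick: a bit is drained
    return queue

print(leftover_after_revolution(6, 5))  # 6:5 divider -> 1 bit left hanging
```

With a 1:1 divider nothing is left over, which is why a fully synchronous interface does not pay this particular cost.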
But you also have to add into the mix the physical latency of the circuit to do this work.... it is a trade off, one that AMD obviously believes is better in the long run... so long as L3 'observed' latency is much less than that to main memory, there is a benefit.
Yes Jack, this is a simplified reason why it occurs. Quote:
However, since each core will throttle depending on load, the clock for each core will be different from the L3's... necessitating an asynchronous bus so that each core can still access the data....
Now, simple asynchronous communications will always have variable latency, as one agent will need to wait on the other at some point. Example: say you have a 6:5 divider; call the agents A and B, so it is 6:5 A:B. To make this easy, let's say a 1-bit line, so in 5 clock ticks it will send 5 bits for agent B, but agent A has put 6 clock ticks into the queue; one cycle will be left hanging until the next revolution around.... temporally this would make no difference, but agent A is only as fast as agent B.....
I thought you saw Kanter's article long ago (it has been online since the middle of May, I think).
This is not entirely true...
I read somewhere a long time ago that the L3 in K10 acts more like a memory layer. In other words, it is clocked by the IMC independently from all 4 cores, and on a diagram I would put it after the crossbar...
That's why L3 latency can vary from the core's point of view (the cache latency itself is probably constant). It is similar to how DDR2-800 latency (again, from the CPU's point of view) differs from DDR2-667 (same timings, of course :) ).
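The DDR2 analogy works out numerically: the same CAS cycle count takes more wall-clock time at a lower clock. A quick back-of-the-envelope sketch (CL5 is an assumed, illustrative timing):

```python
def cas_latency_ns(cas_cycles, data_rate_mts):
    """CAS latency in nanoseconds. DDR transfers twice per I/O clock,
    so the bus clock in MHz is half the data rate in MT/s."""
    bus_clock_mhz = data_rate_mts / 2.0
    return cas_cycles * 1000.0 / bus_clock_mhz

print(cas_latency_ns(5, 800))  # DDR2-800 at CL5 -> 12.5 ns
print(cas_latency_ns(5, 667))  # DDR2-667 at CL5 -> ~15 ns
```

Same timings, slower clock, more nanoseconds, exactly as seen from the CPU side.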
Edit: JumpingJack, you're typing too fast :) I had barely read page 16 and typed my response, and here, surprise! another page with new info making my post partially obsolete :)
I am not sure I understand whether you understand what I am trying to say :) ...
A shared resource clocked at one speed serving 4 other resources clocked at different speeds will necessitate asynchronous communications... there is no other way... thus AMD must provide functionality to account for floating clocks between 4 cores and one memory pool, the L3.... just adding the circuits to do this work will incur latency...
Add on top of that, 1:1 divider latency < 3:2 divider latency < 2:1 divider latency... hence the 'observed' latency from any core is variable...... at least if you read Kanter's article, this is what the FIFO buffers do... he did not mention the x-bar.
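The divider ordering can be checked with a small model: assume both clocks share an aligned edge at t = 0, and measure the average wait from a source-clock edge to the next destination-clock edge over one common period. This is a simplification that ignores synchronizer stages and FIFO depth, so the numbers only illustrate the trend.

```python
from fractions import Fraction

def avg_crossing_wait(src_ticks, dst_ticks):
    """Average wait from a source-clock edge to the next destination-clock
    edge, in units of the common period (edges aligned at t = 0)."""
    dst_edges = [Fraction(j + 1, dst_ticks) for j in range(dst_ticks)]
    waits = []
    for i in range(src_ticks):
        t = Fraction(i + 1, src_ticks)
        waits.append(min(e - t for e in dst_edges if e >= t))
    return sum(waits) / len(waits)

for ratio in [(1, 1), (3, 2), (2, 1)]:
    # 1:1 waits 0; 3:2 averages 1/6; 2:1 averages 1/4 of the common period
    print(ratio, avg_crossing_wait(*ratio))
```

So even in this idealized model, 1:1 < 3:2 < 2:1 for the extra crossing wait.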
There is ongoing research on asynchronous networks that can support connections with both low BW and low latency requirements, but there has always been this fundamental trade-off:
http://www.ee.technion.ac.il/courses...OC-async05.pdfQuote:
Previously published NoCs which provide GS are ÆTHEREAL [18][9] and NOSTRUM [14]. Both are synchronous and employ variants of time division multiplexing (TDM) for providing per connection bandwidth (BW) guarantees. TDM has the drawback of the connection latency being inversely proportional to the BW, thus connections with low BW and low latency requirements, e.g. interrupts, are not supported.
Not quite the paper I would use, but a recently written one I could find that summarizes the issue at hand and that I could quote as a source, so you don't have to take my word for it.... i.e. connection latency is hard to get very low in networks where a global clock is not real.... here he discusses time division multiplexing, a type of clock dividing.
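The TDM drawback the quote describes can be made concrete with a toy schedule: granting a connection k of F evenly spaced slots gives it a bandwidth fraction k/F, but its worst-case wait for the next slot grows as F/k, so guaranteed latency is inversely proportional to guaranteed BW. The slot count and slot time below are made-up numbers for illustration.

```python
def tdm_guarantees(frame_slots, slots_granted, slot_ns):
    """Guarantees for a connection granted k of F evenly spaced TDM slots:
    bandwidth fraction k/F, worst-case wait of about F/k slot times."""
    bw_fraction = slots_granted / frame_slots
    worst_wait_ns = (frame_slots / slots_granted) * slot_ns
    return bw_fraction, worst_wait_ns

for k in (1, 2, 4, 8):
    bw, wait = tdm_guarantees(16, k, 2.0)
    print(f"{k}/16 slots: BW fraction {bw:.3f}, worst wait ~{wait:.1f} ns")
```

Note that the product BW x wait is constant, which is exactly why a low-BW, low-latency connection (like an interrupt) cannot be served by plain TDM.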
Edit: Found another paper which is much more detailed, and has some info on the FIFO implementation over a global clock:
http://www.collectionscanada.ca/obj/...11/MQ34126.pdfQuote:
Simulation results for the FIFO and the two versions of the adder are given in Table 1. The optimized adder has 2-input C-elements while the other adder is using 4-input C-elements. The operations/second indicate the number of logic evaluations done per second in each basic cell. Cycle time is the fastest time at which the pipeline can send out successive data values. Latency is the time it takes for data to go from the input of the circuit until it is finally ready at the output. Pipelined systems work on the principle of reducing the cycle time at the cost of increased latency. The next section examines how an enhancement to the system can reduce the latency even further.
(see page 73). This is an old paper, but he is showing 18 ns latency for a straight-up FIFO buffer. This is a large number, and is not to be considered true or accurate wrt K10.
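The cycle-time-vs-latency principle in the quoted passage is easy to see with a toy pipeline model: splitting a fixed amount of logic into more stages shortens the cycle time but adds one register/handshake overhead per stage to the end-to-end latency. The 16 ns of logic and 0.5 ns of per-stage overhead below are arbitrary illustrative numbers, not anything measured.

```python
def pipeline_timing(logic_ns, stages, overhead_ns):
    """Split logic_ns of combinational work into equal stages, paying
    overhead_ns of register/handshake delay per stage."""
    cycle = logic_ns / stages + overhead_ns   # fastest issue rate
    latency = cycle * stages                  # input-to-output time
    return cycle, latency

for n in (1, 2, 4):
    cycle, latency = pipeline_timing(16.0, n, 0.5)
    print(f"{n} stage(s): cycle {cycle:.1f} ns, latency {latency:.1f} ns")
```

More stages means higher throughput but a longer trip through the circuit, which is the trade-off the paper's FIFO numbers reflect.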
Jack
I understand what you're trying to say; that's why I put the edit.
As I said, I read a long time ago (probably on RWT, but not coming from DK) that the L3 will operate in a similar way to normal memory, and it will be possible to clock it independently from the cores.
If I'm following your understanding correctly, you're saying that the L3 will be clocked from the highest-frequency core in the CPU (2GHz K10 --> 2GHz L3), which in my opinion is not the case.
Of course asynchronous clocking will add latency, but it might be a good trade-off compared to the gains in power/flexibility. (Besides, look at the L3 latency numbers; they are high for a CPU cache, so clearly we have lots of logic circuitry in between.)
Well, in the end we will find out shortly :)
Edit: I'm just thinking, why would AMD release different Phenom models with differently clocked HTT buses (from official roadmaps)?? The answer could be that together with increased HTT speed the L3 cache (and IMC) is also clocked higher, and that gives some tangible performance improvements.
;) I don't know how the L3 will be clocked; it will, however, need one clock, and as Informal and others push the detail envelope, I am beginning to understand some of the L3 details that I had otherwise not really considered.
Your edit could be correct too....
AMD has had quite a bit of experience getting the best clock/latency performance out of differently clocked agents; the IMC is a good example, as are the HT links, all of which time on clocks different from the core but put data into the core....
It is interesting but irrelevant; performance will be what it performs at overall.... and we are hoping it is better than the showing that started this thread.
Here are my results from my Opteron 2218. If you guys want me to run any test on my quad to compare to the Phenom, let me know.
http://i43.photobucket.com/albums/e3...NEBENCHR10.jpg
No problem ... when I get into detailed discussions like this, I tend to be verbose ... this being a public forum, a number of people read what we write and, because it is a forum, I post a lot of references and quotes... don't take that as an affront to your knowledge base .... what I do try to do is provide ample detail so others, who may not completely follow, gain some level of understanding... (it also helps me learn more as I go along)
Jack
sorry about that guys
[QUOTE=bobjr;2406885]<censored>[/QUOTE]
:) This is a good way to earn a ban. he... you edited it :)
[QUOTE=JumpingJack;2406887][QUOTE=bobjr;2406885]<censored>[/QUOTE][/QUOTE]Yea, it is, and I'm probably one of the easier-going Mods here. Quote:
:) This is a good way to earn a ban. he... you edited it :)
I think he and I will have a little talk..:D
:up:
@leoftw
How is your system configured memory-wise? Do you have DIMMs plugged into both CPU sockets?
Can you run SuperPi? It is not x64-optimized, so scores will be very comparable with your system. Same goes for the CPU-Z cache latency test.
Thanks for your effort! :up:
EDIT: I just noticed an over 4x speedup in the multi-CPU test! Why is that?? Have you done the 1-CPU test at lower clocks???
Excellent stuff leoftw. That's exactly what we need for a compo.
Any chance you could run the other single-threaded benchmarks we saw?
SuperPi 1M
CPUmark99
informal
http://cbid.at.tut.by/work/L3Assoc.gifQuote:
Go laugh on the floor at yourself.
http://www.techarp.com/showarticle.a...tno=424&pgno=2
This cache is 32-way set associative and is based on a non-inclusive victim cache architecture.
This is from BIOS and Kernel Developer's Guide for AMD Family 10h Processors documentation.
This is not surprising.... the associativity of a cache relates to the number of cache lines (or memory blocks) allocated to each set. With the number of sets held fixed, the associativity will increase with the size of the cache, and since AMD will ultimately raise or lower the L3 cache size, the set associativity must change.
For example, AMD has a 2 MB, 32-way associative L3; a 4 MB one would be 64-way set associative, and a 6 MB one 96-way associative. Since they allow an associativity of 16 in their BIOS guide, it appears AMD may at some point be willing to release a 1 MB L3 cache chip (perhaps; just because it is there does not mean there are plans).
Intel's associativity for Wolfdale will be 24-way for 6 MB but 12-way for 3 MB. Intel has not changed their caching for Wolfdale over Conroe other than raw size, since their 2 MB Allendale is 8-way associative while the 4 MB Conroe is 16-way associative.
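The arithmetic behind these associativity numbers is just ways = size / (line_size x sets), with the set count held fixed. A sketch using assumed parameters (64-byte lines and 1024 sets, which is what 2 MB at 32-way would imply):

```python
def ways(cache_bytes, line_bytes, num_sets):
    """With line size and set count fixed, associativity scales
    linearly with cache size: ways = size / (line * sets)."""
    return cache_bytes // (line_bytes * num_sets)

LINE_BYTES = 64   # assumed cache line size
NUM_SETS = 1024   # implied by 2 MB / (64 B * 32 ways)
for mb in (1, 2, 4, 6):
    print(f"{mb} MB L3 -> {ways(mb * 2**20, LINE_BYTES, NUM_SETS)}-way")
```

Under these assumptions a 1 MB L3 comes out at exactly 16-way, matching the 16-way entry in the BIOS guide.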
Jack
I wonder, will the L3 cache in K10 also act as a snoop filter in multiprocessor systems?
JumpingJack, a shared L3 cache is a configurable part of the Northbridge. The Northbridge may not include the L3 cache at all.