Nahhhh, we would get bored.
Yeah, and I cannot believe I was so ignorant not to think this through. The L3 cache is shared, but each core is throttled independently. The overall intrinsic latency of the L3 will, like any cache, be fixed by the size and quality of the process technology as well as the speed paths set at design.
However, since each core will throttle depending on load, the clock for each core can be different from the L3's... necessitating an asynchronous bus so that each core can still access the data....
Now, simple asynchronous communications will always have variable latency (as a function of the ratio or divider), as one agent will need to wait on the other at some point. Example: say you have a 6:5 divider; call the agents A and B, so it is 6:5 A:B. To make this easy, let's say a 1-bit line, so in 5 clock ticks it will send 5 bits for agent B, but agent A has put 6 clock ticks into the queue; one cycle will be left hanging until the next revolution around.... temporally this would make no difference, but agent A is only as fast as agent B.....
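The 6:5 divider example can be sketched as a tiny event simulation. This is a hypothetical toy model (not AMD's actual implementation): agent A enqueues one bit per A tick, agent B drains one bit per B tick, and after one full revolution of the common period one bit is left hanging.

```python
from fractions import Fraction

def leftover_after_revolution(a_ticks, b_ticks):
    """One 'revolution' is the common period of the two clocks.
    A enqueues 1 bit per A tick; B dequeues 1 bit per B tick (if any)."""
    events = sorted(
        [(Fraction(i + 1, a_ticks), "A") for i in range(a_ticks)] +
        [(Fraction(j + 1, b_ticks), "B") for j in range(b_ticks)]
    )
    queue = 0
    for _, agent in events:
        if agent == "A":
            queue += 1      # producer tick: a bit goes into the queue
        elif queue > 0:
            queue -= 1      # consumer tick: a bit is drained
    return queue

print(leftover_after_revolution(6, 5))  # 6:5 divider -> 1 bit left hanging
```

With a 1:1 divider nothing is left over, which is why a fully synchronous interface does not pay this particular cost.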
But you also have to add into the mix the physical latency of the circuit to do this work.... it is a trade off, one that AMD obviously believes is better in the long run... so long as L3 'observed' latency is much less than that to main memory, there is a benefit.
Yes Jack, this is a simplified reason why it occurs. Quote:
However, since each core will throttle depending on load, the clock for each core will be different from the L3's... necessitating an asynchronous bus so that each core can still access the data....
Now, simple asynchronous communications will always have variable latency, as one agent will need to wait on the other at some point. Example: say you have a 6:5 divider; call the agents A and B, so it is 6:5 A:B. To make this easy, let's say a 1-bit line, so in 5 clock ticks it will send 5 bits for agent B, but agent A has put 6 clock ticks into the queue; one cycle will be left hanging until the next revolution around.... temporally this would make no difference, but agent A is only as fast as agent B.....
I thought you saw Kanter's article long ago (it has been online since the middle of May, I think).
This is not entirely true...
I read somewhere a long time ago that the L3 in K10 acts more like a memory layer. In other words, it is clocked by the IMC independently from all 4 cores, and on a diagram I would put it after the crossbar...
That's why L3 latency can vary from the core's point of view (the cache latency itself is probably constant). It is similar to how DDR2-800 latency (again, from the CPU's point of view) differs from DDR2-667 (same timings, of course :) ).
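The DDR2 analogy works out numerically: the same CAS cycle count takes more wall-clock time at a lower clock. A quick back-of-the-envelope sketch (CL5 is an assumed, illustrative timing):

```python
def cas_latency_ns(cas_cycles, data_rate_mts):
    """CAS latency in nanoseconds. DDR transfers twice per I/O clock,
    so the bus clock in MHz is half the data rate in MT/s."""
    bus_clock_mhz = data_rate_mts / 2.0
    return cas_cycles * 1000.0 / bus_clock_mhz

print(cas_latency_ns(5, 800))  # DDR2-800 at CL5 -> 12.5 ns
print(cas_latency_ns(5, 667))  # DDR2-667 at CL5 -> ~15 ns
```

Same timings, slower clock, more nanoseconds, exactly as seen from the CPU side.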
Edit: JumpingJack, you're typing too fast :) I had barely read page 16 and typed my response, and here, surprise! another page with new info making my post partially obsolete :)
I am not sure I understand whether you understand what I am trying to say :) ...
A shared resource clocked at one speed serving 4 other resources clocked at different speeds will necessitate asynchronous communications... there is no other way... thus AMD must provide functionality to account for floating clocks between 4 cores and one memory pool, the L3.... just adding the circuits to do this work will incur latency...
Add on top of that, 1:1 divider latency < 3:2 divider latency < 2:1 divider latency... hence the 'observed' latency from any core is variable...... at least if you read Kanter's article, this is what the FIFO buffers do... he did not mention the x-bar.
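The divider ordering can be checked with a small model: assume both clocks share an aligned edge at t = 0, and measure the average wait from a source-clock edge to the next destination-clock edge over one common period. This is a simplification that ignores synchronizer stages and FIFO depth, so the numbers only illustrate the trend.

```python
from fractions import Fraction

def avg_crossing_wait(src_ticks, dst_ticks):
    """Average wait from a source-clock edge to the next destination-clock
    edge, in units of the common period (edges aligned at t = 0)."""
    dst_edges = [Fraction(j + 1, dst_ticks) for j in range(dst_ticks)]
    waits = []
    for i in range(src_ticks):
        t = Fraction(i + 1, src_ticks)
        waits.append(min(e - t for e in dst_edges if e >= t))
    return sum(waits) / len(waits)

for ratio in [(1, 1), (3, 2), (2, 1)]:
    # 1:1 waits 0; 3:2 averages 1/6; 2:1 averages 1/4 of the common period
    print(ratio, avg_crossing_wait(*ratio))
```

So even in this idealized model, 1:1 < 3:2 < 2:1 for the extra crossing wait.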
There is ongoing research on asynchronous networks that can support connections with both low BW and low latency requirements, but there has always been this fundamental trade-off:
http://www.ee.technion.ac.il/courses...OC-async05.pdfQuote:
Previously published NoCs which provide GS are ÆTHEREAL [18][9] and NOSTRUM [14]. Both are synchronous and employ variants of time division multiplexing (TDM) for providing per connection bandwidth (BW) guarantees. TDM has the drawback of the connection latency being inversely proportional to the BW, thus connections with low BW and low latency requirements, e.g. interrupts, are not supported.
Not quite the paper I would use, but a recently written one I could find that summarizes the issue at hand and that I could quote as a source, so you don't have to take my word for it.... i.e. connection latency is hard to get very low in networks where a global clock is not real.... here he discusses time division multiplexing, a type of clock dividing.
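The TDM drawback the quote describes can be made concrete with a toy schedule: granting a connection k of F evenly spaced slots gives it a bandwidth fraction k/F, but its worst-case wait for the next slot grows as F/k, so guaranteed latency is inversely proportional to guaranteed BW. The slot count and slot time below are made-up numbers for illustration.

```python
def tdm_guarantees(frame_slots, slots_granted, slot_ns):
    """Guarantees for a connection granted k of F evenly spaced TDM slots:
    bandwidth fraction k/F, worst-case wait of about F/k slot times."""
    bw_fraction = slots_granted / frame_slots
    worst_wait_ns = (frame_slots / slots_granted) * slot_ns
    return bw_fraction, worst_wait_ns

for k in (1, 2, 4, 8):
    bw, wait = tdm_guarantees(16, k, 2.0)
    print(f"{k}/16 slots: BW fraction {bw:.3f}, worst wait ~{wait:.1f} ns")
```

Note that the product BW x wait is constant, which is exactly why a low-BW, low-latency connection (like an interrupt) cannot be served by plain TDM.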
Edit: Found another paper which is much more detailed, and has some info on the FIFO implementation over a global clock:
http://www.collectionscanada.ca/obj/...11/MQ34126.pdfQuote:
Simulation results for the FIFO and the two versions of the adder are given in Table 1. The optimized adder has 2-input C-elements while the other adder is using 4-input C-elements. The operations/second indicate the number of logic evaluations done per second in each basic cell. Cycle time is the fastest time at which the pipeline can send out successive data values. Latency is the time it takes for data to go from the input of the circuit until it is finally ready at the output. Pipelined systems work on the principle of reducing the cycle time at the cost of increased latency. The next section examines how an enhancement to the system can reduce the latency even further.
(see page 73). This is an old paper, but he is showing 18 ns latency for a straight-up FIFO buffer. This is a large number, and is not to be considered true or accurate wrt K10.
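The cycle-time-vs-latency principle in the quoted passage is easy to see with a toy pipeline model: splitting a fixed amount of logic into more stages shortens the cycle time but adds one register/handshake overhead per stage to the end-to-end latency. The 16 ns of logic and 0.5 ns of per-stage overhead below are arbitrary illustrative numbers, not anything measured.

```python
def pipeline_timing(logic_ns, stages, overhead_ns):
    """Split logic_ns of combinational work into equal stages, paying
    overhead_ns of register/handshake delay per stage."""
    cycle = logic_ns / stages + overhead_ns   # fastest issue rate
    latency = cycle * stages                  # input-to-output time
    return cycle, latency

for n in (1, 2, 4):
    cycle, latency = pipeline_timing(16.0, n, 0.5)
    print(f"{n} stage(s): cycle {cycle:.1f} ns, latency {latency:.1f} ns")
```

More stages means higher throughput but a longer trip through the circuit, which is the trade-off the paper's FIFO numbers reflect.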
Jack
I understand what you're trying to say; that's why I put the edit.
As I said, I read a long time ago (probably on RWT, but not coming from DK) that the L3 will operate in a similar way to normal memory, and it will be possible to clock it independently from the cores.
If I'm following your understanding correctly, you're saying that the L3 will be clocked from the highest-frequency core in the CPU (2GHz K10 --> 2GHz L3), which in my opinion is not the case.
Of course asynchronous clocking will add latency, but it might be a good trade-off compared to the gains in power/flexibility. (Besides, look at the L3 latency numbers; they are high for a CPU cache, so clearly we have lots of logic circuitry in between.)
Well, in the end we will find out shortly :)
Edit: I'm just thinking, why would AMD release different Phenom models with differently clocked HTT buses (from official roadmaps)?? The answer could be that together with increased HTT speed the L3 cache (and IMC) is also clocked higher, and that gives some tangible performance improvements.
;) I don't know how the L3 will be clocked; it will, however, need one clock, and as Informal and others push the detail envelope, I am beginning to understand some of the L3 details that I had otherwise not really considered.
Your edit could be correct too....
AMD has had quite a bit of experience getting the best clock/latency performance out of differently clocked agents; the IMC is a good example, as are the HT links, all of which time on clocks different from the core but put data into the core....
It is interesting but irrelevant; performance will be what it performs at overall.... and we are hoping it is better than the showing that started this thread.
Here are my results from my Opteron 2218. If you guys want me to run any test on my quad to compare to the Phenom, let me know.
http://i43.photobucket.com/albums/e3...NEBENCHR10.jpg
No problem ... when I get into detailed discussions like this, I tend to be verbose ... this being a public forum, a number of people read what we write and, because it is a forum, I post a lot of references and quotes... don't take that as an affront to your knowledge base .... what I do try to do is provide ample detail so others, who may not completely follow, gain some level of understanding... (it also helps me learn more as I go along)
Jack
sorry about that guys
[QUOTE=bobjr;2406885]<censored>[/QUOTE]
:) This is a good way to earn a ban. he... you edited it :)
[QUOTE=JumpingJack;2406887][QUOTE=bobjr;2406885]<censored>[/QUOTE][/QUOTE]Yea, it is, and I'm probably one of the easier-going Mods here. Quote:
:) This is a good way to earn a ban. he... you edited it :)
I think he and I will have a little talk..:D
:up:
@leoftw
How is your system configured memory-wise? Do you have DIMMs plugged into both CPU sockets?
Can you run SuperPi? It is not x64-optimized, so scores will be very comparable with your system. Same goes for the CPU-Z cache latency test.
Thanks for your effort! :up:
EDIT: I just noticed an over 4x speedup in the multi-CPU test! Why is that?? Have you done the 1-CPU test at lower clocks???
Excellent stuff leoftw. That's exactly what we need for a compo.
Any chance you could run the other single-threaded benchmarks we saw?
SuperPi 1M
CPUmark99
informal
http://cbid.at.tut.by/work/L3Assoc.gifQuote:
Go laugh on the floor at yourself.
http://www.techarp.com/showarticle.a...tno=424&pgno=2
This cache is 32-way set associative and is based on a non-inclusive victim cache architecture.
This is from BIOS and Kernel Developer's Guide for AMD Family 10h Processors documentation.
This is not surprising.... the associativity of a cache relates to the number of cache lines (or memory blocks) allocated to each set. With the number of sets held fixed, the associativity will increase with the size of the cache, and since AMD will ultimately raise or lower the L3 cache size, the set associativity must change.
For example, AMD has a 2 MB, 32-way associative L3; a 4 MB one would be 64-way set associative, and a 6 MB one 96-way associative. Since they allow an associativity of 16 in their BIOS guide, it appears AMD may at some point be willing to release a 1 MB L3 cache chip (perhaps; just because it is there does not mean there are plans).
Intel's associativity for Wolfdale will be 24-way for 6 MB but 12-way for 3 MB. Intel has not changed their caching for Wolfdale over Conroe other than raw size, since their 2 MB Allendale is 8-way associative while the 4 MB Conroe is 16-way associative.
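The arithmetic behind these associativity numbers is just ways = size / (line_size x sets), with the set count held fixed. A sketch using assumed parameters (64-byte lines and 1024 sets, which is what 2 MB at 32-way would imply):

```python
def ways(cache_bytes, line_bytes, num_sets):
    """With line size and set count fixed, associativity scales
    linearly with cache size: ways = size / (line * sets)."""
    return cache_bytes // (line_bytes * num_sets)

LINE_BYTES = 64   # assumed cache line size
NUM_SETS = 1024   # implied by 2 MB / (64 B * 32 ways)
for mb in (1, 2, 4, 6):
    print(f"{mb} MB L3 -> {ways(mb * 2**20, LINE_BYTES, NUM_SETS)}-way")
```

Under these assumptions a 1 MB L3 comes out at exactly 16-way, matching the 16-way entry in the BIOS guide.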
Jack
I wonder, will the L3 cache in K10 also act as a snoop filter in multiprocessor systems?
JumpingJack, a shared L3 cache is a configurable part of the Northbridge. The Northbridge may not include the L3 cache at all.