K10 Scores starting to surface

**JumpingJack** · 09-01-2007, 03:21 PM

Originally Posted by Lightman

This is not entirely true...

I read somewhere long time ago that L3 in K10 is acting more like memory layer. In other words it is clocked by IMC independently from all 4 cores and on diagram I would put it after CrossBar...
That's why L3 latency can vary from core point of view (cache latency itself is probably constant). It is similar to how DDR2-800 latency (again from CPU point of view) is different compared to DDR2-667 (same timings of course

).

Edit: JumpingJack you typing too fast

I barely read page 16 and typed my response and here surprise! another page with new info making my post partially obsolete

I am not sure I undestand if you understand what I am trying to say

...

A shared resource clocked at one speed to 4 other resources clocked at different speeds will necessitate asyncronous communications... there is no other way... thus AMD must provide functionality to account for floating clocks between 4 cores to one memory pool, L3.... just adding circuits to do this work will incur latency...

Add on top of that, 1:1 divide latency < 3:2 divider latency < 2:1 divider latnecy... hence the 'observed' latency from any core is variable...... at least if you read Kanter's article this is what the FIFO buffers do... he did not mention the x-bar.

There is research ongoing to work on achieving both low BW and low latency asynchronous networking, but there has always been this fundamental trade-off:

Previously published NoCs which provide GS are ÆTHEREAL [18][9] and NOSTRUM [14]. Both are synchronous and employ variants of time division multiplexing (TDM) for providing per connection bandwidth (BW) guarantees. TDM has the drawback of the connection latency being inversely proportional to the BW, thus connections with low BW and low latency requirements, e.g. interrupts, are not supported.

http://www.ee.technion.ac.il/courses...OC-async05.pdf

Not quite the paper I would use, but the one I could find recently written that summarized the issue at hand that I could quote as a source and not have you take my word for it .... i.e. connection latency is hard to get very low in networks where a globalized clock is not real.... here he discusses time division multiplexing, a type of clock dividing.

Edit: Found another paper which is much more detailed, and has some info on the FIFO implementation over a global clock:

Simulation results for the FIFO and the two versions of the adder are given in Table 1. The
optimized adder has 2-input c-elernents while the other adder is using 4-input C-elements.
The operations/second indicate the number of logic evaluations done pcr second in each
basic cell. Cycle time is the fastest time at which the pipeline cm send out successive data
values. Latency is the time it takes for data to go from the input of the circuit until it is
finally ready at the output. Pipelined systems work on the principle of reducing the cycle
time at the cost of increased latency. The next section examines how an enhancement to
the system cm reduce the latency even further.

http://www.collectionscanada.ca/obj/...11/MQ34126.pdf
(see page 73). This is an old paper, but he is showing 18 ns latency for a straight up FIFO buffer. This is a large number, and not to be considered true or accurate wrt K10.

Jack

Thread: K10 Scores starting to surface

Thread Tools

Search Thread

Rate This Thread

Display

Threaded View

Bookmarks

Bookmarks

Posting Permissions