However, since each core throttles depending on its load, each core's clock will differ from the L3 clock, which necessitates an asynchronous bus so that each core can still access the data.

Now, asynchronous communication will always have variable latency, since one agent has to wait on the other at some point. For example, say you have a 6:5 clock ratio between two agents, A and B (so 6:5, A:B). To keep it easy, assume a 1-bit line: in 5 clock ticks agent B moves 5 bits, but in the same time agent A has put 6 ticks' worth into the queue, so one is left hanging until the next time around. Over time this makes no difference, but agent A is effectively only as fast as agent B.
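To put rough numbers on the 6:5 example, here is a small Python sketch. It is only a toy model under my own assumptions, not how any real chip does it: time is counted in abstract units, A writes one bit per A tick, B reads one bit per B tick, the queue is unbounded, and there is no synchronizer overhead. The function name `simulate` and its parameters are made up for illustration.

```python
from collections import deque

def simulate(a_ticks=6, b_ticks=5, periods=1):
    """Toy model: A ticks a_ticks times and B ticks b_ticks times per period.

    Returns (latencies of the bits B has read, bits still left in the queue).
    Assumptions: unbounded queue, one bit per tick, no synchronizer delay.
    """
    span = a_ticks * b_ticks      # abstract time units in one period
    a_step = span // a_ticks      # time between A ticks
    b_step = span // b_ticks      # time between B ticks
    queue = deque()               # holds the write time of each pending bit
    latencies = []
    for t in range(span * periods):
        if t % a_step == 0:       # A tick: write one bit
            queue.append(t)
        if t % b_step == 0 and queue:   # B tick: read the oldest bit
            latencies.append(t - queue.popleft())
    return latencies, len(queue)

latencies, leftover = simulate()
print(latencies, leftover)
```

In this model the wait per bit grows across the period and one bit is still sitting in the queue at the end of it. With more periods the backlog keeps growing, so in practice A has to stall, which is the "A is only as fast as B" part.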
Yes Jack, this is a simplified explanation of why it occurs.
I thought you had seen Kanter's article long ago (it has been online since the middle of May, I think).