Thread: Enter the Dragon: AMD AM3 6-core desktop arriving in 2010

  1. #11
    Xtreme Addict
    Join Date
    Aug 2004
    Location
    Austin, TX
    Posts
    1,346
    Quote Originally Posted by gosh View Post
    And that was what I wanted you to explain: how does prefetching get more hits in the cache without slowing down the work that is done?
    Quote Originally Posted by ajaidev View Post
    In prefetching it's all about the size, ain't it? Nehalem has small but fast L2s compared to K10/K10.5, so whatever doesn't fit in the L2 goes to L3. AMD's L2, even at 16-way, is faster to prefetch into than Nehalem's L3. When working with large data chunks, Nehalem may use its L3 more than its L2. On top of that, the associativity the L2 grants is no laughing matter. Take a look at Sandy Bridge and you will find the answer to the large-data-chunk problem; to me Sandy Bridge looks like a prefetching monster.
    OK sorry ajaidev, but I think you don't really understand prefetching.

    Prefetching is a method of predicting which data the CPU will request in the future based on its access patterns in the past. Simple prefetching mechanisms are next-line (prefetch the sequential pieces of memory after a load/store request), stride-based (detect a constant stride (e.g. 0, 4, 8, 12, 16, ...) and prefetch based off that stride), and target-based (keep track of branch targets that cause misses).
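    The stride-based mechanism can be sketched in a few lines of C. This is a hypothetical toy model, not any real CPU's implementation: it tracks a single load's last address and last stride, and only makes a prediction once the same stride has been seen twice in a row.

    ```c
    #include <stddef.h>

    /* Toy stride detector for one load instruction: remembers the last
     * address and last stride, and gains confidence when the same
     * nonzero stride repeats. */
    typedef struct {
        size_t last_addr;
        ptrdiff_t last_stride;
        int confident; /* set once the same stride is seen twice */
    } stride_entry;

    /* Feed in each accessed address; returns the predicted address to
     * prefetch, or 0 if there is no confident prediction yet. */
    size_t stride_predict(stride_entry *e, size_t addr)
    {
        ptrdiff_t stride = (ptrdiff_t)(addr - e->last_addr);
        e->confident = (stride != 0 && stride == e->last_stride);
        e->last_stride = stride;
        e->last_addr = addr;
        return e->confident ? addr + (size_t)stride : 0;
    }
    ```

    Feed it 0, 4, 8 and only on the third access (when the stride of 4 has repeated) does it predict 12. Real prefetchers keep a table of these entries indexed by the load's PC, but the confidence idea is the same.
    
    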

    Generally speaking, prefetch requests are the lowest-priority accesses to memory, meaning they will not "slow down the work that is done". However, there can be cases where a too-aggressive prefetching mechanism evicts cache lines that will be used in the future. For most applications, this problem is avoided by profiling the application and tuning the prefetching algorithm.
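    Software prefetching works the same way: the request is only a hint, so the hardware is free to drop it rather than stall a demand load. A sketch using GCC/Clang's `__builtin_prefetch` (the linked-list walk is just an illustrative workload, a classic case hardware stride prefetchers can't predict):

    ```c
    #include <stddef.h>

    struct node { int value; struct node *next; };

    /* Sum a linked list, hinting the next node into cache while the
     * current one is processed. The prefetch never faults and never
     * blocks; if the memory system is busy, it is simply dropped. */
    int sum_list(const struct node *n)
    {
        int total = 0;
        while (n != NULL) {
            if (n->next != NULL)
                __builtin_prefetch(n->next, 0 /* read */, 1 /* low temporal locality */);
            total += n->value;
            n = n->next;
        }
        return total;
    }
    ```

    The result is identical with or without the hint; only the miss latency on the next iteration changes.
    
    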

    As a general rule of thumb, prefetching improves the performance of most applications regardless of the amount of bandwidth available, even with "low latency" DDR. Remember that accessing "low latency" DDR3 is still a ~50-70ns round trip, which translates to at least 150 cycles at 3GHz. Compare this to even the slowest 30-cycle L3. The fact that prefetching is done even from L3 -> L2 or L2 -> L1 (moving lines from lower-level caches into higher-level ones), which saves a "mere" 10 or so cycles, should clue you in on how important it is.
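    The arithmetic above drops straight into a simple average-memory-access-time (AMAT) model. The cycle counts are the illustrative numbers from this post (4-cycle L1 hit is my assumption), not measurements of any specific chip:

    ```c
    /* Average memory access time in cycles: hit time plus the fraction
     * of accesses that miss all the way to DRAM times the DRAM penalty.
     * A prefetcher that covers half of those misses halves miss_rate. */
    double amat(double hit_cycles, double miss_rate, double dram_cycles)
    {
        return hit_cycles + miss_rate * dram_cycles;
    }
    ```

    With a 4-cycle L1, a 5% miss-to-DRAM rate and a 180-cycle penalty (60ns at 3GHz), AMAT is 4 + 0.05 x 180 = 13 cycles; let prefetching cover half those misses and it drops to 8.5. That's a ~35% cut in average access time from hiding just half the DRAM trips.
    
    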
    Last edited by Shadowmage; 09-03-2009 at 11:25 AM.
