Just to be accurate, the Celeron uses the same deeply pipelined architecture as the P4, so when prefetching fails, flushing the whole pipeline and restarting takes a shedload longer on a P4 than on an Athlon 64 or Athlon XP. That means cache is FAR more important to a P4-type CPU than to an Athlon 64 or XP. This is why the difference between 256 KB and 512 KB of cache is a lot smaller on the Athlons than on the P4, and the same goes for the other gaps.
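
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption made up for illustration: the function name, the penalty cycle counts, and the miss rates are not measurements of any real chip. It only shows that the same rise in miss rate hurts more when each stall costs more cycles.

```python
def slowdown_from_smaller_cache(base_cpi, penalty_cycles,
                                miss_rate_big, miss_rate_small):
    """Ratio of cycles-per-instruction with the smaller cache vs the bigger one.

    Simple model: CPI = base_cpi + miss_rate * penalty_cycles.
    All inputs are hypothetical, not benchmark data.
    """
    cpi_big = base_cpi + miss_rate_big * penalty_cycles
    cpi_small = base_cpi + miss_rate_small * penalty_cycles
    return cpi_small / cpi_big


# Hypothetical "deep pipeline" CPU (P4-like): each stall costs 30 cycles.
deep = slowdown_from_smaller_cache(1.0, 30, 0.02, 0.03)

# Hypothetical "short pipeline" CPU (Athlon-like): each stall costs 12 cycles.
short = slowdown_from_smaller_cache(1.0, 12, 0.02, 0.03)

print(f"deep pipeline slowdown:  {deep:.3f}x")
print(f"short pipeline slowdown: {short:.3f}x")
```

With the same assumed jump in miss rate, the deep-pipeline case comes out with the bigger slowdown, which is the 256 vs 512 point in a nutshell.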
