
Thread: Project "True 4x4"


  1. #14
    Xtreme Cruncher
    Join Date
    Apr 2005
    Location
    TX, USA
    Posts
    898
    Sorry, jcool, if you feel I'm detracting too much from the thread; it's the WCG forum after all

    I would say try that program mreuter80 linked to, since the registers SHOULD be the same. The catch is that you're running 4 physical CPUs, so hopefully it knows how to address/configure all of them. I don't know the specific registers involved, but doing a registry dump might shed some light on the matter as well; the WinRAR trick might be much easier, though.

    The only reason I bring up NUMA is because if your tests are trying to pull from the wrong bank with the wrong CPU, then you'll definitely feel a performance hit (CPU0 -> HT -> CPU1's MemCtrlr/RAM -> HT -> CPU0). I've never had the chance to use a NUMA machine myself though, so I don't know if it would be a problem by default or not.
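    To put rough numbers on that remote-access round trip, here is a back-of-envelope sketch. The latencies are illustrative assumptions made up for the example, not measurements from this board; the point is just that a remote access pays two HyperTransport hops on top of the memory access itself.

    ```python
    # Toy model of the NUMA round trip described above:
    # CPU0 -> HT -> CPU1's MemCtrlr/RAM -> HT -> CPU0.
    # All latencies are illustrative assumptions, not measured values.

    LOCAL_MEM_NS = 60.0   # assumed: CPU hitting its own memory controller
    HT_HOP_NS = 40.0      # assumed: one HyperTransport hop, one way

    def access_latency_ns(remote: bool) -> float:
        """Effective memory latency; a remote access pays two HT hops
        (request out, data back) on top of the memory access itself."""
        penalty = 2 * HT_HOP_NS if remote else 0.0
        return LOCAL_MEM_NS + penalty

    local = access_latency_ns(remote=False)
    remote = access_latency_ns(remote=True)
    print(f"local:  {local:.0f} ns")
    print(f"remote: {remote:.0f} ns ({remote / local:.2f}x slower)")
    ```

    Even with these made-up numbers the remote path comes out more than twice as slow, which is why scheduling a thread against the wrong memory bank shows up in benchmarks.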


    Quote Originally Posted by Chumbucket843 View Post
    lol i know BP and TLB are very different things. i was comparing the miss penalty(even though the penalty can be much worse for p4). wouldnt a full pipeline flush be worse than a cache miss though?
    the fix should already be enabled in the bios. here is an article for the patch: http://techreport.com/articles.x/13741/ latency is actually worse with it on but its better than a system hang.
    K, I just found the analogy a little bit on the far side at the time, so I had to say my 2 cents
    (was bored at work waiting for a simulation to finish )

    As for pipeline flushes vs cache misses, that all depends on the pipeline length and the memory subsystem design; it's quite situation/implementation dependent. Also, I assume you're mainly referring to flushes caused by branch mispredicts, though quite a few other things can cause them as well.
    You could say a pipeline flush caused by a branch mispredict is often (but not always, as below) bounded in cycles by the pipeline's length (baaad for the P4). A cache miss, on the other hand, could itself trigger a pipeline flush (since subsequently issued instructions might depend on that load/store), and on top of that the miss has to wait for the retrieval from L2/RAM/storage, which also varies with outstanding requests; the transaction could take tens to hundreds of thousands of cycles, so probably much longer than the pipeline flush.
    Note that I'm mainly referring to data cache misses; with an instruction cache miss you just plain have to wait for it to load from memory before you can refill the pipeline, which could happen on a mispredict if the prefetcher didn't do its job well enough.
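    A rough cycle-count comparison makes the gap concrete. The pipeline depths and miss latencies below are ballpark assumptions for illustration (the ~31-stage figure is in the neighborhood of a Prescott-era P4, but none of these are datasheet numbers):

    ```python
    # Rough cycle costs; pipeline depths and miss latencies here are
    # illustrative assumptions, not specs for any particular chip.

    PIPELINE_DEPTH = {"short_pipe": 12, "p4_style_long_pipe": 31}

    def mispredict_cost(core: str) -> int:
        """A mispredict flush is roughly bounded by pipeline depth."""
        return PIPELINE_DEPTH[core]

    def miss_cost(level: str) -> int:
        """A data cache miss waits on the next level of the hierarchy
        (assumed latencies in cycles)."""
        return {"l2": 20, "ram": 300, "storage": 5_000_000}[level]

    for core in PIPELINE_DEPTH:
        print(f"{core}: flush costs ~{mispredict_cost(core)} cycles")
    for level in ("l2", "ram", "storage"):
        print(f"miss serviced by {level}: ~{miss_cost(level)} cycles")
    ```

    So with these assumptions a flush on even a long pipeline is in the tens of cycles, while a miss that goes all the way to RAM (let alone storage) dwarfs it, which matches the point above about the miss usually being the bigger hit.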

    Now if you throw SMT into the mix, comparing things gets even more fun: everything pulls from the same caches/etc., but pipeline flushes can now be marginalized since they're thread-specific. The kickback is that total throughput/efficiency gets a good boost, something we can see when running all the WCG workunit threads, where we care about the total output
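    That "flushes get marginalized" effect can be sketched with a toy utilization model. The assumption here (perfect overlap, independent stall patterns between threads) is idealized, so treat the numbers as an upper bound on what SMT buys you:

    ```python
    # Toy model: with SMT, a flush only stalls the thread that
    # mispredicted; the sibling thread can keep issuing, so total
    # throughput suffers less. Assumes independent, perfectly
    # overlapped stalls, which is an idealization.

    def throughput(threads: int, flush_fraction: float) -> float:
        """Relative issue-slot utilization of one core.
        Each thread loses flush_fraction of its cycles to flushes;
        with SMT the core only idles when every thread is flushing
        at the same time."""
        if threads == 1:
            return 1.0 - flush_fraction
        return 1.0 - flush_fraction ** threads

    single = throughput(1, 0.2)
    smt = throughput(2, 0.2)
    print(f"1 thread:  {single:.2f} of peak")
    print(f"2 threads: {smt:.2f} of peak")
    ```

    With 20% of cycles lost to flushes per thread, the single-threaded core runs at 0.80 of peak while the idealized 2-way SMT core runs at 0.96, since the core is only idle when both threads stall simultaneously. Real gains are smaller because the threads also fight over the shared caches, as noted above.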
    Last edited by rcofell; 08-24-2009 at 09:33 PM.


