My guess is that the BOINC benchmark only stresses four cores, or perhaps the workload it runs doesn't keep the low-level caches very busy. The workunits themselves are more complicated and much harder on cache and RAM, so the difference between four and eight threads pulling bandwidth from the caches is far more tangible in the "real world" scenario than in the benchmark scenario.
Just a conjecture.
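To make the idea concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it is made up for illustration (the cache size and the per-thread working sets are assumptions, not measurements of BOINC or of any real CPU), and `fits_in_cache` is just a hypothetical helper. It only checks whether the combined working set of all threads would fit in a shared cache, which is the effect I'm guessing at.

```python
def fits_in_cache(working_set_kib, threads, shared_cache_kib):
    """Return True if the combined working set of all threads fits in the shared cache."""
    return working_set_kib * threads <= shared_cache_kib


# All sizes are hypothetical, chosen only to illustrate the argument.
SHARED_CACHE_KIB = 8 * 1024   # assumed 8 MiB shared last-level cache
BENCHMARK_WS_KIB = 64         # assumed small per-thread working set for the benchmark
WORKUNIT_WS_KIB = 1536        # assumed 1.5 MiB per-thread working set for a real workunit

if __name__ == "__main__":
    for label, ws in (("benchmark", BENCHMARK_WS_KIB), ("workunit", WORKUNIT_WS_KIB)):
        for threads in (4, 8):
            verdict = "fits" if fits_in_cache(ws, threads, SHARED_CACHE_KIB) else "spills to RAM"
            print(f"{label:9s} x {threads} threads: {verdict}")
```

With these assumed numbers the benchmark fits at both four and eight threads, while the workunit fits at four threads and spills at eight. Real behaviour depends on the actual cache hierarchy and access patterns, so treat this as a toy model only.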