How is it compiled?
What does the code look like?
Compare reading one byte per read with reading 8 bytes per read, and you will see huge differences. Aligning reads to cache lines (the size of each line) will improve speed.
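To make the byte-versus-8-byte point concrete, here is a minimal sketch in C. The function names (sum_bytewise, sum_wordwise) and the choice of summing a buffer are my own assumptions for illustration, not the code that was actually benchmarked. Both functions return the same result; only the width of each read differs.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical example: sum a buffer using one byte per read. */
uint64_t sum_bytewise(const unsigned char *buf, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

/* Hypothetical example: sum the same buffer using 8 bytes per read. */
uint64_t sum_wordwise(const unsigned char *buf, size_t n)
{
    uint64_t sum = 0;
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        /* memcpy avoids alignment and aliasing problems; compilers
           typically turn it into a single 8-byte load. */
        memcpy(&w, buf + i, sizeof w);
        /* Add up the 8 bytes held in the word (byte order does not
           matter for a sum). */
        for (int b = 0; b < 8; b++)
            sum += (w >> (8 * b)) & 0xFF;
    }

    /* Handle the remaining 0..7 bytes one at a time. */
    for (; i < n; i++)
        sum += buf[i];

    return sum;
}
```

Time both over a buffer much larger than the cache and at different optimization levels. An optimizing compiler may vectorize the byte loop on its own, which is one reason the compile flags matter as much as the source.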
Or why not optimize it with SSE2?
You can't just take one test and say "this is how fast it is". Bad code isn't fast.
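As a rough idea of what an SSE2 version could look like, here is a sketch that sums the bytes of a buffer 16 at a time. It assumes an x86 or x86-64 target with SSE2 available, and the function name sum_sse2 is hypothetical; the actual workload may need a different approach.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum all bytes of a buffer with SSE2. */
uint64_t sum_sse2(const unsigned char *buf, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();  /* two 64-bit partial sums */
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        /* Unaligned 16-byte load; _mm_load_si128 could be used instead
           if the buffer is known to be 16-byte aligned. */
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        /* Sum of absolute differences against zero adds up each group
           of 8 bytes into one 64-bit lane. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero));
    }

    /* Combine the two lanes. */
    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, acc);
    uint64_t sum = lanes[0] + lanes[1];

    /* Handle the remaining 0..15 bytes one at a time. */
    for (; i < n; i++)
        sum += buf[i];

    return sum;
}
```

Whether this beats what the compiler generates from a plain loop depends on the compiler and flags, so it needs to be measured rather than assumed.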
L3 cache is shared among cores.