True, results on disk aren't stable. But the "benchmark" will be extremely long-running: hours to days, depending on the size. Any short-term fluctuation should average out quite well over a run that long.
The only problem with the concept of a Pi-based HD benchmark is that the winner will pretty much be whoever has the most RAM and the most HDs running in parallel.
So there's almost no point in overclocking unless you have enough disk bandwidth to become CPU-limited.
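To put rough numbers on that, here is a minimal sketch of the bottleneck argument. It assumes compute and disk I/O overlap perfectly, so whichever is slower sets the total runtime. The function name and every figure in it are made up for illustration, not measurements from any real run.

```python
def estimated_runtime(cpu_seconds, bytes_moved, disk_bandwidth, cpu_speedup=1.0):
    """Crude model: runtime is set by the slower of compute and disk I/O.

    Assumes perfect overlap between computation and disk traffic,
    which is an idealization.
    """
    cpu_time = cpu_seconds / cpu_speedup      # overclocking only shrinks this term
    io_time = bytes_moved / disk_bandwidth    # bytes / (bytes per second)
    return max(cpu_time, io_time)


# Hypothetical figures, chosen only to illustrate the two regimes.
CPU_SECONDS = 10_000.0   # compute time at stock clocks
BYTES_MOVED = 50e12      # total disk traffic over the whole run
SLOW_DISKS = 200e6       # a single drive, about 200 MB/s
FAST_DISKS = 10e9        # a large parallel array, about 10 GB/s
OVERCLOCK = 1.2          # 20% faster CPU

# Disk-bound: the overclock changes nothing.
stock_slow = estimated_runtime(CPU_SECONDS, BYTES_MOVED, SLOW_DISKS)
oc_slow = estimated_runtime(CPU_SECONDS, BYTES_MOVED, SLOW_DISKS, OVERCLOCK)

# CPU-bound: the overclock finally shows up in the result.
stock_fast = estimated_runtime(CPU_SECONDS, BYTES_MOVED, FAST_DISKS)
oc_fast = estimated_runtime(CPU_SECONDS, BYTES_MOVED, FAST_DISKS, OVERCLOCK)

print(stock_slow, oc_slow)
print(stock_fast, oc_fast)
```

With the slow disk both runs take the same time, because the I/O term dominates. Only once the array is fast enough that the CPU term is the larger one does the overclock shorten the run.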