
Thread: Applications and Resources for Bit Error Recovery in Stored Data

  1. #1
    Xtreme Mentor
    Join Date
    Sep 2006
    Posts
    3,246

    Applications and Resources for Bit Error Recovery in Stored Data

    I've seen several posts on this topic lately. I was thinking it might be useful for people to post up sources and resources related to this topic. Specifically applications (any OS) and techniques.

    I have some to start off:

    ICE ECC (Windows - FREE):
    http://www.ice-graphics.com/ICEECC/IndexE.html

    Quickpar for creating file parity recovery info (Windows only - FREE): <---- Has a memory leak in some circumstances. Not sure why. Might be on 64 bit systems only. (comment added 6/7/09)
    http://www.quickpar.org.uk/

    NEW: HashCheck Shell Extension
    http://www.ktechcomputing.com/hashcheck/



    Hermetic File Monitor for monitoring for changes in files (Windows - PAID):
    Hermetic File Monitor is for detecting changes to files in a folder, and for detecting additions or deletions of files in a folder. Multiple folders can be set up to be monitored. Optionally files in all subfolders of a folder can also be monitored. Optionally all files can be monitored or just files with a certain file extension such as 'xls'. Folders can be checked automatically at regular intervals. Results can be written to report files.

    This program works by recording digital signatures of files (using the MD5 message digest algorithm). If the digital signature of a certain file is the same at two successive recordings then the file has not changed. If the digital signature changes then that shows that the contents of the file have changed.
    http://www.hermetic.ch/solo/hfm.htm
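
    For anyone who wants the same basic behaviour without a paid tool, the idea is easy to sketch yourself: record an MD5 digest for every file, then re-scan later and compare. Here is a minimal Python sketch of that approach (the folder path and snapshot file name are just placeholders; this is the general idea, not how Hermetic implements it):

    Code:
    import hashlib, json, os

    SNAPSHOT = "md5_snapshot.json"  # placeholder snapshot file name

    def md5_of(path, chunk=1 << 20):
        """Return the MD5 hex digest of a file, read in chunks."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def scan(folder):
        """Map relative path -> MD5 for every file under folder."""
        digests = {}
        for root, _dirs, files in os.walk(folder):
            for name in files:
                full = os.path.join(root, name)
                digests[os.path.relpath(full, folder)] = md5_of(full)
        return digests

    def compare(folder):
        """Report added, removed and changed files since the last snapshot."""
        current = scan(folder)
        old = {}
        if os.path.exists(SNAPSHOT):
            with open(SNAPSHOT) as f:
                old = json.load(f)
        for path in sorted(set(old) | set(current)):
            if path not in old:
                print("ADDED  ", path)
            elif path not in current:
                print("REMOVED", path)
            elif old[path] != current[path]:
                print("CHANGED", path)
        with open(SNAPSHOT, "w") as f:
            json.dump(current, f, indent=2)

    if __name__ == "__main__":
        compare(r"D:\archive")  # placeholder folder to monitor

    Run it from Task Scheduler or cron at whatever interval you like; like the paid tool it only detects changes, it cannot repair them.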






    Good technical discussion here:
    http://www.xtremesystems.org/forums/...d.php?t=212417

    More discussion here:
    http://www.xtremesystems.org/forums/...d.php?t=213228
    Last edited by Speederlander; 06-07-2009 at 07:45 AM.

  2. #2
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Just a comment on the thread title: this does not reduce the occurrence of bit errors, it just lets the user detect them. Also, in that light, pretty much any tool that does file integrity checking of, say, MD4 strength or better (128-bit, roughly 2^64 hashes before a collision) will suffice and is readily available (both free and paid), though it may need some tweaking for individual purposes.

    At this time there is no tested mechanism that will 'reduce' the chance of errors occurring. There is an argument that by forcing media verifies, RAID scrubbing and file-space verification, errors can at least be brought to user-level attention (and in some, but not all, cases fixed, depending on the type of error and the storage subsystem's higher-level logic: ECC checks at the drive block level, RAID, checksumming file systems, et al.).

    Lakshmi Bairavasundaram has done some of the more recent work (among many others) in trying to dig into the problem; his site is http://pages.cs.wisc.edu/~laksh/. This is not a one-man issue, though, and it has been under study and discussion for the past couple of years.


  3. #3
    Xtreme Mentor
    Join Date
    Sep 2006
    Posts
    3,246
    Quote Originally Posted by stevecs View Post
    Just a comment on the thread title: this does not reduce the occurrence of bit errors, it just lets the user detect them. Also, in that light, pretty much any tool that does file integrity checking of, say, MD4 strength or better (128-bit, roughly 2^64 hashes before a collision) will suffice and is readily available (both free and paid), though it may need some tweaking for individual purposes.

    At this time there is no tested mechanism that will 'reduce' the chance of errors occurring. There is an argument that by forcing media verifies, RAID scrubbing and file-space verification, errors can at least be brought to user-level attention (and in some, but not all, cases fixed, depending on the type of error and the storage subsystem's higher-level logic: ECC checks at the drive block level, RAID, checksumming file systems, et al.).

    Lakshmi Bairavasundaram has done some of the more recent work (among many others) in trying to dig into the problem; his site is http://pages.cs.wisc.edu/~laksh/. This is not a one-man issue, though, and it has been under study and discussion for the past couple of years.
    Quickpar can be used to actually repair bit errors in the files. So that app goes beyond simply checking.

    When I say "reduce" I mean long-term protection against bit errors damaging data: allowing for the errors to occur and then repairing them when found. I realize that the rate of errors on a given storage medium is, for the most part, out of our control.

    I've changed the thread title from "reduction" to "recovery".
    Last edited by Speederlander; 01-04-2009 at 12:07 PM.

  4. #4
    Xtreme Addict
    Join Date
    Mar 2008
    Posts
    1,163
    Quickpar doesn't work from command line, right?
    http://en.wikipedia.org/wiki/Parchive references some programs capable of doing this.
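
    For what it's worth, one of the programs that page lists, the open-source par2cmdline, does work from the command line, so it can be scripted. A rough sketch of driving it from Python (this assumes par2cmdline is installed and on the PATH; the file name and the 10% redundancy are arbitrary examples):

    Code:
    import subprocess

    files = ["backup.tar"]  # placeholder file set to protect

    # Create recovery data (-r10 = roughly 10% redundancy).
    subprocess.run(["par2", "create", "-r10", "backup.tar.par2"] + files, check=True)

    # Later: verify, and attempt a repair only if verification fails.
    if subprocess.run(["par2", "verify", "backup.tar.par2"]).returncode != 0:
        subprocess.run(["par2", "repair", "backup.tar.par2"], check=True)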

    Additionally, some archivers allow you to add recovery data. For things like backups that's more convenient.
    The ones that claim to be able to do it are:
    WinRAR (rar too)
    FreeArc
    Squeez
    In my experience it doesn't work with Squeez, though. I asked their support - no reply.

  5. #5
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Yes, any Reed-Solomon or similar checksum-type utility (PAR uses a Reed-Solomon scheme) can correct a certain amount of bit damage, assuming you have enough block groups to cover the size of the damage, but that only covers some types of errors. As seen in recent hard-drive studies, latent block or bit errors on nearline drives (I have not seen a study on desktop drives, but would guess they are no better than nearline, perhaps worse) seem to be both spatially and temporally localized. The spatial clustering on nearline drives is usually within 10 MiB or so, which hints that those errors are probably incurred by different means than on enterprise drives (where the spatial range is much less than a single track, usually just consecutive sectors).

    Just as with the current studies on the subject, there needs to be a detailed analysis of the types of damage incurred. From the expanding data sets I've seen, there appear to be numerous types, and they show up as complex correlations.

    For practical applications, though, we have the same two basic needs we've always had: 1) to detect and 2) to recover. Pretty much any strategy that covers both will allow for recovery, including traditional methods such as tape, disk-to-disk, et al. Just remember that the errors can occur in your 'backup' as well (which is why multiple generations are generally kept). This is actually one of the main driving forces for faster supercomputer interconnects: the more data integrity you need, the more bandwidth, I/O, cycles and storage space (online and offline) are needed to verify the data.


  6. #6
    Xtreme Member
    Join Date
    Jan 2007
    Location
    Dorset, UK
    Posts
    439
    Just to add to this discussion... PAR2 is a highly robust format. I don't know how Hermetic File Monitor stores its checksums, but the small PAR2 file that contains the file and block MD5 checksums is written in packetized form, with multiple copies of the packets distributed through the file, and each packet is "wrapped" in a header containing the MD5 of the packet itself, which verifies that its contents haven't been altered. So it is easy for a PAR2 client to confirm a packet is undamaged and, if not, to find a good copy of the packet with the correct information elsewhere in the file.
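
    To illustrate that self-verifying layout, here is a rough Python sketch of checking a single packet header, based on my reading of the published PAR2 specification (treat the exact offsets as an assumption and check them against the spec before relying on this): the stored MD5 covers everything from the recovery-set ID to the end of the packet.

    Code:
    import hashlib
    import struct

    MAGIC = b"PAR2\x00PKT"  # every PAR2 packet starts with this 8-byte magic

    def packet_is_intact(data, offset=0):
        """Verify the MD5 stored in one PAR2 packet header."""
        if data[offset:offset + 8] != MAGIC:
            return False
        (length,) = struct.unpack_from("<Q", data, offset + 8)  # whole-packet length
        stored_md5 = data[offset + 16:offset + 32]              # MD5 over set ID + type + body
        body = data[offset + 32:offset + length]                # recovery-set ID onward
        return hashlib.md5(body).digest() == stored_md5

    A client that finds a damaged packet this way can simply scan ahead for the next magic string, since duplicate copies of the critical packets are scattered through the file.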

    The facilities you've listed for Hermetic File Monitor could be coded pretty easily if you have basic MD5 code. It's something of a surprise that they'd consider charging for a tool like that which doesn't even repair the files it detects as corrupt...

    If anyone wants clarification or discussion about PAR2 and its best use, feel free to PM me; I had a long technical email correspondence with the author of QuickPar some years ago, who assisted me in writing my own PAR2 client.

  7. #7
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    The problem I see with PAR (and similar tools) is that it creates a file that is not usable by the OS until you unwrap it, plus the additional space for the RS data. That basically means you will on average be using 1.2-1.5x the amount of disk space, which is dead space for online applications. It's not bad for offline (backup) purposes, but for online use it's a killer. If this were implemented in the file system itself (i.e. like what ZFS is doing, though that has other issues) it would be transparent to the user (and to applications).

    In my case, if it were hooked into, say, tar or other small tape backup utilities (in the stream) that would be great, but I haven't seen any utilities that do that.


  8. #8
    Xtreme Member
    Join Date
    Jan 2007
    Location
    Dorset, UK
    Posts
    439
    I'd agree that PAR2 in its current implementation would not be a perfect solution for continual OS-level use. The reason is that every time a file is changed, even by a single byte, the small PAR2 file would have to be rebuilt from scratch with the checksums of every file in the folder (not just the one that had changed), and then, even more wastefully in terms of processor usage and time, any repair files would have to be recreated from the entire folder set. The disk overhead and CPU usage would be continual and enormous under light, normal file activity, let alone any active file processing.

    But for archived files, where you rarely change the contents of folders and simply want to guarantee their long-term integrity, PAR2 is ideal. It's particularly handy for filling any spare capacity on archived DVDs with repair files so that you can guard against bit-faults due to "fade" over time. What hasn't yet been done I guess is hooking up a repair facility to an automatic timed mechanism for checking the contents to remove the need for manual checking/repair. Hardly difficult. QuickPar was designed from the outset for manual use with PARed Usenet files, which is why it never had such automatic facilities built in.

    That basically means that you will be on average using up to 1.2 - 1.5x the amount of dead disk space for online applications.
    I'm not sure where you get that excessive figure from. You can create as many or as few repair blocks as you need. A repair set can contain as many as 32768 virtual blocks (that's the limitation in the maths), so the size of the smallest repair file is 1/32768 (0.003%) of the total fileset size plus the minuscule packet overhead. That would give you one repair block, which can be reused as many times as you need whenever a bit error (EDIT: or multiple errors in a single virtual block) occurs in the fileset. More redundancy (just like RAID 6 over RAID 5) means more repair blocks, able to repair more simultaneous bit errors (EDIT: where they occur in different virtual fileset blocks). But 0.003% is a LONG way from 20% to 50%; that is a HUGE amount of redundancy which bears no relation to the likelihood of bit errors in disk storage. Even for Usenet, where some servers were poor at propagating without errors, 10% PAR2 cover was generous back when I was doing that...
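
    To put rough numbers on that, a quick back-of-envelope helper (the 4.7 GB fileset size is just an example):

    Code:
    def par2_overhead(fileset_bytes, source_blocks=32768, recovery_blocks=1):
        """Rough PAR2 overhead estimate, ignoring the small per-packet header cost."""
        block_size = fileset_bytes / source_blocks       # minimum virtual block size
        recovery_bytes = recovery_blocks * block_size
        return block_size, 100.0 * recovery_bytes / fileset_bytes

    # A 4.7 GB fileset split into the maximum 32768 source blocks, one recovery block:
    size, pct = par2_overhead(4_700_000_000)
    print(f"block ~{size / 1024:.0f} KiB, overhead ~{pct:.3f}%")  # ~140 KiB, ~0.003%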
    Last edited by IanB; 01-05-2009 at 11:54 AM. Reason: Clarified relation of errors to repair blocks, it is NOT one block per bit error, it depends on where they occur in the set.

  9. #9
    Xtreme CCIE
    Join Date
    Dec 2004
    Location
    Atlanta, GA
    Posts
    3,842
    I think the whole procedure can - for home users - generally be summed up as: make backups of your stuff.

    The bit errors don't happen that often (and catastrophic bit errors even less frequently). If you do have a catastrophic bit error issue you're going to have to restore from backup regardless, and the chances of both the original and the backup data being corrupted (assuming they were initially OK) are so low it's not even funny.

    Backups to something like a DVD (or, now, a Blu-ray disc) are obviously highly recommended, as aside from disc rot you shouldn't have issues with bits flipping or anything. With proper storage they'll last far longer than the useful life of your data.

  10. #10
    Xtreme Addict
    Join Date
    Mar 2008
    Posts
    1,163
    Quote Originally Posted by IanB View Post
    But for archived files, where you rarely change the contents of folders and simply want to guarantee their long-term integrity, PAR2 is ideal.
    No, it's not ideal. Way too many files make your backup directory cluttered. I really dislike PAR for generating external checksum files...at least 2 for each file. I wish somebody would integrate ECC into DAR...
    Actually I'm going to ask the author of FreeArc to integrate DAR features into his archiver; that would be the best option. But not yet, he has many more important things to do first.

    Quote Originally Posted by IanB View Post
    What hasn't yet been done I guess is hooking up a repair facility to an automatic timed mechanism for checking the contents to remove the need for manual checking/repair. Hardly difficult.
    I doubt that nobody has done it, because it's simple and useful. It's just that nobody publishes simple administrative scripts.
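
    In that spirit, here is the bare-bones kind of script being talked about: verify on a schedule and only make noise when something is wrong. The paths, the weekly interval and the use of par2cmdline are all assumptions, just to show the shape of it:

    Code:
    import subprocess
    import time

    PAR2_SETS = [r"D:\archive\photos.par2", r"D:\archive\docs.par2"]  # placeholder paths
    INTERVAL = 7 * 24 * 3600  # check weekly

    while True:
        for par2_file in PAR2_SETS:
            # par2cmdline exits non-zero when the protected files need repair
            result = subprocess.run(["par2", "verify", par2_file])
            if result.returncode != 0:
                print(f"Damage detected in {par2_file}, attempting repair...")
                subprocess.run(["par2", "repair", par2_file])
        time.sleep(INTERVAL)

    The sleep loop could just as easily be replaced by a cron job or a Windows Task Scheduler entry.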

  11. #11
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Quote Originally Posted by IanB View Post
    I'm not sure where you get that excessive figure from. You can create as many or as few repair blocks as you need. A repair set can contain as many as 32768 virtual blocks (that's the limitation in the maths), so the size of the smallest repair file is 1/32768 (0.003%) of the total fileset size plus the minuscule packet overhead. That would give you one repair block, which can be reused as many times as you need whenever a bit error (EDIT: or multiple errors in a single virtual block) occurs in the fileset. More redundancy (just like RAID 6 over RAID 5) means more repair blocks, able to repair more simultaneous bit errors (EDIT: where they occur in different virtual fileset blocks). But 0.003% is a LONG way from 20% to 50%; that is a HUGE amount of redundancy which bears no relation to the likelihood of bit errors in disk storage. Even for Usenet, where some servers were poor at propagating without errors, 10% PAR2 cover was generous back when I was doing that...
    That is just a rough figure of 10-50% overhead depending on the block size used, and since you are encoding the original file (i.e. the file is NOT usable by the application while it's encoded) you need to keep both the original file for use PLUS the RS-protected copy.

    The overhead comes from the block size needed to protect a file: you have to have a granularity equal to your sector size on disk. With anything larger than that, given that errors tend to hit consecutive sectors or ones in close proximity, you lose more information than you will be able to recover.

    The most common type of RS is 255/223 (where you take 223 input symbols/bytes and encode them into 255 output symbols/bytes, i.e. roughly 1 bit of parity per data byte). The issue comes up with larger groups (i.e. block sizes greater than the sector size, the sector being the atomic unit that gets damaged, or 512 bytes): you would effectively lose more information than you could reconstruct from the remaining data. I'm not saying it's bad for everything (definitely better than nothing), but with the types of disk errors that occur and their spatial clustering you have to use very small block sizes.
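
    For reference, the arithmetic behind the 255/223 figure (this is just the standard RS(255,223) parameters, not specific to any one tool):

    Code:
    # RS(255, 223): 223 data symbols are encoded into 255 output symbols.
    n, k = 255, 223
    parity = n - k                # 32 parity symbols per codeword
    overhead = parity / k         # ~0.14 extra bytes per data byte (~1.1 bits/byte)
    print(f"{parity} parity symbols, ~{overhead * 8:.1f} bits of parity per data byte")
    print(f"corrects up to {parity // 2} symbol errors per codeword at unknown positions")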


  12. #12
    Xtreme Member
    Join Date
    Jan 2007
    Location
    Dorset, UK
    Posts
    439
    Steve, I didn't understand over half of that reply. Could you translate it from geek into English?

    Basically, smaller blocks (repair blocks) are better; they are more efficient for storage purposes. In the other thread I replied to your post about errors being spatially aggregated, i.e. more likely to occur close to each other physically on the disk surface. The question that needs answering (which I guess you were trying to do in the post above, but it completely went over my head with the terminology) is how often errors occur simultaneously and how close together physically they are when they do. That might determine your block size, but I think for most intents and purposes it's not helpful.

    If you have two bit errors (or small clusters) at widely separated points in the file, then you need two repair blocks, as those errors will occur in different virtual blocks. But for small clusters of bit errors you only need small repair blocks. You don't need huge blocks equivalent to sector sizes, unless you are forced to have blocks that size because the fileset size is very large.

    Repairing PAR2 is a tradeoff. If you have large blocks, you do less maths (XORing all the block data together sequentially) because there are fewer blocks, so the process is quicker, but you have terrible granularity: a repair block of many MB is extremely wasteful for a single bit error. But that single large repair block could repair many, many simultaneous bit errors in a small region of the file (or replace an entire bad sector, say, in your hypothesis that these errors cluster), so it may possibly be better for your application. For most people, though, smaller blocks win out even with the slower processing, because if you are only expecting an occasional random bit error on a reasonably reliable medium then you only need minimal repair capability.

    A small repair block gives extremely fine granularity that means you could repair, say, 10 simultaneous bit errors spread randomly through the file with the same amount of repair data compared to a block 10 times the size (poor granularity) that could only repair one. To repair 10 widespread bit errors with the larger blocks would require 10 times as much storage "wasted". For the same amount of data in small blocks you could have repaired a hundred simultaneous errors! Large blocks just don't make much sense if the errors occur infrequently, however close they cluster.
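
    That trade-off is easy to make concrete with purely illustrative numbers:

    Code:
    repair_budget = 10 * 1024 * 1024   # say 10 MiB of repair data in both cases

    small_block = 64 * 1024            # 64 KiB blocks: fine granularity
    large_block = 10 * 1024 * 1024     # 10 MiB blocks: coarse granularity

    # Each widely scattered bit error lands in a different virtual block, and each
    # damaged block costs one repair block of the chosen size to fix.
    print("scattered errors repairable, small blocks:", repair_budget // small_block)  # 160
    print("scattered errors repairable, large blocks:", repair_budget // large_block)  # 1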

    Remember PAR2 was a functional improvement on PAR which had granularity of the worst possible level, where a repair file had to be the size of the largest file in the set, extremely wasteful. PAR2 was all about enabling fine granularity so that only very small amounts of repair data needed to be transmitted and stored along with Usenet source files to cover large numbers of simultaneous data dropouts in the source. At a granular level, for maximum redundancy, you want smaller blocks as the same amount of repair data can repair more simultaneous errors where they occur randomly in a set. However, what you are confusing a little, perhaps, is the reliability problem not of the storage medium (the tendency of the magnetic surface of the disk to flip bits) but the storage device, which can of course catastrophically fail mechanically or electronically.

    Where this gets interesting is that, if you think about it, RAID5/6 are equivalent to the most inefficient old PAR-type redundancy, where a single parity source has to be the size of the largest disk in the set. For a single bit error on a disk in the set, you need an entire spare disk of parity to correct it! And two if two errors occur on different disks! Yet of course, we rationalise that extreme redundancy by saying that this covers us if an entire disk fails, not just a single bit flip, which is the more likely hardware event. Having two layers of granularity, the extremely low to guard against hardware failure that can remove large chunks of data instantly, and the very high to repair small errors efficiently, seems to be a good common-sense compromise. It's just a matter of doing the latter elegantly...
    Last edited by IanB; 01-06-2009 at 10:33 AM.

  13. #13
    Xtreme Member
    Join Date
    Jan 2007
    Location
    Dorset, UK
    Posts
    439
    Quote Originally Posted by m^2 View Post
    No, it's not ideal. Way too many files make your backup directory cluttered. I really dislike PAR for generating external checksum files...at least 2 for each file. I wish somebody would integrate ECC into DAR...
    Actually I'm going to ask the author of FreeArc to integrate DAR features into his archiver; that would be the best option.
    What is DAR? I've never heard of it.

    An external checksum can be stored somewhere else safely, which is more secure for verifying the data it covers. One of the problems with QuickPar as a specific implementation of PAR2, not the ONLY implementation, is that because the author decided to remove folder-traversing features that the PAR2 specification allows, everything has to be run from and stored in the single folder where the source files are.

    If you have full path information stored in your PAR2 (checksum) files, there is no need for those files to be stored in the same folder any more, so no more "clutter", there can be a single central repository of verification/repair information stored anywhere you like. That's an implementation issue, not a problem with PAR2 as a standard.

  14. #14
    Xtreme Addict
    Join Date
    Mar 2008
    Posts
    1,163
    DAR is a backup program. Probably the best that works on the filesystem level, though not really great. I don't feel that I need to protect my whole filesystem, but backups - definitely.
    Last edited by m^2; 01-06-2009 at 03:27 AM.

  15. #15
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Strange I'm not getting mail updates again for this forum, oh well.

    @IanB: Basically I was trying to indicate that, from the data seen in current research papers, errors in media, when they occur, ARE NOT widely separated at all but are very localized. With enterprise drives errors are usually consecutive (i.e. adjacent sectors), so this would be a problem for repair algorithms that do not have the granularity to recover from that. E.g., say you have a file on a single disk (no RAID) that covers sectors 0-1000 (500 KiB), but you find that sectors 30, 31 and 32 are all bad. If you're blocking, you will have up to 3 errors in the same RS block that you need to repair. With nearline drives the issue is also spatial but spread over up to a couple of tracks (~10 MB locality), so you're a bit better off; however, since the errors are concentrated in one location you have a higher probability of having more errors per block (i.e. it is NOT a pure random distribution).

    Now with RAID it's a bit more complex, as you have another layer of abstraction which may help some (adjacent disk sectors may or may not be adjacent file data sectors). But once you also throw in the various unreported error types, the statistical calculations become complex (i.e. it may increase the likelihood of non-recovery for some types of errors).

    @m^2: DAR looks interesting. It's a file-level backup, not a file-system backup: a file-level backup traverses the directory/lookup trees of a file system (i.e. it resolves names, for example), whereas a file-system backup does not; it backs up a device by inodes without regard to what is in them or what they are. Both are a step above a raw backup (of the device itself). Ideally, having both file and file-system backups is good. A file-level backup lets you restore a particular file or directory; a file-system backup is good for restoring everything, even items that a parser may not like (files with names that need special handling, unprintable characters, multi-byte characters, or whatnot) which are legal for the file system but cause problems with user-space tools that are not coded properly for the file system at hand (tar was written before many of today's file systems existed, for example, and there are some, like JFS, where ANY UTF-16 character is legal except the NULL byte). They've patched it up over the years, but most tools are like that. It would be interesting to play with it and see what issues it has in different scenarios (crinkled tape, file names, et al).


  16. #16
    I am Xtreme
    Join Date
    Jan 2006
    Location
    Australia! :)
    Posts
    6,096
    Quote Originally Posted by Speederlander View Post
    I've seen several posts on this topic lately. I was thinking it might be useful for people to post up sources and resources related to this topic. Specifically applications (any OS) and techniques.

    I have two to start off:

    Quickpar for creating file parity recovery info (Windows only - FREE):
    http://www.quickpar.org.uk/


    Hermetic File Monitor for monitoring for changes in files (Windows - PAID):

    http://www.hermetic.ch/solo/hfm.htm






    Good technical discussion here:
    http://www.xtremesystems.org/forums/...d.php?t=212417

    More discussion here:
    http://www.xtremesystems.org/forums/...d.php?t=213228
    Is QuickPar a hash-creating program?

    btw - this soooo should be stickied!


  17. #17
    Xtreme Mentor
    Join Date
    Sep 2006
    Posts
    3,246
    Quote Originally Posted by tiro_uspsss View Post
    is quickpar a hash creating prog?

    btw - this soooo should be stickied!

    It uses Reed-Solomon error correction. You can read up on it here:
    http://en.wikipedia.org/wiki/Reed-So...ror_correction

    More links on the Parchive format here:
    http://en.wikipedia.org/wiki/Parchive


    NOTE: For very large files (tens of GB), QuickPar can only be used for integrity checks, not for creating recovery data. For those I use WinRAR to break them into 700 MB volumes and create parity data for 10 volumes at a time. You can get 20% coverage that way. If you break them into groups of 5 you can get 45% coverage. If you are really paranoid you can get 100% coverage by narrowing it down further. Whatever works for you.
    Last edited by Speederlander; 01-18-2009 at 09:33 AM.

  18. #18
    Xtreme Addict
    Join Date
    Mar 2008
    Posts
    1,163
    Quote Originally Posted by Speederlander View Post
    NOTE: For very large files (tens of GB), QuickPar can only be used for integrity checks, not for creating recovery data. For those I use WinRAR to break them into 700 MB volumes and create parity data for 10 volumes at a time. You can get 20% coverage that way. If you break them into groups of 5 you can get 45% coverage. If you are really paranoid you can get 100% coverage by narrowing it down further. Whatever works for you.
    Could you say something more about it?
    I've seen something in the par documentation, but it's rather cryptic.

  19. #19
    Xtreme Mentor
    Join Date
    Sep 2006
    Posts
    3,246
    Anyone feel free to correct me.

    The source block count indicates how fine-grained a breakdown of the source data you take. The recovery block count indicates the coverage of your recovery data. Note below that 100% redundancy (equal source block and recovery block counts) yields a 1:1 correspondence between the two numbers.
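
    In other words (illustrative numbers only):

    Code:
    file_size = 8_000_000_000    # an 8 GB file, for example
    source_blocks = 2000         # how finely the source data is sliced
    recovery_blocks = 400        # how many repair blocks you generate

    block_size = file_size / source_blocks
    redundancy = 100.0 * recovery_blocks / source_blocks
    print(f"block size ~{block_size / 1e6:.0f} MB, redundancy {redundancy:.0f}%")  # ~4 MB, 20%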

    There's a problem with very large files that will cause quickpar to choke. As the file gets bigger it hits an upper limit. Hence for valuable files I break them up with winrar and then add recovery data to smaller groups.
    Attached Thumbnails: QP1.jpg, QP2.jpg, QP3.jpg
    Last edited by Speederlander; 01-18-2009 at 11:37 AM.

  20. #20
    Xtreme Addict
    Join Date
    Mar 2008
    Posts
    1,163
    So it's a limitation of QuickPar only, not of .par2 files?

  21. #21
    Xtreme Mentor
    Join Date
    Sep 2006
    Posts
    3,246
    Quote Originally Posted by m^2 View Post
    So it's a limitation of only QuickPar, not .par2 files?
    Not sure. Several people have pointed it out on the quickpar forum so the developer(s) know about it. I find it's a minor limitation really, because I use quickpar with recovery on huge files mostly when I am going to archive them or do secondary back-ups. Besides, quickpar seems to be the only game in town for windows with this level of functionality. Plus it's free. So as long as I have a work-around that gets me data security I'm good.

    My process for large files is:
    1. Create consistency check data with quickpar on the original file.
    2. Create rar'd version on secondary back-up location.
    3. Add consistency and recovery information with quickpar to rar'd files.

    I can verify the original is good, verify the back-up is good, and recover the back-up if it becomes damaged. I can then verify the consistency of the reconstituted file with the original consistency check data back on the source.
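
    A rough way to script that three-step process, assuming WinRAR's command-line rar and par2cmdline are both installed and on the PATH (the file names, volume size and redundancy figures are placeholders):

    Code:
    import glob
    import os
    import subprocess

    SOURCE = "project_archive.vdi"   # placeholder large file
    BACKUP_DIR = r"E:\backup"        # placeholder secondary back-up location

    # 1. Consistency-check data for the original file.
    subprocess.run(["par2", "create", "-r1", SOURCE + ".par2", SOURCE], check=True)

    # 2. RAR the file into 700 MB volumes at the back-up location.
    rar_base = os.path.join(BACKUP_DIR, SOURCE + ".rar")
    subprocess.run(["rar", "a", "-v700m", rar_base, SOURCE], check=True)

    # 3. Add consistency and recovery data (20% here) covering the RAR volumes.
    volumes = sorted(glob.glob(os.path.join(BACKUP_DIR, SOURCE + "*.rar")))
    subprocess.run(["par2", "create", "-r20", os.path.join(BACKUP_DIR, SOURCE + ".par2")]
                   + volumes, check=True)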
    Last edited by Speederlander; 01-18-2009 at 12:44 PM.

  22. #22
    Xtreme Addict
    Join Date
    Mar 2008
    Posts
    1,163
    Quote Originally Posted by Speederlander View Post
    Not sure. Several people have pointed it out on the quickpar forum so the developer(s) know about it. I find it's a minor limitation really, because I use quickpar with recovery on huge files mostly when I am going to archive them or do secondary back-ups. Besides, quickpar seems to be the only game in town for windows with this level of functionality. Plus it's free. So as long as I have a work-around that gets me data security I'm good.

    My process for large files is:
    1. Create consistency check data with quickpar on the original file.
    2. Create rar'd version on secondary back-up location.
    3. Add consistency and recovery information with quickpar to rar'd files.

    I can verify the original is good, verify the back-up is good, and recover the back-up if it becomes damaged. I can then verify the consistency of the reconstituted file with the original consistency check data back on the source.
    I do all my backups with batch scripts, so QuickPar is almost useless for me. PhPar does the same thing in a more convenient way and is faster.
    I don't use it yet; I still haven't decided whether to start or to wait for FreeArc - it supports ECC, but it's about a year away from being really stable.

    http://sourceforge.net/docman/displa...group_id=30568
    It seems there are 128 bits for practically everything in PAR2, so the file spec shouldn't limit anything.

    BTW, I see that QuickPar does it, but IMO it's incorrect to say "100% coverage" when your parity equals your data size. It's really 50% (actually slightly less), because you may have errors in the parity too, and you might need to correct your parity in order to be able to correct the main data.

  23. #23
    Xtreme Mentor
    Join Date
    Sep 2006
    Posts
    3,246
    Quote Originally Posted by m^2 View Post
    I do all my backups with batch scripts, so QuickPar is almost useless for me. PhPar does the same thing in a more convenient way and is faster.
    I don't use it yet; I still haven't decided whether to start or to wait for FreeArc - it supports ECC, but it's about a year away from being really stable.

    http://sourceforge.net/docman/displa...group_id=30568
    It seems there are 128 bits for practically everything in PAR2, so the file spec shouldn't limit anything.

    BTW, I see that QuickPar does it, but IMO it's incorrect to say "100% coverage" when your parity equals your data size. It's really 50% (actually slightly less), because you may have errors in the parity too, and you might need to correct your parity in order to be able to correct the main data.
    Yeah, I guess I figured it was kind of understood that the parity data could be damaged too. So yes, it's imperfect, but short of losing both the source and the secondary back-up simultaneously, you should be covered, even with 20% parity or less. The primary thing I am after is random bit error recovery; I keep two copies of everything to handle catastrophic failures.

  24. #24
    Xtreme Addict
    Join Date
    Mar 2008
    Posts
    1,163
    Yeah, I don't think you really need a lot of it.
    I was using SQX, with 5% recovery (more for smaller files), before I discovered that its ECC doesn't work (fortunately I found out during tests); I think 5% is enough.
    I also keep multiple copies of the most critical things; it's definitely the best thing you can do to protect your data. And it's the best use for DropBox.
    Last edited by m^2; 01-18-2009 at 01:19 PM.

  25. #25
    Xtreme Member
    Join Date
    Jan 2007
    Location
    Dorset, UK
    Posts
    439
    Quote Originally Posted by Speederlander View Post
    NOTE: For very large files (tens of GB), QuickPar can only be used for integrity checks, not for creating recovery data.
    Quote Originally Posted by Speederlander View Post
    There's a problem with very large files that will cause quickpar to choke. As the file gets bigger it hits an upper limit.
    Hmm, gotta say that sounds like a QuickPar bug, not a PAR2 bug. There are limitations in PAR2, but filesize definitely isn't one of them. Everything is aligned to 8 bytes or even 16 IIRC, ie. 64 bits at least for filesizes. The only real problem there is that as the fileset gets bigger, the minimum size of the virtual/repair blocks increases too, as the entire set has to be split into a maximum of 32768 virtual blocks.

    I've been AWOL from the QuickPar forum for a looooong while now, sadly; maybe I need to catch up there. Gotta say I'm not unhappy that you guys have piqued my interest in all this again and made me think about picking up some relevant unfinished projects...

