
Thread: bit error = lost file, reduce BER how?

  1. #1
    I am Xtreme
    Join Date
    Jan 2006
    Location
    Australia! :)
    Posts
    6,096

    bit error = lost file, reduce BER how?

    I was reading a thread over at AVS Forum a couple of days ago; a fellow there started off with a 48TB media server (now he's at 96TB), and there was mention of bit errors. This has happened to me once: I was transferring a butt-load of files via LAN and lost a handful of sentimental pics & an ISO.

    Now I know 'a' measure one can take: ECC RAM - but I have 2 questions that arise from this:

    How far does ECC RAM go towards preventing bit errors (BER = Bit Error Rate)? Does it completely get rid of them, so to speak?

    Is it possible to completely eradicate any chance of bit errors? (I'm guessing not )

    oh & where's stevecs for this?
    DNA = Design Not Accident
    DNA = Darwin Not Accurate

    heatware / ebay
    HARDWARE I only own Xeons, Extreme Editions & Lian Li's
    https://prism-break.org/

  2. #2
    Xtreme Mentor
    Join Date
    Sep 2006
    Posts
    3,246
    ECC greatly reduces/prevents bit errors in data that is active in RAM. This is especially important for large RAM drives, and was one of the main reasons I went to a server-class motherboard. My data was more important to me than an overclock giving a performance improvement I could barely feel.

    Parity RAIDs let you run a consistency check on the stored information using the parity data. At least the good hardware controllers enable this; not sure about mobo parity RAIDs.

    Some (most?) SSDs seem to have some sort of ECC feature. My mtrons do. But I keep my critical data on my RAID array with scheduled consistency checks.

    If you copy lots of data from one location to another, you could try Acronis. I think it may have a verification feature for checking the integrity of the copy.

    Worst case, you could use QuickPar to create parity data for individual files.
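    A sketch of that last approach using par2cmdline, the command-line cousin of QuickPar (the file names here are hypothetical, and the exact flags may vary by version):

    ```shell
    # create ~10% recovery data alongside the files
    par2 create -r10 photos.par2 *.jpg

    # later: detect silent corruption in any of the protected files
    par2 verify photos.par2

    # if verify reports damage, rebuild the bad files from the parity blocks
    par2 repair photos.par2
    ```

    The trade-off is disk space for the recovery volumes versus how many corrupt blocks you can repair without touching a backup.
    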
    Last edited by Speederlander; 12-27-2008 at 07:19 PM.

  3. #3
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    here.

    Each subsystem has its own error rate: on the order of 1 in 10^12 bits for ECC RAM; cables also have their own (roughly the same, as SATA/SAS/SCSI use a parity/checksum algorithm); and then the drives. You can't really test one without testing the full system (every component has an error rate associated with it, and you go through those components multiple times). What I do is an MD5/SHA1 check of data files on the systems here and compare them to the initial values. If there is a mismatch I compare again as a tie-breaker (i.e., to rule out an error that occurred not in the data on the media itself but somewhere along the path).
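    A minimal sketch of that hash-manifest workflow (Python here for illustration; the function and file names are made up, not from stevecs' actual script):

    ```python
    import hashlib
    import json
    import os

    def file_sha1(path):
        # stream the file so large media files don't need to fit in RAM
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def build_manifest(root, out="manifest.json"):
        # record a baseline hash for every file under `root`
        manifest = {}
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                manifest[path] = file_sha1(path)
        with open(out, "w") as f:
            json.dump(manifest, f, indent=1)

    def verify(manifest_path="manifest.json"):
        with open(manifest_path) as f:
            manifest = json.load(f)
        for path, expected in manifest.items():
            if file_sha1(path) != expected:
                # tie-breaker: re-read; a transient in-transit error should
                # vanish on the second pass, on-media corruption will not
                if file_sha1(path) != expected:
                    print("CORRUPT:", path)
                else:
                    print("transient read error:", path)
    ```

    Run `build_manifest()` once when the data is known-good, then `verify()` on a schedule; the interval between checks is your window for when a corruption occurred.
    
    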

    Data errors are frequent. I see real-world bit errors (corruption in files) at a rate of about 1 file per month per ~20TiB of storage at home (which I monitor much more closely than work). At work I see numbers comparable to this once you scale up the storage size, though I am only able to monitor a small fraction of the ~6PiB of DASD we have deployed there. From what I've been able to gather from other sites (CERN/Fermilab), both have found the same errors, plus substantial multi-bit errors in ECC memory (ECC can correct a single-bit error but not multi-bit errors; those it can only report). Which is disconcerting.

    There is no real easy way to eliminate it, and the bigger problem is that hardly anyone (I can probably count them on one hand) is actively monitoring this on production systems.

    At this point, to mitigate the issues with current hardware you can do several things:
    - Construct the array with enough redundancy (and using the right drives) to have a lower probability of read errors. I posted a spreadsheet in the main sticky here that can help with that. Right now I would say ignore the 1.5TB drives, as their BER ratings suck.
    - At the hardware subsystem level, use RAID-6. RAID-10 is faster for IOPS but has no parity block checking, i.e. no means to determine which copy is right or wrong; likewise the parity RAIDs (3/4/5) have the same problem. RAID-6 has 3 checks (one raw data, two parities), which allows a tie-break to correct data. Run full checks often (weekly).
    - Create and maintain a hash (MD5/SHA1/et al.) of your data files and compare them to what's on the drive often; this gives you your window as to /when/ a corruption occurred (between check intervals).
    - Have an up-to-date backup strategy so you can restore files that get corrupted. Remember to check tapes as well, as the same corruption occurs there, and keep multiple copies.
    - If available, try ZFS with read-checking enabled (though beware of its limitations; it's not good for all cases).
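    For the ZFS option, the periodic full check is a scrub (a sketch assuming a pool named "tank"; schedule it with cron the same way you would a RAID consistency check):

    ```shell
    # read every allocated block in the pool and verify its checksum,
    # repairing from redundancy where possible
    zpool scrub tank

    # show scrub progress, error counts, and any files found damaged
    zpool status -v tank
    ```
    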


    All in all, this is the next big 'iceberg' we're going to be facing in computers, I think. Mainly because so many people don't even think about it, and with the amount of data present today and its growth rate, everyone is assured of having some corrupted data on their system. With no-one looking, no-one knows what that is until it's too late.

    I haven't been over at AVS for a long while; should probably go over there. I just did a fast search there, though, and if it's the same thread, he's making 12-drive-wide RAID-6s with 1TB drives (10+2 RAID-6 sets bunched into a 6+0 striping model). That would give him ~72TiB usable space (assuming no hot spares) and a probability of not reading all sectors in the resultant array of ~9%. Though I don't think he has much storage knowledge yet: for arrays of that size/complexity you want to use a volume manager (LVM/Veritas/et al.). Plus, getting more than say 16 drives off a single RAID controller today will max you out on performance, so if he's looking for that (he mentions iSCSI, which would imply performance requirements) he should have several cards (ideally one RAID card for each external array of 16 drives max). And there's no mention of any clearly defined target goals; when dropping that kind of cash (not even mentioning the ~15K for a tape backup solution for it) you should have something defined as to what you're looking to accomplish. But that's a whole architecture discussion.

    |.Server/Storage System.............|.Gaming/Work System..............................|.Sundry...... ............|
    |.Supermico X8DTH-6f................|.Asus Z9PE-D8 WS.................................|.HP LP3065 30"LCD Monitor.|
    |.(2) Xeon X5690....................|.2xE5-2643 v2....................................|.Mino lta magicolor 7450..|
    |.(192GB) Samsung PC10600 ECC.......|.2xEVGA nVidia GTX670 4GB........................|.Nikon coolscan 9000......|
    |.800W Redundant PSU................|.(8x8GB) Kingston DDR3-1600 ECC..................|.Quantum LTO-4HH..........|
    |.NEC Slimline DVD RW DL............|.Corsair AX1200..................................|........ .................|
    |.(..6) LSI 9200-8e HBAs............|.Lite-On iHBS112.................................|.Dell D820 Laptop.........|
    |.(..8) ST9300653SS (300GB) (RAID0).|.PA120.3, Apogee, MCW N&S bridge.................|...2.33Ghz; 8GB Ram;......|
    |.(112) ST2000DL003 (2TB) (RAIDZ2)..|.(1) Areca ARC1880ix-8 512MiB Cache..............|...DVDRW; 128GB SSD.......|
    |.(..2) ST9146803SS (146GB) (RAID-1)|.(8) Intel SSD 520 240GB (RAID6).................|...Ubuntu 12.04 64bit.....|
    |.Ubuntu 12.04 64bit Server.........|.Windows 7 x64 Pro...............................|............... ..........|

  4. #4
    Xtreme Mentor
    Join Date
    Sep 2006
    Posts
    3,246
    Quote Originally Posted by stevecs View Post
    - At the hardware subsystem use raid-6 (raid-10 is faster for IOPS but it has no parity block checking, i.e. no means to determine which copy is right or wrong. Likewise parity raids (3/4/5) have the same problem. raid-6 has 3 checks (one raw data, two parities) which allow for a tie break to correct data. Run full checks often (weekly).
    Interesting. So I was running RAID 6 and benefiting from this without realizing it. So only RAID 6 (and by extension 60) provides this benefit... makes sense given the tie-break requirement. Never considered that. And do most major hardware controllers implement this in such a way that the benefit is fully realized on consistency checks?

  5. #5
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Yes, but with a caveat: RAID-6 is CAPABLE of this, but it's up to the firmware writers to take advantage of it, and I highly doubt low-end cards do (mainly as I see most of those have a hard time just getting multiple requests working properly on RAID-1/10, which is much simpler). Unfortunately, independently testing which cards (or firmware versions) do this properly is very hard, and it's not generally published, so you have to escalate it to the company that makes the card. You could test it yourself: create a small array, turn it off, pull out the drives, read in each stripe for an entire width, then inject a bit error in each area (one at a time: the data, the P parity, the Q parity). Between each manual insertion, turn the array back on and have it do a check; it should find and correct the error. Once you see that, do a dual-bit error (say both P & Q) and it should correct the data in that case too, if the card is coded with that logic.

    (And just to point out, though I've never seen this done except in lab tests: you CAN do this with mirrors, as you are technically NOT limited to just 2 disks. You can have a 3-way mirror, or, like my desk play-box here, a 12-way mirror (don't ask). As long as you have something to break the 'tie' it works, and it can get more complex depending on how many errors you want to recover from (e.g. 3 out of 5 match, or whatever).)

    Also remember that this is subsystem checking, NOT file data checking, and it only comes into play (with RAID cards) when you run a RAID set check. On reads there are NO cards or software (besides ZFS, if you turn its read checking on) that check any type of integrity (file or block). This is what can cause problems where data integrity is important: you read a block (and since it's not parity- or hash-checked on reads, it could be wrong/corrupted), you act on that data (maybe changing some other part of the block or data set), and then write it back. At this point you are writing back the BAD data that you read. The card will then calculate a NEW parity (for the bad data), thereby vetting that data at the subsystem level. It can be very insidious.

    I've been unsuccessful over the past two years in getting any RAID company to put in an option to do this (it's not hard; they're already doing it on a manual check, just do it on every block read instead). It WILL slow operations down to that of a partial stripe write, which is a hit, but doing it at this level makes it file-system- and OS-independent. Then you can add file and file-system-level integrity checks on top of that (even your suggestion of PAR archives, which is a good thought but not universal to deploy across the different OSes/file systems).


  6. #6
    Xtreme Mentor
    Join Date
    Sep 2006
    Posts
    3,246
    Quote Originally Posted by stevecs View Post
    Also remember that this is subsystem checking NOT file data checking and only comes into play (with raid cards) when you do a raid set check. In read mode there are NO cards or software (besides ZFS if you turn it on which is off by default) that check any type of integrity (file or block). This is what can cause problems where data integrity is important. You read a block (and since it's not parity or hash checked on reads it could be wrong/corrupted) you then act on that data (maybe changing some other part of the block or data set) and then write it back. At this point you are now writing back the BAD data that you read. The card will then calculate a NEW parity (for the bad data) and thereby vetting that data at the subsystem level. It can be very insidious.
    Well, if I understand this correctly (which I may not): the RAID subsystem check, if performed regularly, reduces the odds of carrying a bit error forward, but at the moment of any given read, if an error has occurred since the last consistency check, it will not be checked for, and you may reinforce it by writing the bad data (with resulting parity info) back to the array. Yes? Implying that more frequent checks are beneficial (beyond the wear and tear imposed on the drives). Or does this consistency check truly not validate data at the file level in a way that's useful for the topic under discussion? I would think it would have to, in order to fulfill its stated task...

    Quote Originally Posted by stevecs View Post
    Yes, but with a caveat, RAID-6 is CAPABLE of this, but it's up to the firmware writers to take advantage of it. Of which I highly doubt low-end cards do this (mainly as I see that most of those have hard time just getting the multiple requests working properly on raid-1/10 which is much simpler).
    1680ix?
    Last edited by Speederlander; 12-27-2008 at 10:02 PM.

  7. #7
    Xtreme Addict
    Join Date
    Mar 2008
    Posts
    1,163
    The corrections from RAID 6 don't sound right. Can you explain exactly how it works? Say you have 24 drives and dual parity, and exactly 1 bit is wrong. How do you find the disk with the flipped bit?

  8. #8
    Xtreme Member
    Join Date
    Feb 2008
    Location
    enteon@jabber.ccc.de
    Posts
    292
    Quote Originally Posted by m^2 View Post
    The corrections from RAID 6 don't sound right. Can you explain exactly how does it work? Best assume that you have 24 drives and dual parity. Exactly 1 bit is wrong. How to find the disk which has the flipped bit?
    i guess the controller checks every single bit across all 3 'versions' (HDDs ^^), and in case they don't match, the bit that 2 of the 3 'versions' agree on is deemed the right one.

    @stevecs: thank you for sharing your knowledge. you made me reconsider my whole backup system. does soft-raid add any significant risks for my data?
    and maybe i'll dump linux for the backup machine and use ZFS instead... can't be that hard, i think. can't wait for btrfs

  9. #9
    Xtreme Member
    Join Date
    Feb 2008
    Location
    enteon@jabber.ccc.de
    Posts
    292
    how about someone puts together a how-to on avoiding data corruption, and it gets made a sticky?

  10. #10
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    @speederlander: it doesn't reduce the odds; the error rate is not controlled by the check (carrying an umbrella won't change the odds of rain). Basically, what the RAID consistency check does is read each stripe of data and read the parity data. Most implementations will re-calculate the parity from the data and then compare it to the parity it read to see if it matches. If it does NOT match, it goes down a list to see what is wrong (does P match the data? does Q?) and then tries to correct based on (for RAID-6) which two agree. For RAID-3/4/5 your card vendor usually provides an option for checks (trust data over parity, or trust parity over data) where you tell the card which to use if it finds an error (a global setting).

    But what I was getting at in that quote snippet is that your RAID card has NO CONCEPT of what's in the sectors it's checking. The card's primary goal is to make sure they match, NOT that anything inside is correct (RAID is for drive availability, NOT integrity). Yes, you get some of it here with RAID-6: errors are generally random, so having two on the same stripe width is not a frequent occurrence, so you can repair the sectors it's dealing with, and by doing this regularly you reduce the probability of acquiring multiple errors on the same stripe width (thereby fixing them before it gets too bad). I do not believe any card manufacturer does multiple reads on a miscalculation; just as with my file hash tests, it's possible that the data read from the drive (without a hard error ever being reported) is wrong IN TRANSIT from the media, forcing the controller to go through corrections based on a false reading (more of a problem with RAID-3/4/5, as you have the coin-flip case of trusting one or the other implicitly).
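    The RAID-6 tie-break described above can be sketched numerically. This is a toy model of one stripe (single bytes standing in for whole blocks, using the GF(2^8) arithmetic common RAID-6 implementations use; the function names are mine, not from any real controller firmware):

    ```python
    def gf_mul(a, b):
        # multiply in GF(2^8) with the 0x11d polynomial used by typical RAID-6 code
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & 0x100:
                a ^= 0x11d
            b >>= 1
        return r

    def gf_pow2(n):
        # 2**n in GF(2^8); the per-drive generator coefficients
        r = 1
        for _ in range(n):
            r = gf_mul(r, 2)
        return r

    def make_pq(data):
        # P = XOR of all data bytes; Q = sum of g^i * d_i over GF(2^8)
        p = q = 0
        for i, d in enumerate(data):
            p ^= d
            q ^= gf_mul(gf_pow2(i), d)
        return p, q

    def locate_and_fix(data, p, q):
        # returns (drive_index, corrected_byte), or None if consistent
        p_now, q_now = make_pq(data)
        sp = p_now ^ p          # P syndrome
        sq = q_now ^ q          # Q syndrome
        if sp == 0 and sq == 0:
            return None
        # a single bad data byte at index j satisfies sq == g^j * sp,
        # so the two syndromes together point at WHICH drive lied
        for j in range(len(data)):
            if gf_mul(gf_pow2(j), sp) == sq:
                return j, data[j] ^ sp
        return None  # syndromes inconsistent: P/Q itself bad, or multiple errors
    ```

    With one parity (RAID-5) you only get `sp`, which says *something* is wrong but not *which* drive; the second, differently weighted parity is what turns detection into location.
    
    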

    This is why I do multiple levels of checks on data. RAID (consistency) only checks from a subsystem point of view (just as a drive's internal per-sector ECC only checks that sector); each component has no clue about a higher component's function or data. It does not know what data is there (or even IF data is there); it only tries to make (in the case of RAID-6) P, Q, and the raw data match. A file system does similar: it also does not care about the data in files; its only job is to make sure the structure of the file system is consistent (i.e. that a directory is a directory, or that an inode is not pointing to itself, et al.). When you have a file system error it can screw up your data, as the file system is trying to re-point inodes/clusters into a consistent structure, NOT preserve THE DATA IN THE FILES. The data in the files is actually left up to an application to verify, which few do. Your QuickPar is an example of one that does; some report the error (say, a PDF reader that can't read a file properly) but have no means to fix it. This is where and why you need to do manual checks (like hashes et al. at the application level) to find the error, and keep multiple backup sets to replace the file once an error is found.



    @m^2/enteon: Here's an article on Hamming codes: http://en.wikipedia.org/wiki/Hamming_code This is NOT what RAID 3/4/5/6 does, but it's similar in concept (it is actually what RAID 2 does, and ECC memory). You can see there how to determine the location of an error, and why you can correct a single-bit error but not a dual-bit error (you need more check bits to find multi-bit errors).
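    The "correct one bit, only detect two" behaviour can be sketched with a toy SECDED code: Hamming(7,4) plus one overall parity bit. This is a simplified illustration of the concept, not the exact code DIMMs use:

    ```python
    def encode(nibble):
        # codeword positions: 1=p1, 2=p2, 3=d0, 4=p4, 5=d1, 6=d2, 7=d3
        d = [(nibble >> i) & 1 for i in range(4)]
        bits = [0] * 8
        bits[3], bits[5], bits[6], bits[7] = d[0], d[1], d[2], d[3]
        bits[1] = bits[3] ^ bits[5] ^ bits[7]   # covers positions 3,5,7
        bits[2] = bits[3] ^ bits[6] ^ bits[7]   # covers positions 3,6,7
        bits[4] = bits[5] ^ bits[6] ^ bits[7]   # covers positions 5,6,7
        overall = 0
        for i in range(1, 8):
            overall ^= bits[i]                  # extra SECDED parity bit
        return bits[1:8] + [overall]

    def decode(cw):
        bits = [0] + cw[:7]
        overall = cw[7]
        s = 0
        for i in range(1, 8):
            if bits[i]:
                s ^= i          # syndrome = XOR of set positions; 0 if clean
        even = (sum(bits[1:8]) + overall) % 2 == 0
        if s == 0 and even:
            status = "ok"
        elif not even:
            if s:
                bits[s] ^= 1    # single-bit error: syndrome names the position
            status = "corrected"  # s == 0 means the extra parity bit itself flipped
        else:
            # syndrome set but overall parity balanced: two flips, unlocatable
            return None, "double error detected"
        nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
        return nibble, status
    ```

    One flipped bit leaves the overall parity odd and the syndrome pointing at the position; two flips cancel in the overall parity, so the decoder can only report, not repair.
    
    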

    @enteon: soft RAID and hardware RAID are not, in themselves, any different at this level of function. It comes down to the implementation of the code (hardware RAID is just microcode/firmware, so in essence /all/ of it is software at some level). The main reasons for hardware RAID are that the higher-end vendors generally have a staff focused only on the RAID function, so it is not distracted, and that it does not steal CPU cycles from your host application, which is very important for servers. As for ZFS: as I mentioned above, it's NOT a panacea, at least not yet. It has problems, can cause data corruption on its own in some cases, and has other caveats that may or may not apply to everyone. I still have it in a test phase at work, as it's not at a point where I think it's stable enough for enterprise use. It comes down to using the right tool for the job, and perhaps customizing solutions. This is a problem that has only started surfacing in the past 2-3 years, and so far it is not getting a lot of press, so there are not many tools in the woodshed to use against it yet.
    Last edited by stevecs; 12-28-2008 at 04:31 AM.


  11. #11
    Xtreme CCIE
    Join Date
    Dec 2004
    Location
    Atlanta, GA
    Posts
    3,842
    Steve,

    After a number of fruitless occasional searches, I finally re-found that post of yours from about 10 months ago where you posted a script for a simple MD5 check. I noticed you used the phrase "early version" as a preface to it - do you have a more sophisticated copy by any chance? I'm asking here rather than via PM because this thread seems like a good place for it.

    Oh, and if you don't mind, I'd like your permission to link to either the updated version or the old version in the sticky. I was looking at all the stuff I've received from you and your posts the other day, and I'm starting to think I should just make a link to a search for posts by your username .
    Dual CCIE (Route\Switch and Security) at your disposal. Have a Cisco-related or other network question? My PM box is always open.

    Xtreme Network:
    - Cisco 3560X-24P PoE Switch
    - Cisco ASA 5505 Firewall
    - Cisco 4402 Wireless LAN Controller
    - Cisco 3502i Access Point

  12. #12
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    wow, that's old. Funny you should ask. The version I run in production is much larger (back-end SQL database, and optimized mostly for Solaris rather than Linux). However, I have been bugged to come up with something self-contained/lighter for some Linux-based systems (and multi-core aware). I haven't dug into any coding, but I have the napkin in front of me with some ideas I was thinking over at lunch. I'll post it, or PM it to you for the sticky if you want, when I have it; or someone could write a statically compiled C version (which would be best, but I know I don't have the time to do that).

    If you want to link to posts, I have no problem; the only caveat is that most were in context with the threads they were in and may not generalize well in their current form. I've tried to put the generalized items in your main thread. If there's something that helps, feel free to steal it (I have absolutely no qualms about sharing code/info; it's how I learned. )


  13. #13
    Xtreme Mentor
    Join Date
    Sep 2006
    Posts
    3,246
    So really when it comes to the integrity of a specific file, the only thing that's available is quickpar?

  14. #14
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    And programs like it, such as rsbep (http://ttsiodras.googlepages.com/rsbep.html) and other home-grown tools. I haven't tried rsbep (just did a fast Google search on it a couple of minutes ago), but it actually looks interesting, though I don't know about the performance.


  15. #15
    Xtreme Mentor
    Join Date
    Sep 2006
    Posts
    3,246
    Quote Originally Posted by stevecs View Post
    And programs like it like rsbep http://ttsiodras.googlepages.com/rsbep.html and other home-grown tools. I haven't tried rsbep (just did a fast google search on it a couple mins ago) but actually looks interesting, though don't know about the performance.
    What do big outfits use? Large corporations, the military, government, etc. suffer the same issues. What are their solutions, if any?

  16. #16
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    None that I am aware of. No storage vendor (STK/Sun, EMC, et al.) nor RAID vendor supports it. No corporation I've seen implements any checking, let alone a recovery process, beyond restoring from tape/source (without checking whether that data is valid either), and then only when things are so corrupt as to cause noticeable problems (no-one notices an extra byte changing when datasets are so large). The only places I've found that have recognized the issue and attempted to fix it are large scientific circles like CERN, with their LHC project, and Fermilab. They are doing the same as I am here, but in their case, since they ARE the source of the data, their option is to delete/wipe any data between checks (i.e., if the check on day 1 was OK but the check on day 2 was NOT, wipe all data between day 1 and day 2, as you can't trust it). (It's finer-grained than that, but that's the concept.) It's why I liken it to the Titanic and icebergs: companies/storage admins are living mainly in ignorance, or are discounting the frequency of the issue.


  17. #17
    Xtreme Addict
    Join Date
    Mar 2008
    Posts
    1,163
    So the RAID controller just flips a bit on one drive, not necessarily the correct one? Then I don't see any significant advantage of RAID 6 over RAID 5 here.

    What happens if there's an error on a drive that its internal ECC can't correct? The drive returns an error message, and the RAID can use another drive to get the data, right?

    BTW, do you know anything about the chance of getting bit corruption when you have RAID+ECC memory?
    I mean an (undetected) error on the HDD / in memory / on cables etc.
    I guess that if the industry doesn't care, it's a very minor thing, but it would still be good to understand the problem.

  18. #18
    Xtreme Mentor
    Join Date
    Sep 2006
    Posts
    3,246
    Quote Originally Posted by m^2 View Post
    So RAID controller just flips a bit on one drive, not necessarily the correct one? Therefore I don't see any significant advantage of RAID 6 over RAID 5 here.

    What happens if there's error on the drive that internal ECC can't correct? Drive returns error message and RAID can use another drive to get the data, right?

    BTW do you know anything about chance of getting bit corruption in case of having RAID+ECC mem?
    I mean (undetected) error on HDD / in memory / cables etc.?
    I guess that if industry doesn't care, it's a very minor thing, but it would still be good to know the problem.
    I'll try:

    RAID 5, you have to tell it which bit to favor, parity or original data. This means no real protection other than preventing corrupt volumes.

    RAID 6, there are 3 bits and the odds of 2 bits being bad are small, therefore when the bad data is fixed, odds are good it really was the bad data and you are doing the equivalent of a coin toss.

    As far as RAM bit errors go, the original Corsair rule I recall is: 1 bit error occurs in 256MB of ram every month.
    4GB = 16 bit errors/month
    8GB = 32 bit errors/month
    16GB = 64 bit errors/month
    32GB = 128 bit errors/month

    However, other more recent sources maintain 1 bit error per gigabyte per month, which cuts those numbers by a factor of 4, i.e.:
    4GB = 4 bit errors/month
    8GB = 8 bit errors/month
    16GB = 16 bit errors/month
    32GB = 32 bit errors/month

    Which is more correct? My guess is that it's somewhere in between, governed by the quality of the RAM, amount of overclock beyond stated specs, the luck of the draw, and a host of other issues.
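    For reference, both tables are just capacity divided by the rule-of-thumb rate; a quick sketch of the arithmetic (only the two rules of thumb above, nothing more):

```shell
#!/bin/bash
# Both rules of thumb above, applied to common RAM sizes.
for gb in 4 8 16 32; do
  corsair=$(( gb * 1024 / 256 ))   # 1 error per 256MB per month
  recent=$gb                       # 1 error per GB per month
  echo "${gb}GB: ${corsair}/month (Corsair rule), ${recent}/month (1-per-GB rule)"
done
```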


    Anyone with more up to date or correct info please feel free to correct me.
    Last edited by Speederlander; 12-29-2008 at 09:28 AM.

  19. #19
    Xtreme Addict
    Join Date
    Mar 2008
    Posts
    1,163
    Quote Originally Posted by Speederlander View Post
    I'll try:

    With RAID 5, when data and parity disagree you have to tell it which to favor, parity or original data - the equivalent of a coin toss. That means no real protection other than preventing corrupt volumes.

    With RAID 6, there are 3 pieces of information, and the odds of 2 of them being bad at once are small, so when the bad data is fixed, odds are good it really was the bad data.
    A 3-drive RAID 6 doesn't make sense. You most likely have 6-24 drives, and you have to choose one. The assumption that there's only 1 error is reasonable. Given that, the second parity lets you halve (?) the number of suspect drives, so you have an 8-33% chance of guessing correctly.

    Quote Originally Posted by Speederlander View Post
    As far as RAM bit errors go, the original Corsair rule I recall is: 1 bit error occurs in 256MB of ram every month.
    4GB = 16 bit errors/month
    8GB = 32 bit errors/month
    16GB = 64 bit errors/month
    32GB = 128 bit errors/month

    However, other more recent sources maintain 1 bit error per gigabyte per month, which cuts those numbers by a factor of 4, i.e.:
    4GB = 4 bit errors/month
    8GB = 8 bit errors/month
    16GB = 16 bit errors/month
    32GB = 32 bit errors/month

    Which is more correct? My guess is that it's somewhere in between, governed by the quality of the RAM, amount of overclock beyond stated specs, the luck of the draw, and a host of other issues.


    Anyone with more up to date or correct info please feel free to correct me.
    Thanks, I knew the Corsair data but not the other one. However, this applies to non-ECC memory. ECC has some error rate too - a double error is not correctable. One could take the data above and calculate the chance of a double error, though ECC memory has more chips per module - how many more depends on the implementation.

    Anyway, that's not what I was looking for - I'm curious what it looks like if you take the computer as a whole, add standard protection (RAID + ECC memory + ECC controller cache), pump data through it constantly for a year or two, and then verify the output. I didn't expect you to know of such data, but it doesn't hurt to ask.

  20. #20
    Xtreme Mentor
    Join Date
    Sep 2006
    Posts
    3,246
    Quote Originally Posted by m^2 View Post
    RAID 6 of 3 drives doesn't make sense.
    Where did I say it was only three drives?

  21. #21
    Xtreme Addict
    Join Date
    Mar 2008
    Posts
    1,163
    "RAID 6, there are 3 bits"
    What did you mean then?

  22. #22
    Xtreme Mentor
    Join Date
    Sep 2006
    Posts
    3,246
    Quote Originally Posted by m^2 View Post
    "RAID 6, there are 3 bits"
    What did you mean then?
    RAID 6 calculates two sets of parity information for each piece of data, so it has a tie-breaking chunk when two pieces don't agree. Hence 3, rather than the 2 you get with the other parity RAIDs. Steve explained that above.

  23. #23
    Xtreme Addict
    Join Date
    Mar 2008
    Posts
    1,163
    OK, I spent some time searching and now I understand how it is done. Actually, I didn't explain the problem well, and steve didn't answer it:
    Problem:
    24 drives. You have 24 bits. One is wrong. You need no fewer than ceiling(log(24))=5 redundant bits to identify the wrong one. //log is the binary logarithm
    You don't have that many, so you can't do any correction.

    However, there's a solution:
    Don't process individual bits, but blocks of them. If you do it on bytes, you have 24*8=192 bits per byte-stripe, and ceiling(log(192))=8 redundant bits *might* be enough (that's a rough and simple lower bound; the real need is likely higher. I think Shannon gave a precise answer, but I don't understand Wikipedia's math language). You have 16, which is probably enough. If it's not, make the block even bigger - stripe size is usually at least 4KB.

    Actually, with double the block size, RAID 5 could act as RAID 6 (with double stripe width) and offer error correction too! I guess nobody does it, though.
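    The ceiling(log2(n)) counts above can be sanity-checked with a quick sketch (a plain counting bound, not a real code construction):

```shell
#!/bin/bash
# Bits needed to name one bad unit among n: ceiling(log2(n)),
# i.e. the smallest b with 2^b >= n.
bits_needed() {
  local n=$1 b=0
  while [ $(( 1 << b )) -lt "$n" ]; do b=$(( b + 1 )); done
  echo "$b"
}
echo "locate 1 bad drive among 24: $(bits_needed 24) bits"
echo "locate 1 bad bit among 24*8=192: $(bits_needed 192) bits"
```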

    ADDED: I just realized that I still don't know what the 3 bits are?? Yeah, I reread steve's posts.
    Last edited by m^2; 12-29-2008 at 01:32 PM.

  24. #24
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Sorry for the delay here; for some reason I wasn't getting e-mail notifications.

    Speederlander is correct in post #18 above. The raw error rates are the same for ECC and non-ECC memory (a BER of about 10^-12 for the module), but that does not take usage and time into account. The most recent study I saw was from CERN (~1300 nodes over 3 months) with an average network I/O rate of about 800MB/s continuous - roughly 6 memory read/write operations for each bit (read NIC -> write kernel mem -> read kernel mem -> write application mem -> read app mem -> write kernel file buffer -> read kernel file buffer -> write to disk). With the vendor BER rating of 10^-12, that works out to somewhere around 600,000 single-bit ECC errors (which would be corrected by the ECC memory & BIOS), yet they reported only 44 errors in that time frame, which is VERY good. However, they did mention that they had a problem with errors being reported properly by the motherboard (to the IPMI level), so they may not have seen all of them.

    @m^2: actually you don't have 24 bits (the only bit-level RAID is RAID-2; all other parity RAIDs use byte or block parity), which is what you're describing, and that was one of the reasons RAID-2 was dropped (that, and you only got the best performance when your RAID-2 set had the same number of data spindles as your system's word size). The 3 'bits' were an illustration, not literal - sorry for the confusion. I was trying to say that you have 3 sources of information to check: the real (full) data and two different parity blocks. Each parity block can reconstruct a missing block of full data on that stripe width.
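    To make the "two different parity blocks" point concrete, here's a toy sketch of the two-syndrome idea (a made-up four-block stripe; real implementations are far more involved): P is plain XOR, Q weights each block by powers of the GF(2^8) generator, and together they locate AND repair a single corrupted block.

```shell
#!/bin/bash
# Toy RAID-6 locate-and-repair sketch over GF(2^8), generator g=2,
# reduction polynomial 0x11d. P = xor of data; Q = sum of g^i * d_i.
gmul2() {                       # multiply by g=2 in GF(2^8)
  local x=$(( $1 << 1 ))
  if (( x & 0x100 )); then x=$(( x ^ 0x11d )); fi
  echo $(( x & 0xff ))
}

syndromes() {                   # compute P and Q over the current stripe
  P=0; Q=0
  local i
  for (( i=${#data[@]}-1; i>=0; i-- )); do   # Horner's rule for Q
    Q=$(( $(gmul2 "$Q") ^ data[i] ))
    P=$(( P ^ data[i] ))
  done
}

data=(17 42 99 7)               # one block per "drive"
syndromes; P0=$P; Q0=$Q         # parity as written to disk

data[2]=$(( data[2] ^ 8 ))      # silent corruption on "drive" 2

syndromes                       # parity as recomputed on read
Sp=$(( P0 ^ P )); Sq=$(( Q0 ^ Q ))   # Sp = error value, Sq = g^j * Sp

t=$Sp
for (( j=0; j<${#data[@]}; j++ )); do   # find j with g^j * Sp == Sq
  if [ "$t" -eq "$Sq" ]; then
    echo "bad drive $j, repaired value $(( data[j] ^ Sp ))"
    break
  fi
  t=$(gmul2 "$t")
done
```

    With P alone you only know the stripe is inconsistent; the second, differently-weighted syndrome pins down which block changed, and XOR-ing the error value back in repairs it.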

    |.Server/Storage System.............|.Gaming/Work System..............................|.Sundry...... ............|
    |.Supermico X8DTH-6f................|.Asus Z9PE-D8 WS.................................|.HP LP3065 30"LCD Monitor.|
    |.(2) Xeon X5690....................|.2xE5-2643 v2....................................|.Mino lta magicolor 7450..|
    |.(192GB) Samsung PC10600 ECC.......|.2xEVGA nVidia GTX670 4GB........................|.Nikon coolscan 9000......|
    |.800W Redundant PSU................|.(8x8GB) Kingston DDR3-1600 ECC..................|.Quantum LTO-4HH..........|
    |.NEC Slimline DVD RW DL............|.Corsair AX1200..................................|........ .................|
    |.(..6) LSI 9200-8e HBAs............|.Lite-On iHBS112.................................|.Dell D820 Laptop.........|
    |.(..8) ST9300653SS (300GB) (RAID0).|.PA120.3, Apogee, MCW N&S bridge.................|...2.33Ghz; 8GB Ram;......|
    |.(112) ST2000DL003 (2TB) (RAIDZ2)..|.(1) Areca ARC1880ix-8 512MiB Cache..............|...DVDRW; 128GB SSD.......|
    |.(..2) ST9146803SS (146GB) (RAID-1)|.(8) Intel SSD 520 240GB (RAID6).................|...Ubuntu 12.04 64bit.....|
    |.Ubuntu 12.04 64bit Server.........|.Windows 7 x64 Pro...............................|............... ..........|

  25. #25
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    @m^2 - just stumbled on this, which may help you out in understanding raid-6 a bit more: http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf Plus this older one from adaptec http://storageadvisors.adaptec.com/2...tiple-raid-6s/ which highlights the major types of raid-6 implementations.
    EDIT: Just found this recent discussion as well regarding the problems (2nd video) of raid-5 & 6 with large drives, right from Garth Gibson himself: http://storagemojo.com/2008/12/16/ga...puter-storage/

    @Serra: here's an updated version of a very simple MD5 checker that I wrote up last night. I was going to do something self-contained, but noticed that afick is a similar tool written in Perl that works on Windows & Unix. Mine is missing some items like multi-threading and handling of the full Unicode character set; I've sent the author an e-mail to see if those can be added. Below is just for reference, for anyone who wants some bad shell code.
    Code:
    #!/bin/bash
    # $Id: md5check.sh 002 2009-01-02 06:00:00Z stevecs $
    ##
    # File System based MD5 Check
    #
    ## Revision log:
    # 20081229 - Version 0.01 - Just a rough draft
    # 20090102 - Version 0.02 - create dirs automatically, comments on use
    #
    ##
    # External Command Requirements:
    # bash
    # md5sum
    # find
    # sed
    # awk
    ##
    
    ##
    # Use:
    # Put into crontab or run from the command line
    # first run the script in create mode against the directory you want to watch:
    #	md5check.sh <directory> create {ignore string}
    # then run checks against that directory periodically
    #	md5check.sh <directory> check
    # re-run create whenever to create a new baseline
    ##
    
    ## NOTES - things to do
    ## - Use memory array to store output until done then write (avoids filesystem I/O)
    ## - Re-do as a proper getops/case for arguments
    ## - Trap ctrl-c and kill children automatically
    ## - On hash compare failure, re-read again to check to avoid transients
    ## - Create self-contained database so as to only update hashes with new files when create, and/or remove deleted files from check queue
    ## - Need to handle someone creating a file w/ same fqn (delete,replace) to update db.
    ## - Very inefficient when dealing with very small files (where spawning and file system directory lookup takes most of the cycles)
    ## - Should be able to support multiple different directory checks (now only one)
    ##
    
    #######################
    ## Trap user input
    #######################
    trap BashTrap INT
    
    
    
    #######################
    ## Variables
    #######################
    
    # Set calling argument variables for usage
    BASENAME=$0
    MNTPNT=$1
    CHECK=$2
    IGNORE=$3
    
    # Set static variables
    CPUS=`grep -c '^processor' /proc/cpuinfo`
    CURDATE=`date +%Y%m%d%H%M%S`
    HASHDIR="/var/log/md5sum"
    OLDIFS="$IFS"
    IFS="
    "
    
    
    
    #######################
    ## Functions
    #######################
    
    function Usage () {
      echo -e "usage: `basename ${BASENAME}` DIRECTORY [create|check] {IGNOREDIR}\n"
    }
    
    
    function BashTrap () {
      echo "${BASENAME}: CTRL+C Detected !... executing bash trap !"
      exit 2
    }
    
    
    function CheckDirs () {
      local CMDRSLT=0
      local DIR=""
      for DIR in "${HASHDIR}" "${HASHDIR}/md5hash" "${HASHDIR}/md5check" ; do
        if [[ ! -d ${DIR} && -n ${DIR} ]]; then
          echo "${BASENAME}: INFO: ${DIR} does not exist..  Creating ${DIR}"
          mkdir -p ${DIR}
          CMDRSLT=$?
          if [ ${CMDRSLT} -gt 0 ]; then
            echo "${BASENAME}: ERROR: You don't have permission to create directory ${HASHDIR}"
            exit 1
          fi
        fi
      done
    }
    
    
    function Md5CreateFileArray {
      local FILE=""
      local INDEX=0
      echo "${BASENAME}: INFO: ${FUNCNAME}: gathering list of files to operate on..."
      if [ -z "${IGNORE}" ]; then
         for FILE in `find "${MNTPNT}" -type f -print0 | xargs -0 -i echo -e -n "{}\n"` ; do
           FILEARRAY[${INDEX}]="${FILE}"
           ((INDEX++))
         done
      else
         for FILE in `find "${MNTPNT}" -type f -print0 | xargs -0 -i echo -e -n "{}\n" | grep -v $IGNORE` ; do
           FILEARRAY[${INDEX}]="${FILE}"
           ((INDEX++))
         done
      fi
    }
    
    
    function Md5CheckFileArray {
      local LINE=""
      local INDEX=0
      local FILELST=""
      echo "${BASENAME}: INFO: ${FUNCNAME}: reading in old md5sum output file..."
      FILELST=`find "${HASHDIR}/md5hash" -type f -name md5sum\* -print | xargs ls -rdt | tail -1`
      if [ -f "${FILELST}" ]; then
        exec 10<${FILELST}
        while read LINE <&10; do
          FILEARRAY[${INDEX}]="${LINE}"
          ((INDEX++))
        done
        exec 10>&-
      else
         echo "${BASENAME}: ERROR: ${FUNCNAME}: old md5sum file \"${FILELST}\" does not exist or is not readable"
         exit 1
      fi
    }
    
    
    function ChildQueueWait {
      local INDEX=0
      local PID=""
      while [ ${#CHILDPIDARRAY[@]} -eq ${CPUS} ]; do
        INDEX=0
        for PID in ${CHILDPIDARRAY[@]} ; do
          if [ ! -d /proc/${PID} ]; then
    	 unset CHILDPIDARRAY[${INDEX}]
          fi
          ((INDEX++))
        done
      done
    }
    
    
    function ChildQueueAddPid {
      CHILDPIDARRAY=( "${CHILDPIDARRAY[@]}" $1 )
    }
    
    
    function Md5Check {
      local FAINDEX=0
      local PID=""
      declare -a CHILDPIDARRAY
      echo "${BASENAME}: INFO: starting ${FUNCNAME}..."
      while [ ${FAINDEX} -lt ${#FILEARRAY[@]} ]; do
        if [ ${#CHILDPIDARRAY[@]} -lt ${CPUS} ]; then
          echo "${FILEARRAY[${FAINDEX}]}" | md5sum -c - 2>/dev/null | grep -v ": OK" >> ${HASHDIR}/md5check/md5check.${CURDATE} &
          PID=$!
          ChildQueueAddPid $PID
          ((FAINDEX++))
        fi
        ChildQueueWait
      done
    }
    
    
    function Md5Create {
      local FAINDEX=0
      local PID=""
      declare -a CHILDPIDARRAY
      echo "${BASENAME}: INFO: starting ${FUNCNAME}..."
      while [ ${FAINDEX} -lt ${#FILEARRAY[@]} ]; do
        if [ ${#CHILDPIDARRAY[@]} -lt ${CPUS} ]; then
          md5sum -b "${FILEARRAY[${FAINDEX}]}" >> ${HASHDIR}/md5hash/md5sum.${CURDATE} 2>/dev/null &
          PID=$!
          ChildQueueAddPid $PID
          ((FAINDEX++))
        fi
        ChildQueueWait
      done
    }
    
    
    
    #######################
    ## Main Program
    #######################
    
    # Let's be friendly to the rest of the system
    renice 10 -p $$ > /dev/null 2>&1
    
    if [ -z "${MNTPNT}" ]; then
            Usage
            exit 1
    fi
    
    if [ "${CHECK}" = "check" ]; then
    	Md5CheckFileArray
    	Md5Check
            exit
    fi
    
    if [ "${CHECK}" = "create" ]; then
    	CheckDirs
    	Md5CreateFileArray
    	Md5Create
            exit
    fi
    
    # No valid mode given
    Usage
    exit 1
    Last edited by stevecs; 01-02-2009 at 03:47 AM. Reason: Added some links for background info

