Swiftech

Thread: Recovering an XP installation from RAID0 failure


    Intro:

    I wanted to share my experience of restoring a RAID0 disk array after a RAID software failure. I was stunned by how little information and how few tools are available on the web, and could not believe how close we came to losing all our data even though there was NO HARDWARE FAILURE on any of the disks. The story below explains in detail how to rescue your data from such a situation with freely available tools.

    The problem:

    This computer suffered power interruptions due to faulty cables and overloaded UPS units, and by the time I got to it, it wouldn't boot. The computer had an Intel D875PBZ motherboard with an integrated software (fake) RAID controller and two identical 120GB SATA disks. During bootup, the RAID BIOS screen said that one of the RAID disks was "reporting error". Windows XP SP2, which was installed on this computer, would come up with the "choose last known good configuration" menu, but none of the options led out of that screen. The computer would either reset or hang.

    Disappointing scarcity of tools from Intel:

    Entering the Intel BIOS menu during bootup showed both SATA disks intact, so I hoped there was no hardware error. However, since I couldn't boot into any operating system, there was no way to tell for sure. It was very disappointing: Intel did not supply any recovery utilities that could be run from a floppy or a CD. The RAID BIOS menu had only three options: create, delete, or revert to non-RAID configuration, each showing big flashing warnings about how I'd lose all my data.

    Diagnosis:

    I booted the computer with a Knoppix Linux Live DVD v5.0.1. This is a bootable DVD (it comes as a CD, too) that runs Linux as if it were installed locally on the computer, without requiring any real installation. I chose this version because it came with the "dmraid" utility, which can read the on-disk metadata written by fake RAID BIOSes.

    First I checked the disks for hardware errors by looking at the logs and by reading raw data from the /dev/sda and /dev/sdb disk device files; both disks seemed to work happily. fdisk even showed a 230 GB NTFS partition on the first disk. Since that is bigger than the physical disk, the partition had to come from the RAID configuration. dmraid recognized the RAID0 configuration, but showed only one disk as available. Since the RAID BIOS similarly reported a disk in error state, the cause had to be corrupt metadata on the disk itself. My immediate thought was: "Why doesn't the Intel RAID BIOS offer an option to fix or recover the metadata?"
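
    The "partition bigger than the disk" observation can also be checked without fdisk. Below is a minimal sketch, assuming a classic MBR partition table and 512-byte sectors; the function names le32 and part1_spans_array are made up for this illustration. It reads the start and length of the first partition entry (bytes 454 and 458 of sector 0) and compares the end of the partition with the size of the disk.

```shell
#!/bin/sh
# le32 FILE OFFSET: print the little-endian 32-bit integer at byte OFFSET.
le32() {
    # od prints the four bytes as decimal numbers; they become $1..$4.
    set -- $(od -A n -t u1 -j "$2" -N 4 "$1")
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}

# part1_spans_array DISK DISK_SECTORS: succeed if the first MBR partition
# ends beyond DISK_SECTORS, i.e. it cannot fit on this one physical disk.
part1_spans_array() {
    start=$(le32 "$1" 454)   # LBA of the first sector of partition 1
    count=$(le32 "$1" 458)   # number of sectors in partition 1
    [ $(( start + count )) -gt "$2" ]
}
```

    On the real machine this could be called as part1_spans_array /dev/sda "$(blockdev --getsz /dev/sda)". Both commands only read from the disk.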

    Debugging:

    Out of desperation, I decided to debug the metadata. I used dmraid to get the metadata block and its location on the second disk:

    Code:
    $ dmraid -r /dev/sdb -D
    This created three files: sdb_isw.dat with the actual metadata, sdb_isw.offset with the offset of the metadata into the disk in bytes, and sdb_isw.size with the size of the RAID disk in blocks. This didn't work for the first disk, because dmraid complained that it could not find the metadata. It turns out that the metadata for Intel RAID consists of a single 512-byte disk block that sits very close to the end of the disk. The RAID data area itself starts at the beginning of the disk, which explains why I could see the partition information with fdisk when looking at the first disk.
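
    To locate the metadata by hand instead of relying on dmraid, one can scan the last few sectors of a disk for the signature string that appears in the "dmraid -n" report. The following is a rough sketch under the assumption of 512-byte sectors; find_isw_sector is a hypothetical helper name, and it only reads from the device.

```shell
#!/bin/sh
# find_isw_sector IMAGE [COUNT]: print the index of every one of the last
# COUNT (default 8) 512-byte sectors that contains the ISW signature.
find_isw_sector() {
    img=$1
    tail_sectors=${2:-8}
    if [ -b "$img" ]; then
        size=$(blockdev --getsize64 "$img")   # block device: ask the kernel
    else
        size=$(wc -c < "$img")                # regular file: byte count
    fi
    total=$(( size / 512 ))
    i=$(( total - tail_sectors ))
    [ "$i" -lt 0 ] && i=0
    while [ "$i" -lt "$total" ]; do
        # grep -q prints nothing; only its exit status is used.
        if dd if="$img" bs=512 skip="$i" count=1 2>/dev/null |
            LC_ALL=C grep -q 'Intel Raid ISM Cfg Sig\.'; then
            echo "$i"
        fi
        i=$(( i + 1 ))
    done
}
```

    For example, find_isw_sector /dev/sdb 16 would list the candidate sectors among the last 16, which can then be dumped with dd and hexdump for inspection.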

    The above command can also be used to take backups of the metadata. Note that Knoppix doesn't modify the disks, so you are actually working on a ramdisk: whenever you reboot the machine, you lose the files you created. Make sure to transfer the files you'd like to keep over the network to a safe place.

    Before trying to copy the above metadata block to the first disk, I checked the same location on it looking for metadata remains. I copied the block from the first (bad) disk:
    Code:
    $ dd if=/dev/sda of=sda_isw-test.dat bs=512 skip=$(( $(cat sdb_isw.offset) / 512 - 1 )) count=1
    (Divide by 512 to convert the byte offset into a block number. You may have a different block size; check the good RAID metadata or the disk information. I had to subtract 1 from the block number!)

    Surprisingly, a binary diff of the metadata blocks from the two disks gave a perfect match:

    Code:
    $ diff sdb_isw.dat sda_isw-test.dat
    (match!)
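
    The same extract-and-compare step can be rehearsed on image files before touching real disks. This is a minimal sketch assuming 512-byte sectors; the helper names are invented for this example, and the "- 1" in offset_to_sector simply mirrors the correction described above rather than a documented rule. cmp is used in place of diff because it compares byte by byte.

```shell
#!/bin/sh
# copy_sector SRC SECTOR OUT: copy one 512-byte sector out of SRC (read-only).
copy_sector() {
    dd if="$1" of="$3" bs=512 skip="$2" count=1 2>/dev/null
}

# offset_to_sector BYTES: convert a byte offset (as in sdb_isw.offset) to a
# sector index, including the empirical "- 1" from the post (an assumption).
offset_to_sector() {
    echo $(( $1 / 512 - 1 ))
}

# same_sector A B SECTOR: succeed if both images hold identical bytes there.
same_sector() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 2
    copy_sector "$1" "$3" "$tmp_a"
    copy_sector "$2" "$3" "$tmp_b"
    if cmp -s "$tmp_a" "$tmp_b"; then rc=0; else rc=1; fi
    rm -f "$tmp_a" "$tmp_b"
    return "$rc"
}
```

    With the files from dmraid in place, the check would look like same_sector /dev/sda /dev/sdb "$(offset_to_sector "$(cat sdb_isw.offset)")".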

    At this point I didn't know what to do, so I inspected the remaining blocks at the end of the disk. There seemed to be some more RAID metadata, but I had no guide for interpreting it. Upon later reflection, I saw that in the metadata (see the output of "dmraid -n" below), the first disk had the status "0x13e" whereas the second disk had "0x13a". This may be why both dmraid and the RAID BIOS complained. Unfortunately I did not try changing this value.
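
    The two status words in the "dmraid -n" report (0x13e for disk[0], 0x13a for disk[1]) differ in a single bit, which a little shell arithmetic makes visible. This is only an illustration; differing_bits is a made-up helper. Bit 0x04 appears to correspond to the FAILED_DISK flag in the Linux mdadm source for this metadata format, though that reading is an assumption that was not verified on this machine.

```shell
#!/bin/sh
# differing_bits A B: print, in hex, the bits that differ between two words.
differing_bits() {
    printf '0x%x\n' $(( $1 ^ $2 ))
}

# Status of disk[0] versus disk[1] before the fix:
differing_bits 0x13e 0x13a   # prints 0x4
```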


    Brute-force solution:

    I got some advice from a fellow admin: take full backups and use the RAID BIOS to recreate the RAID array. Since I'd be creating the same configuration, it shouldn't destroy the data. You think?

    I streamed the contents of both raw disks using ssh and bzip2 over the ethernet:

    Code:
    $ bzip2 -c -1 /dev/sda | ssh somebody@another.computer "cat > /path/to/disk-a.img.bz2"
    $ bzip2 -c -1 /dev/sdb | ssh somebody@another.computer "cat > /path/to/disk-b.img.bz2"
    Using "bzip2 -9" was overkill: it took about 12 hours to copy 230 GB, so I have "-1" in the example. I used compression because only the first 30 GB of the 230 GB were in use, and I hoped the rest would compress to almost nothing. It did: I got two 13 GB files as backup images of the disks.
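
    A backup is only useful if it restores bit for bit, so it is worth comparing checksums before changing anything on the disks. Here is a minimal sketch, assuming local files rather than the ssh pipeline above; backup_image and verify_backup are hypothetical helper names, and COMPRESS and DECOMPRESS are illustrative variables that default to the bzip2 tools used in this post.

```shell
#!/bin/sh
# Compressor pair; the defaults match the bzip2 commands used in this post.
COMPRESS=${COMPRESS:-bzip2}
DECOMPRESS=${DECOMPRESS:-bunzip2}

# backup_image SRC DEST: compress SRC at the fastest level (-1) into DEST.
backup_image() {
    "$COMPRESS" -c -1 < "$1" > "$2"
}

# verify_backup SRC DEST: succeed only if DEST decompresses to SRC's bytes.
verify_backup() {
    want=$(cksum < "$1")
    got=$("$DECOMPRESS" -c < "$2" | cksum)
    [ "$want" = "$got" ]
}
```

    On the live system SRC would be /dev/sda or /dev/sdb, which means reading each disk twice. cksum is only a CRC; a stronger tool such as sha1sum could be substituted in the same two places.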

    Then I recreated the same RAID array. Note: keep the same volume name and stripe size as before! dmraid reports them correctly in its "dmraid -n" native mode report (see below).
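
    Both values can be pulled out of a saved "dmraid -n" report so they are at hand when filling in the RAID BIOS form. The sketch below assumes 512-byte blocks and a volume name without spaces; raid_params is a hypothetical helper name.

```shell
#!/bin/sh
# raid_params FILE: print the volume name and the strip size in KiB found in
# a saved "dmraid -n" report. Assumes 512-byte blocks and a one-word name.
raid_params() {
    awk '
        /isw_dev\[0\]\.volume:/ { gsub(/"/, ""); name = $NF }
        /blocks_per_strip:/     { kib = $NF * 512 / 1024 }
        END { printf "volume=%s strip=%dKiB\n", name, kib }
    ' "$1"
}
```

    For the report shown in the Details section (volume "MPC", blocks_per_strip 256) this prints volume=MPC strip=128KiB.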

    The result was that the RAID BIOS was happy, but there was nothing left of the existing NTFS filesystem. No partition table, no boot sector. Thank you Intel RAID BIOS!

    I went in with Knoppix again, used "hexdump -C /dev/sda" to look at the raw disks, and compared them to the backup copies. It felt so much safer to have full backups. Apparently 0x2000 bytes had been zeroed out on the first disk and 0x200 bytes on the second. I used dd to restore them from the backup:

    Code:
    $ ssh somebody@another.computer bunzip2 -c /path/to/disk-a.img.bz2 | dd of=/dev/sda bs=1 count=$((0x2000))
    After this step, dmraid could recognize the array and the NTFS partition. They appeared as device files like /dev/mapper/isw_*. As a happy ending, I could mount the partition in Linux and browse the files. I ran Samba to make the drive available as a Windows share, so all data from the original disk was available.
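
    The header restore can be rehearsed on ordinary files first. The following sketch assumes a decompressed backup image on local storage instead of the ssh stream; restore_head is a hypothetical helper name.

```shell
#!/bin/sh
# restore_head BACKUP TARGET NBYTES: copy the first NBYTES bytes of BACKUP
# over the start of TARGET, leaving the rest of TARGET untouched.
restore_head() {
    # $(( )) turns hex such as 0x2000 into decimal. GNU dd would read a
    # literal "0x2000" as 0 times 2000, that is, as a count of zero.
    nbytes=$(( $3 ))
    # bs=1 makes count a byte count; conv=notrunc stops dd from truncating
    # TARGET when it is a plain file instead of a device.
    dd if="$1" of="$2" bs=1 count="$nbytes" conv=notrunc 2>/dev/null
}
```

    Comparing the result with cmp against the backup, limited to the restored range, is a cheap way to confirm the write before rebooting.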

    Conclusion:

    I saved the data on the RAID0 disk using the Knoppix Live DVD and making it a Windows share on the local network.

    Unfortunately, Windows never booted up properly. It must have corrupted some files when booting with half of the disk missing. After the fix, it would even display the graphical loading screen, but it would reboot unconditionally afterwards.

    We ended up installing a fresh copy of Windows after taking a backup from the Samba share. We didn't try the "recover" option during Windows setup because our Windows admin preferred to restore a preinstalled Ghost image, which lacked the necessary RAID drivers and never booted. So we had to do a fresh install, which worked.

    I hope this helps anybody trying to recover data. Please let me know if anything in this post is wrong.

    A complaint about floppy drives:

    Note that you will need a FLOPPY DRIVE to install Windows on this computer. Is it only me or is this ridiculous? It has been several years since I removed the last floppy drive from my desktop machine. I don't see any reason to use them anymore since one can boot from CD drives. Anyway, Windows won't install unless you supply the RAID drivers on a floppy disk after pressing F6 during installation.

    Update: I learned that there's a way around using floppy drives, by creating an XP installation CD that includes the RAID drivers. That's a story for another day, though.

    Caveats:

    USE AT YOUR OWN RISK! This is personal experience and applies only under similar circumstances. I give no guarantee that it will work in your case, and I assume no responsibility if you decide to use it and lose data.

    Details:

    Output of "dmraid -n" before fixing anything:
    Code:
    NOTICE: checking format identifier isw
    WARN: locking /var/lock/dmraid/.lock
    NOTICE: skipping removable device /dev/hdd
    NOTICE: /dev/sda: isw    discovering
    NOTICE: /dev/sdb: isw    discovering
    NOTICE: /dev/sdb: isw metadata discovered
    INFO: RAID device discovered:
    
    /dev/sdb (isw):
    0x000 sig: "  Intel Raid ISM Cfg Sig. 1.0.00"
    0x020 check_sum: 3538140745
    0x024 mpb_size: 480
    0x028 family_num: 3924843586
    0x02c generation_num: 962
    0x030 reserved[0]: 4992
    0x034 reserved[1]: 3221225472
    0x038 num_disks: 2
    0x039 num_raid_devs: 1
    0x03a fill[0]: 2
    0x03b fill[1]: 0
    0x0d8 disk[0].serial: "        3JT48QNN"
    0x0e8 disk[0].totalBlocks: 234441648
    0x0ec disk[0].scsiId: 0x0
    0x0f0 disk[0].status: 0x13e
    0x108 disk[1].serial: "        3JT421B6"
    0x118 disk[1].totalBlocks: 234441648
    0x11c disk[1].scsiId: 0x10000
    0x120 disk[1].status: 0x13a
    0x138 isw_dev[0].volume: "             MPC"
    0x14c isw_dev[0].SizeHigh: 0
    0x148 isw_dev[0].SizeLow: 468882432
    0x150 isw_dev[0].status: 0x0
    0x154 isw_dev[0].reserved_blocks: 0
    0x158 isw_dev[0].filler[0]: 65536
    0x190 isw_dev[0].vol.migr_state: 0
    0x191 isw_dev[0].vol.migr_type: 0
    0x192 isw_dev[0].vol.dirty: 0
    0x193 isw_dev[0].vol.fill[0]: 0
    0x1a8 isw_dev[0].vol.map.pba_of_lba0: 0
    0x1ac isw_dev[0].vol.map.blocks_per_member: 234441216
    0x1b0 isw_dev[0].vol.map.num_data_stripes: 915786
    0x1b4 isw_dev[0].vol.map.blocks_per_strip: 256
    0x1b6 isw_dev[0].vol.map.map_state: 0
    0x1b7 isw_dev[0].vol.map.raid_level: 0
    0x1b8 isw_dev[0].vol.map.num_members: 2
    0x1b9 isw_dev[0].vol.map.reserved[0]: 1
    0x1bb isw_dev[0].vol.map.reserved[2]: 1
    0x1d8 isw_dev[0].vol.map.disk_ord_tbl[0]: 0x0
    0x1dc isw_dev[0].vol.map.disk_ord_tbl[1]: 0x1
    
    WARN: unlocking /var/lock/dmraid/.lock
    Output of "dmraid -n" after fixing the array:
    Code:
    0x000 sig: "  Intel Raid ISM Cfg Sig. 1.0.00"
    0x020 check_sum: 3554720388
    0x024 mpb_size: 480
    0x028 family_num: 3924843586
    0x02c generation_num: 1
    0x030 reserved[0]: 4992
    0x034 reserved[1]: 3221225472
    0x038 num_disks: 2
    0x039 num_raid_devs: 1
    0x03a fill[0]: 0
    0x03b fill[1]: 0
    0x0d8 disk[0].serial: "        3JT48QNN"
    0x0e8 disk[0].totalBlocks: 234441648
    0x0ec disk[0].scsiId: 0x0
    0x0f0 disk[0].status: 0x13a
    0x108 disk[1].serial: "        3JT421B6"
    0x118 disk[1].totalBlocks: 234441648
    0x11c disk[1].scsiId: 0x10000
    0x120 disk[1].status: 0x13a
    0x138 isw_dev[0].volume: "             MPC"
    0x14c isw_dev[0].SizeHigh: 0
    0x148 isw_dev[0].SizeLow: 468882432
    0x150 isw_dev[0].status: 0x0
    0x154 isw_dev[0].reserved_blocks: 0
    0x158 isw_dev[0].filler[0]: 65536
    0x190 isw_dev[0].vol.migr_state: 0
    0x191 isw_dev[0].vol.migr_type: 0
    0x192 isw_dev[0].vol.dirty: 0
    0x193 isw_dev[0].vol.fill[0]: 0
    0x1a8 isw_dev[0].vol.map.pba_of_lba0: 0
    0x1ac isw_dev[0].vol.map.blocks_per_member: 234441216
    0x1b0 isw_dev[0].vol.map.num_data_stripes: 915786
    0x1b4 isw_dev[0].vol.map.blocks_per_strip: 256
    0x1b6 isw_dev[0].vol.map.map_state: 0
    0x1b7 isw_dev[0].vol.map.raid_level: 0
    0x1b8 isw_dev[0].vol.map.num_members: 2
    0x1b9 isw_dev[0].vol.map.reserved[0]: 1
    0x1ba isw_dev[0].vol.map.reserved[1]: 255
    0x1bb isw_dev[0].vol.map.reserved[2]: 1
    0x1d8 isw_dev[0].vol.map.disk_ord_tbl[0]: 0x0
    0x1dc isw_dev[0].vol.map.disk_ord_tbl[1]: 0x1
    
    /dev/sdb (isw):
    0x000 sig: "  Intel Raid ISM Cfg Sig. 1.0.00"
    0x020 check_sum: 3554720388
    0x024 mpb_size: 480
    0x028 family_num: 3924843586
    0x02c generation_num: 1
    0x030 reserved[0]: 4992
    0x034 reserved[1]: 3221225472
    0x038 num_disks: 2
    0x039 num_raid_devs: 1
    0x03a fill[0]: 0
    0x03b fill[1]: 0
    0x0d8 disk[0].serial: "        3JT48QNN"
    0x0e8 disk[0].totalBlocks: 234441648
    0x0ec disk[0].scsiId: 0x0
    0x0f0 disk[0].status: 0x13a
    0x108 disk[1].serial: "        3JT421B6"
    0x118 disk[1].totalBlocks: 234441648
    0x11c disk[1].scsiId: 0x10000
    0x120 disk[1].status: 0x13a
    0x138 isw_dev[0].volume: "             MPC"
    0x14c isw_dev[0].SizeHigh: 0
    0x148 isw_dev[0].SizeLow: 468882432
    0x150 isw_dev[0].status: 0x0
    0x154 isw_dev[0].reserved_blocks: 0
    0x158 isw_dev[0].filler[0]: 65536
    0x190 isw_dev[0].vol.migr_state: 0
    0x191 isw_dev[0].vol.migr_type: 0
    0x192 isw_dev[0].vol.dirty: 0
    0x193 isw_dev[0].vol.fill[0]: 0
    0x1a8 isw_dev[0].vol.map.pba_of_lba0: 0
    0x1ac isw_dev[0].vol.map.blocks_per_member: 234441216
    0x1b0 isw_dev[0].vol.map.num_data_stripes: 915786
    0x1b4 isw_dev[0].vol.map.blocks_per_strip: 256
    0x1b6 isw_dev[0].vol.map.map_state: 0
    0x1b7 isw_dev[0].vol.map.raid_level: 0
    0x1b8 isw_dev[0].vol.map.num_members: 2
    0x1b9 isw_dev[0].vol.map.reserved[0]: 1
    0x1ba isw_dev[0].vol.map.reserved[1]: 255
    0x1bb isw_dev[0].vol.map.reserved[2]: 1
    0x1d8 isw_dev[0].vol.map.disk_ord_tbl[0]: 0x0
    0x1dc isw_dev[0].vol.map.disk_ord_tbl[1]: 0x1
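
    The numbers in these reports can be cross-checked with plain shell arithmetic, which is a quick sanity test before recreating an array. The relations below are simply what holds for this particular RAID0 report, not a documented rule of the format; check_geometry is a made-up helper name.

```shell
#!/bin/sh
# check_geometry STRIPES BLOCKS_PER_STRIP BLOCKS_PER_MEMBER MEMBERS VOLUME_BLOCKS
# Succeeds if strips fill each member exactly and members add up to the volume.
check_geometry() {
    [ $(( $1 * $2 )) -eq "$3" ] && [ $(( $3 * $4 )) -eq "$5" ]
}

# Values copied from the report: num_data_stripes, blocks_per_strip,
# blocks_per_member, num_members, SizeLow.
check_geometry 915786 256 234441216 2 468882432 && echo "geometry consistent"

# Blocks on each disk that are not part of the volume
# (totalBlocks minus blocks_per_member):
echo $(( 234441648 - 234441216 ))   # prints 432
```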
    Last edited by cengique; 09-16-2006 at 11:12 AM.
