Raid Issues on Blackops



negev
09-15-2008, 11:39 PM
Okay I have 3 x 1TB drives in RAID5. At the weekend I was priming at higher fsbs and the nb got very hot and one of the drives in the array failed. This happened once before with the 3rd drive, and when it happened before I just marked it as normal and 2 days of rebuilding later it was working fine.

This time I wasn't so lucky. It has been rebuilding since the weekend, and this morning I got up to find that the 1st drive had also failed before the rebuild had finished. I tried marking the failed drive as normal in Intel Matrix, but it wouldn't do anything :/ Then Windows froze: I could move the mouse but there was no HD activity and I couldn't click on anything, couldn't run taskman etc, so I had no choice but to reboot.

Now on reboot, all drives are online but the array is in 'failed' status and is unbootable.

Please tell me I haven't lost over a TB of data! If I borrow a HD from work and install windows on it, is there any way to recover the array? Or even just the data from it? I don't even want to consider this not being recoverable :S

RAYTTK
09-16-2008, 01:23 AM
I had a problem with a lost RAID 0 config saying it had failed on boot up. I run Vista, so I put the OS disc back in and selected repair. It couldn't find an OS to repair until I reinstalled the SATA driver, and everything was OK about 5 mins later.
I hope you can get sorted good luck.

negev
09-16-2008, 01:32 AM
I've found some tools that I can apparently recover the data with, but I need another 2TB of storage to write the recovered image to :/

negev
09-16-2008, 02:43 AM
Hey RAYTTK, how did you reinstall the sata driver when you couldn't even find the Windows volume to repair? I'm confused..

RAYTTK
09-16-2008, 06:42 AM
There is an option to install a driver in the Vista repair function. I just put the SATA driver on a memory stick, and when given the option to search for the driver to install I pointed it at the USB stick and installed the driver. Then the OS was found.

RAYTTK
09-16-2008, 06:44 AM
Sorry, I just realised I meant the RAID driver, doh

negev
09-16-2008, 06:45 AM
Yea i figured out what you meant earlier and tried it but unfortunately it didn't work for me :(

I've got a plan to recover the data but it's painful like you can't even imagine :/

I've borrowed three 750GB drives from work. I'm going to create a RAID0 array on them, then use Raid Reconstructor to recover the broken array into a 2TB image file on the RAID0. Once that's done I will (hopefully) be able to recover most of the data from the image file.

It's going to take soooo long though..
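
As a quick sanity check on the plan above, here is a minimal sketch of the capacity arithmetic. The function name `usable_gb` is made up for illustration, sizes are the decimal gigabytes quoted in the post, and filesystem and controller overhead are ignored, so the sizes a tool actually reports will differ somewhat.

```python
# Usable capacity for the two array layouts mentioned in the post.
# Sizes are in decimal gigabytes; overhead is ignored (an assumption).

def usable_gb(level: str, drive_gb: float, drives: int) -> float:
    """Return usable capacity for a couple of simple RAID levels."""
    if level == "raid0":
        return drive_gb * drives          # striping only, no redundancy
    if level == "raid5":
        if drives < 3:
            raise ValueError("RAID5 needs at least 3 drives")
        return drive_gb * (drives - 1)    # one drive's worth goes to parity
    raise ValueError(f"unsupported level: {level}")

broken_array = usable_gb("raid5", 1000, 3)   # the failed 3 x 1TB RAID5
scratch_array = usable_gb("raid0", 750, 3)   # the borrowed 3 x 750GB RAID0

print(f"image to recover: {broken_array:.0f} GB")
print(f"scratch space:    {scratch_array:.0f} GB")
print(f"headroom:         {scratch_array - broken_array:.0f} GB")
```

On these figures the image should fit, with roughly one third of a drive to spare.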

TheGanG
09-16-2008, 07:21 AM
Good luck friend :( hard work...

negev
09-16-2008, 07:34 AM
yeah, I'll probably get the image onto raid0, format the original disks and then the raid0 will break lol.

I'm turning everything down to stock clocks and voltage before I start :P

negev
09-16-2008, 07:44 AM
I think I know why this happened: if your overclock isn't 100% stable and you get a single CPU or memory error, it can break your RAID volume.

From: http://en.wikipedia.org/wiki/RAID

Firmware/driver based RAID

Operating system-based RAID cannot easily be used to protect the boot process and is generally impractical on desktop version of Windows (as described above). Hardware RAID controllers are expensive. To fill this gap cheap "RAID controllers" were introduced that do not contain a RAID controller chip, but simply a standard disk controller chip with special firmware and drivers. During early stage bootup the RAID is implemented by the firmware; when a protected-mode operating system kernel such as Linux or a modern version of Microsoft Windows is loaded the drivers take over.

These controllers are described by their manufacturers as RAID controllers, and it is rarely made clear to purchasers that the burden of RAID processing is borne by the host computer's central processing unit, not the RAID controller itself, thus introducing the aforementioned CPU overhead. Before their introduction, a "RAID controller" implied that the controller did the processing, and the new type has become known in technically knowledgeable circles as "fake RAID" even though the RAID itself is implemented correctly.



As the raid processing is done by the CPU, it's dangerous to overclock a system running fakeraid.
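
To illustrate why host-side errors matter here, the following is a minimal sketch of RAID5-style XOR parity on one stripe of a three-drive array (two data blocks plus one parity block). It shows the arithmetic only; it is not how the Intel Matrix driver is implemented, and the block contents and the `xor_blocks` helper are made up for the example.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a 3-drive RAID5: two data blocks and their parity.
d0 = b"hello wo"
d1 = b"rld!1234"
parity = xor_blocks(d0, d1)

# If the drive holding d1 drops out, it can be rebuilt from d0 and parity.
rebuilt = xor_blocks(d0, parity)
assert rebuilt == d1

# Simulate a single bit flipped in the host (CPU or RAM) while the
# parity was being computed: the rebuilt block is now silently wrong.
bad_parity = bytes([parity[0] ^ 0x01]) + parity[1:]
bad_rebuild = xor_blocks(d0, bad_parity)
assert bad_rebuild != d1
print(rebuilt, bad_rebuild)
```

With a dedicated controller this XOR runs on the card; with firmware/driver RAID it runs on the same CPU and memory that the overclock is stressing.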

Xello
09-16-2008, 08:22 AM
That's strange, i've tortured this system like you wouldn't believe trying to get my overclock settings and my raid 0 velociraptors have held up flawlessly throughout the whole process.

Maybe it was to do with your NB getting so hot :shrug:

jason4207
09-16-2008, 08:45 AM
I've lost too many RAID0 arrays to OCing. Right now I'm running RAID10, but it has to verify & repair whenever I get a lock-up (Damn ATI drivers!). I'm in the process of building an unRAID File Server, and I'm just going to keep my OS/games on a single Velociraptor in my main PC.

IMO RAID (or fake-RAID) and OCing don't mix well together.

negev
09-16-2008, 08:59 AM
@Xello

If the raid controller tried to tell your drives to drop out of the array they'd probably snarl and tell it to f*ck off lol

Seriously though, it's probably just bad luck. Unstable OC + cheap sw raid controller running raid stuff through cpu + cheap drives... the odds were against me.

I'm starting the recovery process now, got a 2047GB raid0 which I should be able to recover the array onto. After spending all day reading about this and finding the tools I am fairly confident I can recover the data.

negev
09-16-2008, 09:00 AM
@Jason

That 'verify and repair' thing worries me. It took about 12 hours on my 3 x 1TB raid5 to do that, and if you think it's doing that once a week or so, it's got to be shortening the life of the drives.

negev
09-16-2008, 09:16 AM
Whoa this is weird..

I just disconnected all my drives, connected up the temporary 750gb drives and installed Server 2008.

Then I disconnected the raid0 and reconnected my raid5 with the intention of setting the drives to non-raid mode, and suddenly it's detected the array and I can boot into Windows! It's rebuilding now :D

jason4207
09-16-2008, 02:53 PM
I've found that you lose any redundancy benefit w/ software RAID. I was all about going Matrix RAID0/5 last year. I had planned to use the RAID0 for super-fast throughput, and then have the RAID5 for data and to keep regular image back-ups of the RAID0(OS/Programs) on it as well. It would have worked, but the more research I did the more I learned how software RAID5 wasn't as good as I had thought. Constant rebuilds, and poor write performance really hurt it, and even though it has 'RAID5 redundancy' it really is not very safe (not nearly as good as hardware RAID5). In theory you can lose 1 drive and still be alright, but in practice there is so much more that can go wrong.

By the time I got these WD 640's I had changed my mind and decided to go RAID10. I'm pretty confident my data is safe now (yet I still have it backed up on my other PC), but I waste a lot more space, and the constant rebuilding is getting old.

Now I'm getting all my storage out of my gaming rig. All those HDD's just make things hotter in there anyway. I'm building a file server which is good for several reasons...I don't have to keep syncing the storage data b/n me and my wife's PC (I still don't trust the RAID10 completely), and I can get rid of all these damn DVDs (close to 600 now). I've been reading up on unRAID (http://lime-technology.com/). It seems to be a pretty elegant solution. If 1 drive fails you are fine. If 2 drives fail you only lose the data on 1 drive. It's easy to add/replace HDD's of varying sizes, and the file server can spin-down the HDD's that are not in use...saving power (you only access 1 HDD at any given time as opposed to RAID0/1/5/10 which needs to access multiple drives simultaneously). You get similar storage to RAID5 (n-1), and you can scale up to 15 HDD's together. And...this is good...you can completely swap out mobo, CPU, and RAM, and the unRAID still functions the same! No need to update any software or rebuild anything. You can even take a HDD out of the unRAID, throw it in any PC, and see all the files you have on there.

Glad you got it working again! :up:

negev
09-16-2008, 03:01 PM
That unraid does sound pretty cool... maybe I'll build a fileserver sometime.

Another problem with raid5, apart from errors from overclocking, is that if the drives were all bought at the same time then the probability of two drives failing within 24-48 hours of each other increases.

As my raid5 volume takes a good 48 hours to rebuild, this is worrying :/

As soon as this array is finished rebuilding, I am going to copy all the data onto the temp 750GB drives, then reformat the 1TB drives as non-raid and use rsync to manually replicate the data between all three. That way I have protection from two drive failures and from filesystem errors due to overclocking.

It's not a great solution, massive loss of usable space and massive overhead replicating the data, but I can't take the risk of losing all of this again.
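
A minimal sketch of what that manual replication could look like, assuming rsync is installed and on the PATH. The mount points and the `build_rsync_cmd` / `replicate` helpers are hypothetical; only the `-a` and `--delete` rsync options are real. By default it only prints the commands.

```python
import subprocess

# Hypothetical mount points; substitute the real paths of the three drives.
PRIMARY = "/mnt/disk1/data"
REPLICAS = ["/mnt/disk2/data", "/mnt/disk3/data"]

def build_rsync_cmd(src: str, dst: str) -> list[str]:
    """Build an rsync command that mirrors src into dst.

    -a preserves permissions and timestamps; --delete removes files from
    the replica that no longer exist on the primary. The trailing slashes
    make rsync copy the contents of src rather than the directory itself.
    """
    return ["rsync", "-a", "--delete", src.rstrip("/") + "/", dst.rstrip("/") + "/"]

def replicate(src: str, replicas: list[str], dry_run: bool = True) -> None:
    """Mirror src to every replica, or just print the commands if dry_run."""
    for dst in replicas:
        cmd = build_rsync_cmd(src, dst)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises if rsync exits non-zero

replicate(PRIMARY, REPLICAS)  # dry run: prints two commands, copies nothing
```

One design caveat: with `--delete`, an accidental deletion on the primary is copied to the replicas on the next run, so a plain mirror protects against drive failure but not against user error.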

RAID5 sucks even if it's hardware: two drive failures and you're screwed. If you think about it, even if your controller supports a hot spare, you can have 3 drives in raid5 with one hot spare. After a year or two, one of the drives fails. The other two drives might still be running, but could be a little iffy. Now when that first drive fails, the hot spare kicks in and starts the rebuild, and the strain from all that drive activity could take out another one.
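
For a rough feel of the rebuild-window risk described above, here is a minimal sketch using a simple model: independent drives with a constant failure rate. The 5% annual failure rate is an assumed illustrative number, not a measured one, and `p_second_failure` is a made-up helper name.

```python
import math

def p_second_failure(remaining_drives: int, afr: float, rebuild_hours: float) -> float:
    """Chance that at least one remaining drive fails during the rebuild.

    Assumes independent drives with a constant failure rate derived from
    the annual failure rate (afr). Both are simplifying assumptions.
    """
    rate_per_hour = -math.log(1.0 - afr) / (365 * 24)
    return 1.0 - math.exp(-remaining_drives * rate_per_hour * rebuild_hours)

# 3-drive RAID5 with one drive down: 2 survivors, 48-hour rebuild.
p = p_second_failure(remaining_drives=2, afr=0.05, rebuild_hours=48)
print(f"baseline chance of a second failure during rebuild: {p:.4%}")
```

Under the independence assumption the number comes out small. The argument in the post is precisely that drives bought together and hammered by a rebuild are not independent, so this is better read as a lower bound than as a prediction.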

jason4207
09-16-2008, 05:27 PM
If you're going to use all 3 as a sort of ad-hoc RAID1 then maybe you should just put 2 in RAID1, and then use the 3rd drive as an external back-up. If you keep it at work or something then even a catastrophe won't destroy your data.

saaya
09-16-2008, 05:59 PM
"im turning everything down to stock clocks and voltage before i start rebuilding :P"
you didn't do that before? :P
the nb shouldn't cause this sort of issue tho, sounds more like a raid controller issue... you're using the intel sb raid right?
does the sb get hot too?
did you increase the sb voltage from stock?

glad to hear its working again now! :toast:

about raid... ive worked with servers and pcs for years and the most likely to fail components are:
40% memory
30% hdd
10% psu
10% mainboard

so yes, hdds are rather likely to fail, but for me it never happened out of nowhere, the drives always showed signs of wearing out/dying. i started to play with raid after the infamous ibm deathstar drives where the replacement of the replacement of the replacement drive started to fail after only 4 months :rolleyes:

my conclusion about raid was:
too much hassle
too expensive
even if a drive fails its annoying and time consuming to restore the image
even with raid you can lose data

so i now keep a backup of all important data on a second external drive, which works pretty well for me and its MUCH less hassle :D
since the drive is external i only connect and use it when i backup more data, so im not running it every day, which means a longer lifetime and makes the drive pretty safe :)

negev
09-17-2008, 01:13 AM
I didn't overvolt the sb, but my nb was pretty hot and it's heatpiped to the sb so the heat would radiate down there..

I think it's more likely that my overclock was slightly unstable. With the fakeraid controller, a single error can cause the controller to think a drive has failed.

As soon as this rebuild is done I am abandoning raid and replicating my data manually.

negev
09-17-2008, 06:35 AM
Update: 42% of rebuild and the box bluescreened with Stop error 00008086

Quick google shows this is related to the ICH9R controller. Luckily it didn't seem to cause any damage and the rebuild is continuing after the reboot, but I am not waiting for the rebuild anymore. I am getting my data off this crazy raid controller now!

saaya
09-17-2008, 03:30 PM
"As soon as this rebuild is done I am abandoning raid and replicating my data manually."
that's a b!tch tho innit? :D
i wish there'd be an automated backup option that does it all on its own...
you just set it to make a backup to your external raid1 array every week and that's it... on tuesday at 2pm it just fires up your machine and starts to make the backup. :)

man that'd be sweet :D

sorry to hear about the issues with the southbridge...
can you pm me your contact details?
Ill hook you up with our UK FAEs so they can send you a replacement board and our engineers can check out those issues.
Please write down the exact BIOS settings you're using just in case they have problems replicating it. oh and please write down the ambient and case temp too, since this might be a temperature issue.
How are you cooling the nb?

negev
09-17-2008, 04:01 PM
Hey Saaya

Thanks - I have PM'd you my contact info.

I will give you all the settings tomorrow after I've finished recovering all my data.. you won't believe what happened. I spent hours copying all the data off of the f2cked raid5 volume (while it was rebuilding) onto spare drives, then rebooted, deleted the raid5 array, installed Windows on one of the drives and then got drunk and accidentally deleted all the partitions on the drives that had all my data on LOL

I'm running a data rescue program now :p when you say the uk FAE will send me a replacement board, will they do that before I send this one back? I'd rather not be without a machine for a while as this is my main computer and has all my data...

saaya
09-17-2008, 04:32 PM
"and then got drunk"
uh oh

"and accidentally deleted all the partitions on the drives that had all my data on LOL"
:doh:

alcohol and backing up data are an even worse combination than overclocking and raid :lol:

"I'm running a data rescue program now :p when you say the uk FAE will send me a replacement board, will they do that before I send this one back? I'd rather not be without a machine for a while as this is my main computer and has all my data..."
im not sure, let me check :)

negev
09-18-2008, 12:19 AM
Cool, do you actually think there is a fault with the southbridge then?

I managed to get all the data back and am now running without raid :D