Big RAID6 array



rkagerer
06-15-2012, 12:42 PM
Hi, I'm pricing out a new 66TB storage system as follows:

- Areca ARC-1882-ix-24 - $1550 w/ 4GB RAM and cables (could go with a cheaper 1880 or 1680 instead)
- 25x Seagate Barracuda XT ST33000651AS 3TB drives, in RAID6 - $229 each
- Norco 4224 case - $430
- SAS cables
- motherboard, CPU, PSU, etc.

Wanted to run it by the community for feedback and whether there are other options at similar price points I should consider. Any comments / suggestions? Any estimates on what rebuild times will be like?

One concern I have is that the Norco backplanes only do SATA 3Gb/s. Will these disks be able to saturate the backplanes? If so, I'd consider ripping them out and just connecting the PSU / controller to the disks directly with SAS-to-SATA breakout cables.

Reason for the Areca card is I already have a couple ARC controllers and like to keep similar equipment laying around for backup / interchange purposes.

I currently have a 20-bay Norco 4220 with an ARC-1880 and 1.5TB Seagate Barracuda plain old 7200rpm consumer grade drives in it. Been okay over the last two years, although a drive "dies" (< 10% health in Hard Disk Sentinel, primarily based on # bad sectors) every few months and write speeds across the network are slower than I'd like. Figured I'd spend the extra few bucks to step up to slightly better drives this time around especially with the 5yr warranty. Want the extra one on hand so I can swap out immediately when needed.

Also anyone know if Hard Disk Sentinel can see behind the newer Areca cards?

stevecs
06-15-2012, 01:30 PM
Well, a couple comments I would make would be:
- If write performance is important, you would want to split that into several smaller RAID-6 sets (partial-stripe write performance is limited to a little under that of a single drive once the writes exceed your BBU cache). Striping across multiple RAID-6 sets (basically RAID 6+0) improves this; see the rough sketch after this list.

- That's a LOT of space on very large drives whose bit error ratings haven't kept up with their capacity (i.e. the chance of hitting a data error is high). Standard RAID like Areca's is for availability, not data integrity, so as long as you're aware of that and either don't care or have other means to check, fine.

- Doubtful that any utilities can see 'behind' the hardware RAID (i.e. expose the raw drives), especially with SAS-based RAID cards like the 16xx and 18xx series. There was a comment that Areca was looking to export some of the ATA commands in firmware v1.51, but no confirmation (this was for smartmontools). The problem is on Areca's/LSI's side: there's no API to send the SCSI or ATA commands through the 'proxy' of the controller (and in some cases the expanders).

- As for compatibility with backplanes, that's always a 'fun' thing to find out. Basically everyone has their own interpretation of the standards, so in essence there isn't really one. You'll have to try it or see if someone else has already done it for you. I went through several myself when I was using Arecas in my main share; I've since moved to a different solution.
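
Here's a rough back-of-envelope sketch (Python) of that first point. The 120 MB/s per-drive write figure is an assumed number for illustration, not a measurement of the drives in this thread:

# Partial-stripe (read-modify-write) writes give you roughly one drive's worth
# of throughput per RAID-6 group, so striping several groups (RAID 6+0) scales
# with the number of groups; full-stripe writes are the best case.
DRIVE_WRITE_MBPS = 120                        # assumed sustained write speed per drive

def partial_stripe_write_mbps(raid6_groups):
    """Approximate throughput when writes never fill a whole stripe."""
    return raid6_groups * DRIVE_WRITE_MBPS    # ~one drive's worth per group

def full_stripe_write_mbps(drives_per_group, raid6_groups):
    """Best case when the controller coalesces full stripe widths."""
    return raid6_groups * (drives_per_group - 2) * DRIVE_WRITE_MBPS

print(partial_stripe_write_mbps(1))           # one wide 24-drive RAID-6: ~120 MB/s
print(partial_stripe_write_mbps(3))           # three RAID-6 sets striped: ~360 MB/s
print(full_stripe_write_mbps(24, 1))          # full-stripe best case: ~2640 MB/s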

rkagerer
06-15-2012, 02:27 PM
Thanks!

I hear you loud and clear about data integrity; most of it would be stored under CrashPlan or similar software that stores extra parity and heals things like bit rot. I wish they just built an option for extra parity into the controllers (is there cheap hardware out there I'm missing which has this?). I also looked at ZFS, but it seemed complicated to get set up (I've dabbled in Linux but am not an expert). ReFS sounds intriguing too.

Dumb question - in plain old RAID-6, why would [sequential] write performance be limited to that of a single drive? I had assumed my throughput was being bottlenecked by the controller's parity calculations.

In fact Hard Disk Sentinel does an awesome job of seeing behind several popular RAID controllers - I've used it with the Areca ARC-1231ML and ARC-1280, and also with several motherboard-embedded RAIDs (Intel and LSI or Marvell, I think) on Asus / MSI boards. I highly recommend it. That said, it doesn't work with everything - I suspect you're correct about SAS, as I haven't gotten it to work yet with the 1880 I just installed. I, too, wait in vain for Areca to make [more] SMART stats available through the CLI.

May I inquire what kind of chassis houses your 104 2TB disks? ;-)

[XC] Oj101
06-15-2012, 03:10 PM
He uses this
http://www.xtremesystems.org/forums/attachment.php?attachmentid=127007&d=1337733359
(Taken from another thread)

stevecs
06-15-2012, 05:04 PM
Yes, ZFS is a bit more complicated, especially if you're not used to *nix-type systems. There is no 'best' solution; everything has issues, and it's a question of whether or not those issues affect your deployment. Checking parity, mainly at READ time, is a big issue; enterprise solutions (SANs) handle this mainly by using 'fat' sectors (520/524/528-byte blocks), where the system writes extra data (a CRC and the LBA number) into each sector to check against things like wild/random writes and reads. No consumer solution does this except ZFS/btrfs, really.

As for /sequential/ writes, the issue is ONLY mitigated if you write a full stripe width at a time. In practice this doesn't happen; you're always writing partial stripe widths. The function of the BBU cache is to try to coalesce a full stripe width where possible, but with large arrays (or pretty much any array) that just doesn't happen effectively. The general rule of thumb is to have 1/1000th of the raw disk capacity as cache RAM: if you have 1TB of disk you should have 1GB of cache; 60TB of disk, 60GB of RAM. Even in my case I have 192GB of RAM (against 224TB of disk), which helps in some cases per file, but then you get hit with other items (mainly ZFS's COW, which is fragmentation heaven, so you get killed anyway). The issue is that ultimately this turns into random I/O and you get the read/write/parity-update penalty (with RAID-5 this is 4 ops, with RAID-6 it's 6 ops).
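
To put rough numbers on those two rules of thumb (the 75 IOPS-per-spindle figure below is just an assumption for illustration):

# Back-of-envelope: the 1/1000th cache rule, and what the RAID-5/RAID-6 write
# penalty does to small random writes on a 24-spindle array.
raw_tb = 60
print(f"cache rule of thumb : ~{raw_tb} GB of cache for {raw_tb} TB of raw disk")

drive_iops = 75                    # assumed small random IOPS for one 7200 rpm drive
spindles = 24
for name, penalty in (("RAID-5", 4), ("RAID-6", 6)):
    host_iops = spindles * drive_iops / penalty
    print(f"random write IOPS, {name} (penalty {penalty} ops): ~{host_iops:.0f}")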

Wrote up a spreadsheet a while ago in the old storage thread here: http://www.xtremesystems.org/forums/showpost.php?p=2728049&postcount=52

As for your HD Sentinal (haven't used it in particular) it works as the 12xx cards are SATA only so they are supporting fully the ATA commands being sent. the others (newer) are using a SAS transport. SATA is encapsulated in SAS which is how they support different drives. (SATA is a different command and even electrical spec but can be encapsulated over SAS's protocol). Anyway, the commands that are being used are ATA commands which are not carried through the firmware.

As for the system here, (thanks Oj101) it's just a wee box. ;) Biggest issue is the slow backup/restore process (really need to get a couple LTO6 drives when they come out).

johnw
06-15-2012, 05:10 PM
Dumb question - in plain old RAID-6, why would [sequential] write performance be limited to that of a single drive? I had assumed my throughput was being bottlenecked by the controller's parity calculations.


Even RAID-6 parity is not very CPU intensive for today's processors.

I'm not sure what he was getting at. In my experience with Areca RAID-6 arrays, the write speed for large sequential writes (i.e., larger than a full stripe width) is in the ballpark of the theoretical max, which for RAID-6 is (N-2) times the speed of a single drive.
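
To put a number on that for the array in the OP (a minimal sketch; the ~150 MB/s per-drive figure and the assumption of 24 drives in the array plus one spare are mine, not measurements):

# RAID-6 full-stripe sequential-write ceiling: (N - 2) x a single drive.
n_drives = 24                      # 24 in the array, one kept on the shelf as a spare
single_drive_mbps = 150            # assumed sustained rate for a 3TB 7200 rpm drive
print(f"RAID-6 sequential write ceiling: ~{(n_drives - 2) * single_drive_mbps} MB/s")  # ~3300 MB/s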

As for system recommendations, you did not mention the purpose of this machine. Is it going to be a file server? What type of files, how many simultaneous users, and what sort of access patterns?

When I see someone spec'ing a system like this for home use, I usually assume that it is for a media file server, streaming blu-ray movies and such. If that is the case, I think distributed-parity RAID is not the best way to go. Snapshot RAID is better for that sort of application. SnapRAID or FlexRAID. That also has the advantage that you can go with HBAs (IBM M1015, LSI 9211, etc.) instead of more expensive hardware RAID cards.

stevecs
06-15-2012, 05:21 PM
RAID-6 is still very CPU- (i.e. calculation-) intensive. I guess it depends on what speeds you are looking at. Trying to push GB/s+ speeds is /very/ intensive, especially, like I mentioned, once the array ages to the point where your data is no longer contiguous (so you can no longer write full stripe widths). By 'CPU' here, just to clarify, I mean whatever calculation engine you're using (an ASIC or whatever). This is also why the work can be split across multiple cards.

Like I said, though, this may not be the same range you're looking at. As a reference, I'm pushing 5GB/s of RAID calculations here, so not generally the same range that others may be thinking of as 'high end'.

johnw
06-15-2012, 05:29 PM
RAID-6 is still very CPU- (i.e. calculation-) intensive. I guess it depends on what speeds you are looking at. Trying to push GB/s+ speeds is /very/ intensive, especially, like I mentioned, once the array ages to the point where your data is no longer contiguous (so you can no longer write full stripe widths). By 'CPU' here, just to clarify, I mean whatever calculation engine you're using (an ASIC or whatever). This is also why the work can be split across multiple cards.

Well, I've never tried anywhere near 5GB/s, but with Linux mdadm doing about 1.5GB/s of RAID-6 writes, it does not stress the CPU at all (<10% of one core). The parity can be calculated almost as fast as the data can be read from RAM. Dual parity takes several more XORs than single parity, but it is still a very easy and quick computation for modern CPUs. I assume the ROCs are even better at it than general-purpose CPUs.
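
A quick NumPy sketch of that point (my own illustration, not the mdadm code path, so treat the number as a rough lower bound):

import time
import numpy as np

size_bytes = 256 * 1024 * 1024                       # 256 MiB per buffer
a = np.random.randint(0, 2**32, size_bytes // 8, dtype=np.uint64)
b = np.random.randint(0, 2**32, size_bytes // 8, dtype=np.uint64)

start = time.perf_counter()
parity = np.bitwise_xor(a, b)                        # single-parity style XOR pass
elapsed = time.perf_counter() - start

print(f"XOR'd {2 * size_bytes / 2**20:.0f} MiB in {elapsed * 1000:.1f} ms "
      f"(~{2 * size_bytes / elapsed / 1e9:.1f} GB/s read)")

Even unoptimized like this it runs at several GB/s on a current desktop CPU, which is why the parity math itself is rarely the bottleneck.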

How have you determined that it is the parity computations that are limiting your speed, rather than the HDDs or the system bus being the limiting factor?

rkagerer
06-15-2012, 05:37 PM
Wow Steve thanks that was really educational!

The system will mainly be used to store backups and media. I might also use it as shared storage for ESXi at some point down the road (or even as local storage - I tested, and the LUNs can be mounted as RDMs). I realize the Areca is overkill, but I need to pick one up anyway to have a backup for the one in my main machine. I've also been working on some tooling for my maintenance stuff to talk to these cards, so the effort can be reused there. 7200 RPM may be overkill too compared to lower-RPM/lower-power drives, but I don't mind paying a bit of a premium to get bulk operations done a little quicker when I'm mucking with it.

I'll check out SnapRAID and FlexRAID. I did look at stuff like UnRAID, Gluster, etc.

Thanks,

-Richard

Anvil
06-16-2012, 02:25 AM
Hard Disk Sentinel reads through the LSI 9265, a great tool!
(still on 3.7 and I noticed that V4 is available)

I'll try on the 1880 later.

stevecs
06-16-2012, 02:37 AM
@rkagerer no problem. As for drives (assuming you haven't bought them yet), one of the biggest items I run into here is drives that do NOT support time-limited error recovery (or early error recovery). This is only an issue when the drives are not working correctly, but by the time that happens it's too late and you have the problem. Unfortunately, low-cost drives usually don't support this option. Some Hitachis do, mainly the ones rated for 24x7 (8760 hours/year) use, as well as nearline and full enterprise drives. The best thing is either to find someone who has posted the SMART page data so you can see whether it's supported, or buy one yourself and check. When you start getting into large arrays this becomes more and more critical. The ones I picked up here (ST2000DL03's) suck and I wouldn't buy them again for large arrays (I picked them for power savings), and I've wasted months of my time working around their finicky nature.

@johnw - Well, first there's the theoretical max based on the memory architecture (~31992MB/s per CPU, so ~63984MB/s, divided by 4 since there are generally 4 memory copy/move operations to get I/O into and out of the CPU, so ~16000MB/s). I wouldn't expect to really hit that, but it gives a high-water mark. Then use the xor.c test from Linux's crypto environment (what the md driver uses at kernel load to find the fastest implementation). On mine this comes up to about 14126MB/s, so not bad (88% of what I would expect for a CPU-only figure).

Then it's a matter of figuring out how data gets through the IOH and the cards to the CPU. Without analysers this is kind of 'artful', so I'm mainly doing the same as above: testing 1, 2, 4, 8, 16, ... drives at a time off a single controller and finding the max of each until I hit a plateau, then adding another card and doing the same on both of them (making sure they're on different IOHs) to see whether I'm hitting a card or IOH limit, rinse and repeat until there's no further improvement. At the same time I'm watching service times (how long commands take to execute), average waits, queue depth, request size, etc., and making sure I'm not hitting a drive or driver saturation level (outstanding commands, or any wait states). Also noting that SATA drives tend to 'slow down' and cause other issues when pushed much beyond 40% utilization (SAS/FC can generally be pushed to 80%), which basically means throwing more drives into the mix to keep utilization down. Watching IRQ loading, etc. Rinse and repeat ad nauseam.

Currently, when testing, I'm hitting 95% utilization on all cores (with or without hyperthreading, though I have noticed a /slight/ improvement (<10%) with hyperthreading enabled) while the drives show zero waiting and low utilization (granted, this is with 128QD and 112 drives, so ~1 command per drive when equally balanced). Going beyond 128 (256QD) shows no improvement, which is understood, as anywhere past ~90QD the CPUs were just getting killed. I would love to get a dual LGA2011 system, as that removes the IOHs from the picture for one thing, and gives better cross-CPU bandwidth (dual QPI links) at faster clocks, and better memory. But I haven't won the lottery yet. :(
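
That first bit of arithmetic, written out as a quick calc (the two-CPU count is implied by the ~63984MB/s figure above):

mem_bw_per_cpu_mbs = 31992
cpus = 2
copy_ops_per_io = 4                          # ~4 memory copy/move ops per I/O
ceiling_mbs = mem_bw_per_cpu_mbs * cpus / copy_ops_per_io

measured_xor_mbs = 14126                     # from the kernel's xor.c self-test
print(f"estimated ceiling: {ceiling_mbs:.0f} MB/s")          # ~15996 MB/s
print(f"measured xor.c   : {measured_xor_mbs} MB/s "
      f"({measured_xor_mbs / ceiling_mbs:.0%} of ceiling)")  # ~88%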

@anvil - are you passing SCSI commands to that LSI 9265, or using the ATA command set for drive data?

johnw
06-16-2012, 07:26 AM
Currently, when testing, I'm hitting 95% utilization on all cores (with or without hyperthreading, though I have noticed a /slight/ improvement (<10%) with hyperthreading enabled) while the drives show zero waiting and low utilization (granted, this is with 128QD and 112 drives, so ~1 command per drive when equally balanced). Going beyond 128 (256QD) shows no improvement, which is understood, as anywhere past ~90QD the CPUs were just getting killed.

What test (i.e., program and workload) were you running? And what were you using to measure CPU utilization?

stevecs
06-17-2012, 04:59 AM
@johnw - The real-world tests are custom programs and data scrubs where I'm going through the file system to modify TBs of large files; for synthetics I'm mainly using xdd, iozone, and similar. As for CPU, pretty standard tools: mpstat, sar, top, etc. If I'm really interested in digging I would use /proc/stat and /proc/<pid>/stat and calculate out the averages for a process, but I'm mainly focusing at the macro level for now.

@rkagerer - I don't know if HA is a huge issue for you or not, but one reason I have so many chassis is to avoid a backplane failure causing an outage (or, with cheap drives, one going bad and causing an issue on that particular expander). Basically, if a drive starts to screw up it can cause issues with anything off the same expander (command timeouts, hangs, etc). That's bad any way around, but with multiple chassis you can make sure there is enough redundancy that if you lose one (it times out), you don't lose your dataset. Now, this costs $$, so it may not be justifiable for a home user's needs.

johnw
06-17-2012, 08:20 AM
@johnw - The real-world tests are custom programs and data scrubs where I'm going through the file system to modify TBs of large files; for synthetics I'm mainly using xdd, iozone, and similar. As for CPU, pretty standard tools: mpstat, sar, top, etc. If I'm really interested in digging I would use /proc/stat and /proc/<pid>/stat and calculate out the averages for a process, but I'm mainly focusing at the macro level for now.


That's rather vague. I'm quite skeptical of your claim that it's the RAID-6 parity computations that are using up your CPU; I suspect it is primarily something else.

Parity computations (a few XOR instructions for the CPU) just do not require a lot of CPU time.

stevecs
06-17-2012, 09:57 AM
;) So was your question. Calculating CPU time is a hard job, but from your response it sounds like you were looking for a specific loop time? (If so, xor.c is what you're looking for.) If you're looking for other in-situ data, then see the other responses. Anyway, not to derail this thread any further than I already have (sorry rkagerer), we can continue this elsewhere.

josh1980
06-18-2012, 09:02 PM
I've recently put together a FreeNAS server. The good thing is that RAID controller performance isn't very important, since you don't use hardware RAID. I have 16 2TB drives in RAID-Z2 across 2 VDEVs, and I can max out 2 gigabit NICs with reads and writes (120MB/s+ on each NIC simultaneously).
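
A quick sanity check on that layout, assuming the two vdevs are 8 drives each (my assumption) and that the gigabit links are the ceiling:

drives, drive_tb, vdevs, parity_per_vdev = 16, 2, 2, 2
usable_tb = (drives // vdevs - parity_per_vdev) * drive_tb * vdevs
print(f"usable capacity: ~{usable_tb} TB before ZFS overhead")     # ~24 TB

nics, wire_mb_s = 2, 125                   # 1 GbE is ~125 MB/s on the wire
print(f"2x GbE ceiling : ~{nics * wire_mb_s} MB/s")                # ~250 MB/s

So 120MB/s+ per NIC means the pool is essentially network-limited, not disk-limited.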

I'll probably never go back now that I've used FreeNAS. It's so awesome! I've used Windows Server 2003 and 2008 R2 and never gotten the performance that I have from FreeNAS.

rkagerer
06-18-2012, 10:48 PM
Josh, thanks for the FreeNAS tip - you've caught my attention! Can't believe I didn't stumble on it before (or more likely stumbled on it but dismissed it prematurely). It sounds like exactly what I need to make ZFS a bit more idiot-friendly and I'll do some more reading on it. Any downsides/issues you've encountered? Any tips? (I'm happy to take this offline or to another thread)

mobilenvidia
06-23-2012, 07:34 PM
Did some benchies ages ago with my IBM M5015 (LSI9260-8i) with 512MB cache and BBU, using 4x OCZ Solid 3 SSDs.
The LSI9260 is old hat now; an LSI9265 will do better still.

RAID6 is not all that much slower than RAID5; on spinning drives you may not even notice it.
And no host CPU time is involved at all.


[Benchmark screenshots attached: RAID0 (attachment 127807), RAID5 (attachment 127808), RAID6 (attachment 127809)]

If you are thinking about ZFS, you'll want to stay away from RAID controllers and get yourself a nice LSI HBA, e.g. the LSI9202-16i (dual SAS2008), now on eBay for $399; get 2 so you can control up to 32 drives.
The LSI9202 is PCIe x16, so beware.

My 2c worth

josh1980
06-25-2012, 08:13 PM
Josh, thanks for the FreeNAS tip - you've caught my attention! Can't believe I didn't stumble on it before (or more likely stumbled on it but dismissed it prematurely). It sounds like exactly what I need to make ZFS a bit more idiot-friendly and I'll do some more reading on it. Any downsides/issues you've encountered? Any tips? (I'm happy to take this offline or to another thread)

Sorry, I haven't checked this forum in about a week. :P I'll send you a PM. There's a lot to FreeNAS and I'll help you out as much as I can.

odditory
06-30-2012, 12:19 PM
One concern I have is that the Norco backplanes only do SATA 3Gb/s.

Huh?? Where did you read that? You do realize Norco's backplane is just a passthrough; there are no active electronics on there doing link negotiation, any more than a plain old SATA cable does. Thus 6G SAS-2 and SATA III devices work just fine at 6G, assuming you're connecting to a 6G controller.

I'd also recommend against a 24-port Areca; instead, get an 8-port card and one or more expanders. Don't get me wrong, they're great cards - I have sworn by Areca since their inception and have owned many of their 24-port cards over the years - but the key is interchangeability: by decoupling port count from the controller you can upgrade controllers much more easily as newer ones become available. Granted, Arecas have good resale value, but you'll lose less money selling an 8-port card secondhand than you will a 24-port.

Example: Areca 1880i + HP SAS expander = 32 ports for less than $750, or an 1882i for a little more.

odditory
06-30-2012, 12:41 PM
By the way, you never indicated the usage scenario in the OP. There are many reasons for and against striped arrays depending on that. For example, if you're just storing Blu-ray rips and media, then you may be better off just going JBOD and using a non-striping parity system like FlexRAID or one of the others. Striping introduces unnecessary risk when you're just storing media. Striped RAID = a performance and availability multiplier, plus disk pooling as a side effect; that's about it.

johnw
06-30-2012, 01:55 PM
By the way, you never indicated the usage scenario in the OP.

You should have read past the OP. This one, for example: http://www.xtremesystems.org/forums/showthread.php?281543-Big-RAID6-array&p=5111930&viewfull=1#post5111930

Highendtoys
06-30-2012, 03:38 PM
And I thought my QNAP TS-1279 with 12 4TB drives was something.... /hangs head in shame :)

alpha754293
07-03-2012, 11:00 AM
Sorry I'm late to the party.

Yes, your RAID array will saturate the backplane. BUT the backplane also gives you hot-swap capability, which you can't do nearly as easily with direct connections.

I've used ZFS before to manage an 8 TB array (16x 500 GB a few years ago). Keep tape backups. That's all I'm really going to say about that. (Well...okay...that and it actually ISN'T very complicated at all, but you can do a number of rather interesting things with it if you pair it up with DTrace.)

Btrfs is not officially released yet; it's still hovering between the alpha and beta phases, although I think you can get it in the latest openSUSE or SUSE Enterprise. The biggest problem (I've found) with ZFS/Btrfs is that there's no data recovery tool. See, if an NTFS array dies, you can always just run an NTFS drive scanner to recover your files/data. With ZFS/Btrfs, no such tools exist, which means you HAVE to have backups for recovery. (Which is fine if you're running a company or have enough money to afford the LTO drive and tapes. I ran the calculation once: it's actually cheaper to have a second live, hard-drive-based system for "backup" than it would be for me to get LTO3/4 for > 8 TB of data.)

(I'm sitting on 30 TB raw right now, 27 TB RAID5, single NTFS volume.)

I think your RAID bandwidth isn't necessarily going to be nearly as important as your network bandwidth. It's going to depend on how many users are connected to the system, how many "channels" they're accessing it through, and how the data is being placed/generated on the array.

Obviously, if you've got a lot of users, and the poor array only has a single GbE connection, it's going to be limited anyways.

But if you've got 25 SATA 6Gbps drives, you're looking at a theoretical peak of 150 Gbps, which would be enough for five IB QDR connections, all maxed out.
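
The arithmetic behind that figure, counting interface speed rather than sustained media speed (the ~32 Gbps of usable data per QDR link after 8b/10b encoding is the assumption behind "maxed out"):

drives, sata_gbps = 25, 6
aggregate_gbps = drives * sata_gbps                  # 150 Gbps at interface speed
qdr_data_gbps = 40 * 8 / 10                          # QDR: 40 Gbps signalling, 8b/10b encoded
print(f"aggregate interface bandwidth: {aggregate_gbps} Gbps")
print(f"QDR links needed to carry it : {aggregate_gbps / qdr_data_gbps:.1f}")   # ~4.7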

Conversely, if all of the data is auto-generated on the system (and users/clients are mostly read-only), then it'll be as fast as the system can generate/write it.

@stevecs -- there's a LOT of really good information here. But I remember from years ago, when I was getting ready to build the 8 TB array, that it certainly gets easier when you have a suitable and appropriate budget for such builds.

But when you're limited by budget while the requirements are only slightly diminished, accomplishing the same (or a very similar) thing for less is that much harder.

It's the same reason I've reverted back to an NTFS array: after having used ZFS (albeit NOWHERE near its fullest potential) and tinkered with Btrfs for a few minutes, and without tape backups, NTFS became the winner because of my ability to scan drives and arrays in the event that I need to recover data. (Still, nothing beats tape for data security/integrity. But like I said, tape can be quite expensive, especially if speed becomes an issue: backing up 27 TB at 40 MB/s will take more than a week, which means if you're going to do a weekly backup, it'll be backing up indefinitely.)
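
The backup-window math from that parenthetical, spelled out:

tb = 27
rate_mb_s = 40
days = tb * 1e6 / rate_mb_s / 86400        # 27 TB = 27e6 MB; 86400 s per day
print(f"full backup at {rate_mb_s} MB/s: ~{days:.1f} days")    # ~7.8 days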

The reality is that for large storage arrays, a number of factors come into play. Your conventional "home" user likely won't have to do a full-blown systems-level analysis; but if you've got 150 users and a 42U rack full of drives, everything matters.

Size and distribution of files, number of users, network connectivity, and the RAID, backplane, HBA, and host system details/specifics.