
Thread: Best controller for a system with eight SSDs?

  1. #1
    Xtreme Member
    Join Date
    Aug 2009
    Location
    Nelson, New Zealand
    Posts
    367

    Best controller for a system with eight SSDs?

    I'm in the process of putting a system together that will have eight SSDs (probably the X25-E). I'm trying to maximize performance, which in my case particularly means small-block I/O. I'm also planning to use parity RAID (probably RAID-5), since reliability is important for this application. The OS will be Win 7 x64. The SSDs will hold Windows, applications and current user data, with separate magnetic media for bulk storage.

    What's the best way to do this? The 1231ML seems to max out perf with about 4 drives (at around 800MB/s). So I could use two boards with 4 drives each -- but they would have to be two logical drives, since I can't boot from them in a software RAID setup. It also costs two drives for parity this way.

    Another option would be a PCIe 2.0 board that supports 8 drives -- for example, the 9260-8i. It's x8 like the 1231ML, but doesn't support native SATA, so latency would be a bit higher, although peak throughput should also be better. I only lose one drive to parity, and only have to buy one controller instead of two. I don't mind paying for an extra controller, but only if there's a clear benefit.

    I also have ICH10R available on the motherboard, although that only supports 6 drives (through an x4 bridge), and doesn't provide any caching.

  2. #2
    SLC
    Join Date
    Oct 2004
    Location
    Ottawa, Canada
    Posts
    2,795
    What is this for? The 2x 1231 setup would be fastest in the majority of cases, but I'm not sure how cost-effective that would be.
    Last edited by One_Hertz; 01-25-2010 at 07:03 PM.

  3. #3
    Xtreme Guru
    Join Date
    Aug 2009
    Location
    Wichita, Ks
    Posts
    3,887
    The 1231 is SATA 3Gb/s. Investing money in a 3Gb/s card right now makes no sense -- by this time next year all the new drives will be 6Gb/s. You want your controller to be forward compatible, right? The difference between the 1231 and the 9260-8i does not even begin to justify the cost of the 1231, DOUBLY so considering it's a last-gen card.
    "Lurking" Since 1977


    Jesus Saves, God Backs-Up
    *I come to the news section to ban people, not read complaints.*-[XC]Gomeler
    Don't believe Squish, his hardware does control him!

  4. #4
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Depends on your use of the subsystem. Max throughput is not the only consideration; the OP is talking about small-block I/O and X25-Es. With an 80/20 read/write mix, that works out to roughly 26,000-29,000 IOPS per drive at 4KiB. Controllers generally cap out around 200,000-300,000 IOPS, so putting all 8 drives on one controller would come close to saturating its capabilities (not to mention that SATA is uni-directional in its data path). With that in mind, it's arguably better to get two cheaper controllers now than to wait for a new controller and still hit the same IOPS bottleneck. Spread the load. It really comes down to what your design goal is.
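    To make that arithmetic concrete, here is a minimal sketch of the 80/20 blend, assuming the X25-E datasheet figures of roughly 35,000 random read and 3,300 random write IOPS at 4KiB (a simple weighted average; a harmonic blend would come out lower):

        # Rough 80/20 read/write blend for a single X25-E at 4 KiB requests.
        # Assumed datasheet figures: 35,000 read IOPS, 3,300 write IOPS.
        read_iops, write_iops = 35_000, 3_300
        blended = 0.8 * read_iops + 0.2 * write_iops      # simple weighted blend
        throughput_mib_s = blended * 4 / 1024             # 4 KiB per request
        print(f"~{blended:,.0f} IOPS/drive, ~{throughput_mib_s:.0f} MiB/s/drive")
        # -> ~28,660 IOPS/drive, within the 26,000-29,000 range quoted above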


  5. #5
    Xtreme Member
    Join Date
    Aug 2009
    Location
    Nelson, New Zealand
    Posts
    367
    Quote Originally Posted by One_Hertz View Post
    What is this for?
    This will be a high-end combined development / desktop and server (database) machine. Used for gaming too, but only rarely.

    Quote Originally Posted by Computurd View Post
    The 1231 is SATA 3Gb/s. Investing money in a 3Gb/s card right now makes no sense -- by this time next year all the new drives will be 6Gb/s. You want your controller to be forward compatible, right? The difference between the 1231 and the 9260-8i does not even begin to justify the cost of the 1231, DOUBLY so considering it's a last-gen card.
    Sure, upward compatible is nice, provided it doesn't cost too much (in either dollars or performance). I guess the first 6gbps SSD should be out next month (the Crucial C300), at about the same price point as the X25-E. I can probably wait a month to order the parts, but not much more.

    Quote Originally Posted by stevecs View Post
    Depends on your use of the subsystem. Max throughput is not the only consideration; the OP is talking about small-block I/O and X25-Es. With an 80/20 read/write mix, that works out to roughly 26,000-29,000 IOPS per drive at 4KiB. Controllers generally cap out around 200,000-300,000 IOPS, so putting all 8 drives on one controller would come close to saturating its capabilities (not to mention that SATA is uni-directional in its data path). With that in mind, it's arguably better to get two cheaper controllers now than to wait for a new controller and still hit the same IOPS bottleneck. Spread the load. It really comes down to what your design goal is.
    The problem with multiple controllers is that many of my applications are single-threaded, so spreading the load often isn't a good way to speed them up. However, my motherboard has lots of PCIe lanes; the more of them I can actively engage, the better.

    My design goals are (in order):

    1. Maximize reliability
    2. Minimize latency
    3. Maximize random-ish small block reads
    4. Maximize random-ish small block writes
    5. Maximize sequential large block reads
    6. Maximize sequential large block writes
    7. Good gaming support
    8. Minimize cost

    I'm leaning toward dual Xeons, 48GB RAM, appx 10TB magnetic storage in RAID-6 (controller still TBD; maybe the 1680ix?).
    Last edited by AceNZ; 01-25-2010 at 09:13 PM.

  6. #6
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    If your queue depths are going to be small, then there's not much you can really do, as you're going to be limited by the speed of a single SSD or single device. If that's what you're doing, then you may want to look at RAM-based solutions (ACard, Fusion-io, et al.): put your OS on an SSD, then move what you're working on over to the RAM device and use that. Otherwise, change your process to increase concurrency/queue depth. Multiple controllers are not an issue for high performance (they are actually the solution): you spread your disks across them and then use the OS to create a striped volume (dynamic disks, or, if you have the money, VxFS from Veritas on Windows).

    Thing is, your list items (random small-block I/O and sequential large-block) are the antithesis of each other. If you try to do both, you'll not do either well. For large storage (you mention RAID-6 & 10TB), for max IOPS you want to create multi-layered RAIDs: the more RAID-6s you have striped together, the better the overall performance, as it increases your write throughput, which would otherwise be killed by the 6-for-1 penalty (3 reads + 3 writes for every write of less than a full stripe width). I would recommend small drives (146/300GB) bunched into several 6D+2P RAID groups that you merge into one volume space (assuming your goals apply to that as well). This also goes with multiple controllers, so you could create multiple tiers: say two controllers with 4 SSDs each in RAID-10 for availability, plus say 64 15K 300GB drives in 8 RAID-6s, which would give ~7,500 read IOPS or ~800 write IOPS and ~13TB of space. Or you can RAID-10 those as well and get ~9TB of space but ~10,000 read / 4,600 write IOPS.
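    Here is a rough sketch of that trade-off, using a deliberately conservative rule (random writes on a parity group limited to roughly one member drive's write rate); the per-drive figures of ~150 read / ~100 write IOPS for a 15K drive are assumptions for illustration only:

        # Read/write trade-off when splitting the same 64 drives into more RAID-6 groups.
        # Assumed figures: ~150 read / ~100 write IOPS per 15K drive; random writes per
        # parity group limited to about one drive's worth (a conservative rule of thumb).
        READ_PER_DRIVE, WRITE_PER_DRIVE = 150, 100

        def striped_raid6(groups, drives_per_group, parity_per_group=2):
            data_drives = groups * (drives_per_group - parity_per_group)
            return data_drives * READ_PER_DRIVE, groups * WRITE_PER_DRIVE

        for groups in (2, 4, 8):                       # 64 drives, split ever more finely
            reads, writes = striped_raid6(groups, 64 // groups)
            print(f"{groups} x RAID-6: ~{reads:,} read / ~{writes:,} write IOPS")
        # -> the 8-group case lands near the ~7,500 / ~800 figures quoted above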

    What is your budget, and what base level of performance are you looking for? (Since you mentioned DB work, that's going to be pretty random access.)


  7. #7
    Xtreme Member
    Join Date
    Aug 2009
    Location
    Nelson, New Zealand
    Posts
    367
    Quote Originally Posted by stevecs View Post
    If your queue depths are going to be small, then there's not much you can really do, as you're going to be limited by the speed of a single SSD or single device.
    My workload is a mix. Sometimes, my queue depths are low -- which is why minimizing latency is near the top of my list. But at other times it might get up to 20 or so, depending on what I'm doing. I would expect queue depths to drop with a much faster disk subsystem.

    Based on some monitoring I've done, I believe that one of the main areas I need to optimize is OS-oriented functions: things like paging, application start-up, DLL loading, processing temporary files, I/O for services, anti-virus, etc. Another focus is everything related to Visual Studio, which also loads lots of DLLs/plug-ins, plus running the compiler, handling temp files, etc. SQL Server and IIS are third priority -- although SQL Server is easy in some ways because it's completely configurable (it will have a dedicated log drive, for example).

    Quote Originally Posted by stevecs View Post
    If that's what you're doing, then you may want to look at RAM-based solutions (ACard, Fusion-io, et al.): put your OS on an SSD, then move what you're working on over to the RAM device and use that.
    Since RAM devices lose everything if they lose power, it would mean frequently copying everything back-and-forth to more durable storage, which isn't workable in my environment.

    The Fusion IoDrive is a possibility, although at about $3300 for 80GB the cost is about 3x compared to the X25-E on a per GB basis, if you include the controllers (and the -E is already outrageous). I'm hoping there's a better way.

    Quote Originally Posted by stevecs View Post
    Otherwise, change your process to increase concurrency/queue depth. Multiple controllers are not an issue for high performance (they are actually the solution): you spread your disks across them and then use the OS to create a striped volume (dynamic disks, or, if you have the money, VxFS from Veritas on Windows).
    The issue there is, AFAIK, software striped volumes aren't bootable -- so that would help with my application stuff, but not with the OS stuff. With only 8 drives and a requirement for redundancy, does splitting them onto three controllers (one for OS and two for apps) really make sense?

    Quote Originally Posted by stevecs View Post
    Thing is, your list items (random small-block I/O and sequential large-block) are the antithesis of each other. If you try to do both, you'll not do either well.
    Which is why they're prioritized. Small block is more important for the SSD subsystem. If that means completely giving up large block perf, that's fine.

    Quote Originally Posted by stevecs View Post
    For large storage (you mention RAID-6 & 10TB), for max IOPS you want to create multi-layered RAIDs: the more RAID-6s you have striped together, the better the overall performance, as it increases your write throughput, which would otherwise be killed by the 6-for-1 penalty (3 reads + 3 writes for every write of less than a full stripe width). I would recommend small drives (146/300GB) bunched into several 6D+2P RAID groups that you merge into one volume space (assuming your goals apply to that as well).
    The goals don't apply to the large storage subsystem.

    The RAID-6 array will primarily be used for secondary storage and archive, where performance isn't as critical. Actually, large block perf is probably more important there than small block, and latency can be higher. Reliability is really the main metric there (I may go with RAID-60; that part is still TBD).

    Quote Originally Posted by stevecs View Post
    What is your budget, and what base level of performance are you looking for? (Since you mentioned DB work, that's going to be pretty random access.)
    I currently have a rough budget for the SSD side of things of appx $400 each times eight drives, plus $750 each for two controllers, so $5K or so. On the magnetic side, appx 13 1TB drives at $150 each, plus $1K for the controller, so about $3K or a little more if you include chassis components (this needs to all fit in a single tower, so it's a bit of a squeeze). Tentative config for the magnetic drives: 1 hot spare, 2 in RAID-1 for DB log, remainder as either 8+2 RAID-6 or possibly as two 3+2 RAID-60 (with two or maybe three partitions, including one short-stroked for best perf).

    The X25-E is speced at 35,000 IOPS for random 4K reads. I'm hoping with a RAID-5 array that I might hit close to 200,000 IOPS. SQL Server actually reads and writes 8K pages and 64KB extents, but I would still be inclined to push the optimization toward the smaller side.

    IIRC, the X25-E is internally striped in 4K blocks -- which means a 4K strip size would be a reasonable minimum. In a 7+1 RAID-5 config, that would be a 28K stripe. Possible alternatives: a 4+1 RAID-5 (16K stripe) and a 2+1 RAID-5 (8K stripe), or two 2+1 RAID-5 and one RAID-1 mirror.
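    For reference, a quick sketch of the strip-to-stripe arithmetic for those candidate layouts, assuming the 4K strip size mentioned above:

        # Full stripe width = strip size x number of data drives (parity excluded).
        # Assumes the 4K strip discussed above; layouts are the candidates listed.
        STRIP_KB = 4
        for name, data_drives in {"7+1 RAID-5": 7, "4+1 RAID-5": 4, "2+1 RAID-5": 2}.items():
            print(f"{name}: {data_drives * STRIP_KB}K full stripe")
        # -> 28K, 16K and 8K full stripes, matching the figures above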
    Last edited by AceNZ; 01-26-2010 at 05:06 AM.

  8. #8
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    OK, what you're describing really needs multiple RAID sets; don't create a 'catch-all' from all the drives, as you don't have enough of them to handle the varying types of loads. From what you're describing, I can see the following raidsets: OS, application, scratch, database, DB transaction log, games, long-term storage. Your QD will be large on only a couple of items (probably the scratch and database ones) and small on the others. I am of the opinion that a pagefile is useless if you have the money for RAM (in over 30 years I have yet to find an application that -needed- a pagefile/swapfile; not saying they don't exist, but I haven't run across one). Dropping it would dramatically cut down on the load on your finite IOPS and bus bandwidth -- keep it in memory.

    As for striped volumes and booting, that's solved by using a RAID-1 or 10 for your OS raidset. You use other raidsets for the other functions, which can be striped or not. Splitting across multiple controllers does make sense IF you are running into a controller, bus, or drive bottleneck. Remember that the Arecas (and most other cards with a similar architecture, Adaptec et al.) have a 256-command limit per card. Even if you /CAN/ get 32 commands per SATA drive (unlike the 128-256 per SAS drive), that's only 8 drives, and if you reach that limit you don't give the card any time to re-order and forward commands, so your service time to the array will increase. Between this and the IOPS limits, it's generally better to keep subsystems below their spiral point; for SATA my rule of thumb is below 40% utilization.

    The X25-E is rated at 35,000 read IOPS at 4KiB but only 3,300 write IOPS at 4KiB, so at 80/20 that would be ~27,000 IOPS or ~108MiB/s per drive. With a RAID-5 array you will be lucky to get 3,300-4,000 write IOPS -- remember, random writes with parity RAIDs are limited to roughly a single drive's worth. You need a multi-level RAID to increase them (RAID-50 or 10, et al.). The cache on your controller will mask this (as it's buffered), but you will never have enough cache for the entire subsystem (if you did, why have the subsystem?). I would leave the parity RAIDs for your long-term storage, games, and application partitions. Scratch would probably be a stripe (RAID-0), assuming it really IS scratch and you want it for speed. OS would be RAID-1 or 10, as would your DB/transaction log partitions.

    Since you are looking at a small budget and the constraints of a single tower, you have to make concessions. If you only have 13 1TB drives and 8 SSDs total, I would probably do the following:

    (This is the most even load balance, assuming you want to push really high QDs.) Two SSDs (one per controller) merged into a mirror at the OS level (dynamic disk) for your OS. Split the remaining 6 SSDs across the controllers into two RAID-0s and mirror them at the OS level (RAID 0+1). The 13 drives (I would really try to get SAS drives here, as SAS is bi-directional unlike SATA and has much greater command queue depth/reordering) split into two RAID-5s of 6 drives each, one per controller; set up a RAID-0 in the OS across the two LUNs (RAID 5+0). The last drive would be a cold spare in case one of the 12 fails. This would give you ~9TiB of space and about 790 read / 152 write IOPS (assuming 78/73 IOPS for a single 7200rpm SATA drive).
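    A rough sketch of how those figures fall out, using the 78 read / 73 write IOPS per 7200rpm drive assumed above and the same conservative one-drive-per-parity-group rule for random writes:

        # Rough estimate for two 6-drive RAID-5s striped at the OS (RAID 5+0).
        # Assumed per-drive figures from above: 78 read / 73 write IOPS, 1TB drives.
        groups, drives_per_group = 2, 6
        per_drive_read, per_drive_write = 78, 73

        data_drives = groups * (drives_per_group - 1)    # one drive's worth of parity per group
        read_iops = data_drives * per_drive_read         # ~780, close to the ~790 above
        write_iops = groups * per_drive_write            # ~146, close to the ~152 above
        space_tib = data_drives * 1e12 / 2**40           # ten 1TB data drives -> ~9.1 TiB
        print(read_iops, write_iops, round(space_tib, 1))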

    Now, if load balancing weren't a main issue (i.e. not high loads) and you wanted to offload any dynamic-disk functions from Windows, I would probably put the RAID-5s (still a RAID-50 of 12 drives) on one controller along with the two SSDs in RAID-1 for your OS, and the other controller would have the 6 SSDs in RAID-10; bump the cache on both cards to 2GiB to help write the dirty data out.

    Then from those luns above create partitions for each of your file systems (i.e. create a file system for each function above db, trans log, app, games, et al). This allows you to have a different cluster table for each so you don't create a bottleneck there as it's single threaded per file system.


  9. #9
    Xtreme Addict
    Join Date
    Nov 2003
    Location
    NYC
    Posts
    1,592
    Quote Originally Posted by AceNZ View Post
    Since RAM devices lose everything if they lose power, it would mean frequently copying everything back-and-forth to more durable storage, which isn't workable in my environment.
    It would be a matter of setting up your backups, which you have to do anyway when implementing a server. And since it's a server, it's assumed the device only gets turned off or restarted for planned maintenance.

    a couple ideas:

    - VSS enabled on the RAM disk, checking hourly, with the storage used for VSS set to normal disks or SSDs (this wouldn't work well for DBs).

    - ntbackup full daily, differential hourly of the RAM disk; the differentials would overwrite, so you'd only have the most recent diff plus the full. Then back up again to your tape/CD/DVD or whatever for archival (this would work for DBs: just set up backups for the DB and transaction logs).

    RAM disks give you the IOPS and MB/sec to do what you want; consider them if they can be made to function in your environment.

    Why not, after all? At worst you'd be out about 1 hour of work while getting the benefits of a RAM disk, and most smaller companies can tolerate a day of data loss for some services (and much less for others).
    Last edited by Levish; 01-26-2010 at 09:55 AM.

  10. #10
    Xtreme Member
    Join Date
    Aug 2009
    Location
    Nelson, New Zealand
    Posts
    367
    Quote Originally Posted by stevecs View Post
    I am of the opinion that a pagefile is useless if you have the money for RAM
    Good point.

    Quote Originally Posted by stevecs View Post
    As for striped volumes and booting, that's solved by using a raid-1 or 10 for your OS raidset.
    So, an array that's mirrored across multiple controllers is bootable? Does it just boot from half of the array?

    Quote Originally Posted by stevecs View Post
    Two SSDs (one per controller) merged into a mirror at the OS level (dynamic disk) for your OS. Split the remaining 6 SSDs across the controllers into two RAID-0s and mirror them at the OS level (RAID 0+1).
    If I need to give up half of the SSDs for data reliability, then I will probably need to move to the X25-M instead of the -E, in order to have enough capacity. Of course, that costs both in terms of write performance and device life / MTBF. At least the hardware is a bit cheaper (80GB version).

    This also raises the question of whether I should wait a month (maybe more) for the upcoming 6gbps SSDs, such as the C300.

    Quote Originally Posted by stevecs View Post
    The 13 drives (I would really try to get SAS drives here, as SAS is bi-directional unlike SATA and has much greater command queue depth/reordering) split into two RAID-5s of 6 drives each, one per controller; set up a RAID-0 in the OS across the two LUNs (RAID 5+0). The last drive would be a cold spare in case one of the 12 fails. This would give you ~9TiB of space and about 790 read / 152 write IOPS (assuming 78/73 IOPS for a single 7200rpm SATA drive).
    The only affordable SAS option I'm aware of is the Seagate ES.2. They are about a 50% price premium per GB compared to the WD RE3. To stay in my budget, I would therefore have to decrease the number of drives from 13 to 9 or so. Is the improvement from bi-directionality and NCQ really worth that much?

    Do you know if controller latency is any better with the SAS drives? The SAS controllers usually emulate SATA in software, which can add to latency (hence the advantage of the 1231ML, since it's a native SATA controller). Although, I vaguely recall that the ES.2 does a SAS to SATA conversion on the drive side, so maybe it's a wash?

    I would love to use 10^16 BER / 1.6M hour MTBF SAS drives (the 2.5-inch versions would be ideal), but the 6 to 8x higher cost per GB makes them prohibitive for this application.

    Quote Originally Posted by stevecs View Post
    Now, if load balancing weren't a main issue (i.e. not high loads) and you wanted to offload any dynamic-disk functions from Windows, I would probably put the RAID-5s (still a RAID-50 of 12 drives) on one controller along with the two SSDs in RAID-1 for your OS, and the other controller would have the 6 SSDs in RAID-10; bump the cache on both cards to 2GiB to help write the dirty data out.
    Would you still use RAID-50 with 10^15 BER drives like the RE3? Or would RAID-60 be better?

    OK, so now we can get back to the question in the OP: which controllers would be best here?

    Quote Originally Posted by stevecs View Post
    Then from those luns above create partitions for each of your file systems (i.e. create a file system for each function above db, trans log, app, games, et al). This allows you to have a different cluster table for each so you don't create a bottleneck there as it's single threaded per file system.
    Sounds right.

    Quote Originally Posted by Levish View Post
    It would be a matter of setting up your backups, which you have to do anyway when implementing a server. And since it's a server, it's assumed the device only gets turned off or restarted for planned maintenance.
    It's not a server. This will be a desktop machine, running Win 7 Ultimate.

    Quote Originally Posted by Levish View Post
    - VSS enabled on the RAM disk, checking hourly, with the storage used for VSS set to normal disks or SSDs (this wouldn't work well for DBs).
    VSS doesn't work in Win 7, does it?

    Quote Originally Posted by Levish View Post
    ntbackup full daily, differential hourly of the RAM disk; the differentials would overwrite, so you'd only have the most recent diff plus the full. Then back up again to your tape/CD/DVD or whatever for archival (this would work for DBs: just set up backups for the DB and transaction logs).
    Certainly an option; it's just messy -- plus, of course, the system performance drops through the floor while the backups are in progress.

    The other issue is cost, which is considerably higher than an SLC SSD, on a per-GB basis.

    Even so, I am (slightly) tempted by the IoDrive. It's not RAM, so it doesn't have the volatility issues of the Acard. But the cost is still crazy: $3K for just 80GB, and you can't use it as a boot device. The plus side is high random 4K IOPS and low latency -- although the benches I've seen don't seem to be nearly as good as the specs.

  11. #11
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Quote Originally Posted by AceNZ View Post
    So, an array that's mirrored across multiple controllers is bootable? Does it just boot from half of the array?
    The array controller won't have an 'array'; it will just export the two drives. The RAID-1 is created by the OS as a dynamic disk. There is no difference here from having two drives directly attached to your motherboard (except that you get the benefit of a BBU and RAM cache for the drives). You set your BIOS to have both drives (LUNs) as bootable, install your OS on the first drive, and then use Windows Computer Management to convert the disks to dynamic and create a mirror. That gives you the read benefit of RAID-1.

    Quote Originally Posted by AceNZ View Post
    If I need to give up half of the SSDs for data reliability, then I will probably need to move to the X25-M instead of the -E, in order to have enough capacity. Of course, that costs both in terms of write performance and device life / MTBF. At least the hardware is a bit cheaper (80GB version).

    This also raises the question of whether I should wait a month (maybe more) for the upcoming 6gbps SSDs, such as the C300.
    Actually, I would really look at the 160GB G2s if you need them for space, due to the RAID-0+1 of the drives. I would not go much wider than that, as the reliability decreases as you do, although it does increase IOPS: (MTTF/3)^2 in this model, as opposed to a traditional RAID-10, which would be (MTTF^2)/n, where n is the number of drive pairs. That's OK for small arrays like what you are looking at, but it does not scale well. As for 6Gbps SATA/SAS, like I mentioned, it's not much of a problem here, as you are using these drives for IOPS, NOT for streaming performance, so you will only be at 1.5-2.5Gbps saturation depending on request size. These are for your databases / high IOPS.
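    Purely to illustrate how those two expressions scale, here is a quick evaluation that takes the post's formulas at face value; the 1.2M-hour per-drive MTTF and the 3 mirrored pairs are hypothetical values, not vendor numbers:

        # Evaluate the two reliability expressions quoted above (the post's own model).
        # Assumed for illustration: 1.2M-hour per-drive MTTF, 3 mirrored pairs (6 SSDs).
        mttf_hours = 1.2e6
        n_pairs = 3
        striped_mirror_model = (mttf_hours / 3) ** 2       # the RAID-0+1 style figure
        raid10_model = mttf_hours ** 2 / n_pairs           # the traditional RAID-10 figure
        print(f"{striped_mirror_model:.2e} vs {raid10_model:.2e} (model units)")
        # -> the striped-mirror figure is 3x lower here, i.e. it scales worse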


    Quote Originally Posted by AceNZ View Post
    The only affordable SAS option I'm aware of is the Seagate ES.2. They are about a 50% price premium per GB compared to the WD RE3. To stay in my budget, I would therefore have to decrease the number of drives from 13 to 9 or so. Is the improvement from bi-directionality and NCQ really worth that much?

    Do you know if controller latency is any better with the SAS drives? The SAS controllers usually emulate SATA in software, which can add to latency (hence the advantage of the 1231ML, since it's a native SATA controller). Although, I vaguely recall that the ES.2 does a SAS to SATA conversion on the drive side, so maybe it's a wash?

    I would love to use 10^16 BER / 1.6M hour MTBF SAS drives (the 2.5-inch versions would be ideal), but the 6 to 8x higher cost per GB makes them prohibitive for this application.
    If you are looking for 1TB (or actually >600GB) drives, then you are going to be limited to 10^15 BER -- same boat I am in for large storage. You will only get 10^16 on the Cheetah line (using Seagate model names here, as I know them better), and yes, they are more expensive, but they also come at faster RPM ratings (10K & 15K, which do help). However, this array is primarily for large storage and for app/game et al. files, in which case bit density is mainly what you are after. The Constellation drives would be the best, but they are not in the shops yet. If you have luck with another drive vendor, fine. What you want, though, are 'enterprise' or RAID-enabled drives, and ones that are supported by your controller. SAS, personally, would be the way to go to eke out more performance; it's the one thing I've really kicked myself over, being 'wooed' by the SATA price tags. Not worth it (if I had the chance to do it over, I would). A 1TB ES.2 (older tech) would be about $220/drive for the SAS version. It does handle larger queues and has a smoother utilization curve, among other things, which could be helpful if you're pushing things. Otherwise, like I said, bit density is the big item; that translates into streaming performance.

    And no, there is no 'SAS to SATA' conversion on the ES.2. It's the same drive mechanism; you are buying the electronics (controller), and the SAS controller has the better logic, which is what you're paying for. Without knowing your real-world workload, I can't really point you further than that.




    Quote Originally Posted by AceNZ View Post
    Would you still use RAID-50 with 10^15 BER drives like the RE3? Or would RAID-60 be better?
    It's a trade-off, sure, and it comes down to your own comfort level. Two 6-drive arrays would work out as follows (assuming ES.2 SATA here). Both have the same probability of a bit error, as that's based on the number of drives and capacity: 4.69% in a single array, 9.15% across all discs.
    RAID-5: ~790 read / 150 write IOPS; ~9TiB usable space
    RAID-6: ~631 read / 100 write IOPS; ~7.2TiB usable space
    Since these are small arrays, and if I had a cold spare, I would probably go with RAID-5, knowing that you have a window during rebuild where a loss is possible. Sure, if I had more drives I would do RAID-6s, but that costs (you can look at my sig to get an idea). This does NOT mean you don't need backups; you should always have those for data-recovery issues. This is for performance + space, with minimal additional availability to help against drive failure.
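    Those percentages follow from the bit error rate: roughly P = 1 - (1 - 1/BER)^(bits read). A quick check, assuming 1TB drives and the 10^15 BER class discussed above:

        # Probability of at least one unrecoverable bit error when reading n drives end
        # to end, assuming 1TB (1e12-byte) drives and a 1-in-10^15 bit error rate.
        BER_BITS = 1e15
        BITS_PER_DRIVE = 1e12 * 8

        def p_bit_error(n_drives):
            return 1 - (1 - 1 / BER_BITS) ** (n_drives * BITS_PER_DRIVE)

        print(f"one 6-drive array: {p_bit_error(6):.2%}")   # ~4.69%
        print(f"all 12 drives:     {p_bit_error(12):.2%}")  # ~9.15%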


    Quote Originally Posted by AceNZ View Post
    OK, so now we can get back to the question in the OP: which controllers would be best here?
    I've used Areca, LSI, and Adaptec cards. All are basically comparable. The first consideration is expandability/flexibility: if you're looking to grow to more drives or an external chassis, I would really consider SAS, as that helps a lot. I like the Arecas mainly for the user interface and the network port, which has no performance impact (it helps with management when dealing with lots of cards). Otherwise, the Adaptec 58xx series is good, though they do get hot, so you need better cooling in your case. LSI cards are good, though generally slower than Adaptec/Areca within the same line; they also seem to have issues with some of their firmwares being bloated (taking up too much of the 1MiB ROM space, so on 'desktop' systems they can cause issues with boards that load too much into the ROM space).

    All cards have issues/compatibility problems of some sort, as they are mainly tested with server systems only, not desktops, and only with the big players on top of that. If it were me buying today, I would probably pick the Areca 1680ix-12 (the smallest card you can expand the memory on); otherwise it's a toss-up between the Adaptec and the LSI.


  12. #12
    Xtreme Member
    Join Date
    Aug 2009
    Location
    Nelson, New Zealand
    Posts
    367
    @steve -- thanks, I think this makes sense.

    After SteveRo's comments in the Napalm thread, I'm starting to think about using a larger number of less expensive SSDs -- maybe 16 Vertexes or X25-Ms instead of 8 X25-Es, with 8 on each of two 9260-8i controllers (since they seem to be able to handle 8 drives without saturating, unlike the Arecas, AFAIK). Seems like it wouldn't help my small block perf much, but maybe the tradeoff is worth it.

    EDIT: random I/O perf on the Vertex doesn't look too good. The X25-M seems much better.

    So, the next question is whether to go with 4 160GB drives on each controller, or 8 80GB drives on each. I guess your point is that random perf is really driven by the perf of the individual drives, whereas streaming or large block perf is driven by the number of drives -- so to optimize small block it's better to go with a smaller number of higher-performing drives, provided they meet my size requirements. Does that sum it up?
    Last edited by AceNZ; 01-27-2010 at 12:48 AM.

  13. #13
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    The X25-M G2s do seem rather good. I'm still concerned about wear leveling in general on SSDs (I have burned up some SLC-based drives already, though my access patterns are probably /much/ heavier than the norm).

    Random IOPS are a function of the drive's service time. For traditional drives that's pretty much your rotation rate plus seek; for SSDs it's the design & speed of the NAND chips (number of channels, how many cells per block for the erase cycle, et al.). Streaming is mainly based on bit density. RAID itself doesn't really provide improvements unless your queue depth is higher than 1 (and usually much higher than that to really show the benefits); also, for streaming, you need request sizes that span multiple stripes. For writes, likewise, you really need to operate in full stripe widths to avoid the penalties of parity-based RAIDs.
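    As a rough illustration of the service-time point, assuming typical 7200rpm figures of about 8.5ms average seek plus half a rotation of latency (both assumed values):

        # Random IOPS from service time for an assumed 7200rpm drive:
        # ~8.5ms average seek + half a rotation of latency + a small transfer time.
        rpm, avg_seek_ms, transfer_ms = 7200, 8.5, 0.1
        rotational_latency_ms = 60_000 / rpm / 2        # half a revolution, in ms
        service_ms = avg_seek_ms + rotational_latency_ms + transfer_ms
        print(f"~{1000 / service_ms:.0f} random IOPS")  # ~78, the figure used earlier in the thread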

    Parity RAIDs are generally chosen for storage efficiency over performance first, then for availability (RAID-6 > RAID-1 > RAID-10 > RAID-5), then for block-level integrity checking (not to be confused with data integrity, as a RAID doesn't really care about your data). To overcome the write performance hits and to minimize rebuild time, you create multi-level RAIDs (RAIDs that are striped); the simplest is RAID-10 (1+0), but that also extends to any RAID type (5+0, for example, as we talked about above).

    All things being equal, the more drives the better. Your comment above about system performance dropping during backups is very appropriate here. File-based backups (as opposed to block- or file-system-level ones) suck up a lot of overhead. This is a VERY pertinent issue in enterprises, and it's why things like synthetic backups are becoming very popular, as systems can't perform the lookups fast enough to feed the backup systems. (For, say, an NTFS directory full of the normal small file types (.dll's et al.), it takes a LOT of drives to keep a tape drive from shoe-shining: about 80-100 7.2Krpm drives, or 40-50 15K drives, for ~100-120MiB/s.) IOPS are a finite resource which you have to guard in your subsystem design.
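    To see where a figure like that comes from, here is a rough sketch; the ~78 random IOPS per 7.2K drive and the ~16KiB average file size are assumptions for illustration:

        # How many drives it takes to keep a tape drive streaming on small-file backups.
        # Assumed for illustration: ~78 random IOPS per 7.2K drive, ~16 KiB average file.
        drive_iops, avg_file_kib = 78, 16
        per_drive_mib_s = drive_iops * avg_file_kib / 1024    # ~1.2 MiB/s per drive
        for target_mib_s in (100, 120):                       # what the tape drive wants
            print(f"{target_mib_s} MiB/s needs ~{target_mib_s / per_drive_mib_s:.0f} drives")
        # -> roughly 80-100 drives, in line with the figure above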


  14. #14
    Xtreme Addict
    Join Date
    Nov 2003
    Location
    NYC
    Posts
    1,592
    ~0.8MB/sec roughly for a 7200rpm SATA drive at the 4K size (maybe worse) :p
    Considering how fast LTO-2 tape drives can go, stevecs is right as usual -- not to even mention LTO-3 and LTO-4.

    stevecs, for a situation like that, could you do something like backup nesting, where you have the DLLs or whatever backed up by NTBackup into a single file, then copy/move that file off the server being backed up to the server handling the backups, and then back the single file up to tape (or just go straight to tape after the NTBackup)? That should improve transfer rates significantly, to the point where a couple/few drives would be able to provide the required throughput to keep the tape drive fed.

    This would obviously be done with the DLLs backed up ahead of time, before the tape backup runs.

  15. #15
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Doesn't make a difference; NTBackup or anything else still needs to do the file-system lookups and then take the seek hits to get the data. That's where the bottleneck is. A method I am seeing that helps to some degree (besides getting faster subsystems) is synthetic backups, where you do a full backup of a system once and then only do incremental backups forever more. The backup system replays the catalogue, merges the full backup plus X number of incrementals, and creates a new 'full' backup without ever touching the server. This cuts down lookups, as you are only doing incrementals on the server. The problem, of course, is that the more advanced functions of backup systems (file-integrity and de-dup checking, et al.) reverse this, as now you have to go back and look up and read the files again. So it's a tug of war. SSDs really help, simply because they have higher IOPS than traditional media and can absorb the additional load better, but then there are the cost and size factors. Like anything, there is no free ride.


  16. #16
    Xtreme Member
    Join Date
    Aug 2009
    Location
    Nelson, New Zealand
    Posts
    367
    Quote Originally Posted by stevecs View Post
    RAID itself doesn't really provide improvements unless your queue depth is higher than 1 (and usually much higher than that to really show the benefits); also, for streaming, you need request sizes that span multiple stripes.
    There are things you can do to mitigate this, particularly for smaller arrays -- although I think it's easier on conventional rotating media than with SSDs. The problem is that SSDs are already internally striped, so optimization requires dealing with two levels of striping.

    For example, the smaller your QD, the more important choosing an appropriate strip and stripe size becomes. The filesystem cluster size can also have a big impact. For random access, disabling caching and read-ahead often helps, although it depends somewhat on the controller implementation.

    There are also a few things you can do on the Windows side to help increase parallelism. For example, Windows will load multiple DLLs in parallel, but only if they're located on separate drives. So, instead of having a large C: drive or using multiple raidsets, you can split a single volume into two LUNs, and use one for the C: drive, and another for Program Files. Works particularly well on SSDs, since the penalty for seeks is gone. Lots of tricks in SQL Server are possible along these lines, too, since it allocates a single thread per drive letter / LUN.

    Choosing a non-standard RAID level may also help in some scenarios, such as RAID-3 or RAID-4, since they force all drives in the array to be active, even with small request sizes. I've seen that work well on rotating media; I haven't had a chance to try it yet on SSDs.

  17. #17
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    True, for small QDs your stripe size has a larger effect (by percentage); however, this is minimal in real-world practice. Disabling read-ahead is good where your workload request size is small enough and random enough that the next block from the subsystem would not otherwise be requested. With most hardware RAID cards you have options for 'drive read ahead' and 'volume read ahead' or similar. Of those, 'drive read ahead' has the drive read as many blocks as needed to match the stripe size of your array, while 'volume read ahead' reads multiple stripes in anticipation of larger requests. Add on top of that file-system read-ahead (and/or logical volume or OS buffers). All of this needs to be tested; the defaults are just best guesses for general use, which will probably never match a particular case. As for disabling caching (assuming you mean write cache), this is /always/ a good idea for any cache that is not battery backed (I'm not talking UPS here, I'm talking BBU). Some file systems support write barriers, which allow some minimal caching to be enabled (the FS writes through to the drive and waits for a safe-store response). This generally doesn't work with RAID setups, however (RAIDs fake the response).

    Cluster or block sizes of a file system only really have an effect on writes; there is no read effect (barring the read-aheads above, only the blocks that the application requests will be read, regardless of whether that's a full cluster or not, with the limit being that a device can't read less than the physical sector size of the media, i.e. generally 512 bytes). Higher-performance file systems have the concept of allocation groups (think of the file system as being comprised of many file systems, each controlling its own clusters/blocks, so things can be done in parallel); NTFS is not one of these, though NTFS does use some features like b-trees for indexing.

    As for multiple I/O operations at the same time on different file systems, this is what I was alluding to before about wanting different drives. Each file system (NTFS) can only do one task at a time, due to the need to maintain a lock on the metadata. The problem here is that this is not really the OS doing this; it's just the nature of having multiple file systems, so your application is the one that needs to be written to understand some parallelism, or to give you the opportunity to move structures to different file systems.

    RAID-3 is a full-stripe RAID; it works well for video (large files) and some COW (copy-on-write) and log-based file systems. It does not do well with general user data at all, as there is no means to write a partial stripe, so if you do that you will take a hit. RAID-4 (and variants) is nothing more than a RAID-0 plus a separate parity disk (or disks). This can have problems with the parity updates becoming a bottleneck, which was one reason why RAID-5 (distributed parity) became more popular: the same spindle(s) don't become the bottleneck. It works (or can be made to work with some abstraction layers), but you need to know your workload well to make sure you pick the right tool for the job. All RAIDs have all spindles working all the time, so I don't really understand that comment, unless you're referring to IBM's distributed-sparing idea (RAID 1E, 5E, et al.) where even the spare disk is part of the RAID set. That is nice, but by doing it you no longer have a global spare (it's part of the RAID set); on the other hand, since you're always using it, there's less chance of being surprised when a drive dies and the hot spare turns out not to work. You also get the additional drive's IOPS added to the RAID set.

    For SSDs, yes, they have some striping (so to speak) of the cells, or low-level RAIDing, which is important for alignment purposes, but this is really no different from, say, SAN environments or environments using logical volume groupings (i.e. multiple layers of storage that all have to be considered for alignment). The biggest problem here is the lack of vendor information explicitly describing the underlying mechanisms of the drive, which makes it very hard / a lot of legwork to reverse-engineer what is going on and then build up from there. I think it should be standard information on the spec sheet, just like the drive's block size.


  18. #18
    Xtreme Member
    Join Date
    Aug 2009
    Location
    Nelson, New Zealand
    Posts
    367
    Quote Originally Posted by stevecs View Post
    As for disabling caching (assuming you mean write cache) this is /always/ a good idea for any cache that is not battery backed (I'm not talking UPS here, I'm talking BBU).
    I agree about write cache, but I was actually talking about read cache. In some environments, I've seen measurably better perf with read cache disabled. SQL Server is notorious in that way: since it has its own rich cache and manages its own read-ahead, etc., the odds are very small that a remote cache will somehow contain needed data that isn't already cached by SQL Server itself. If the cache were 100% transparent from a performance perspective, it wouldn't matter. Unfortunately, that's not the case for many implementations.

    Quote Originally Posted by stevecs View Post
    Cluster or block sizes of a file system only really have an effect on writes; there is no read effect (barring the read-aheads above, only the blocks that the application requests will be read, regardless of whether that's a full cluster or not, with the limit being that a device can't read less than the physical sector size of the media, i.e. generally 512 bytes).
    I guess I'm not 100% sure of this, but I'm pretty sure that in Windows at least, the cluster size also determines the minimum physical I/O read size, since clusters are the unit of on-disk addressability, not blocks. Apps that are sloppy about how they read files can benefit from larger clusters, because the data they need for the next N reads can already be in RAM. OS read-aheads tend to only get one cluster ahead, and may use an extra I/O, to avoid delaying the originally requested block.

    Quote Originally Posted by stevecs View Post
    The problem here is that this is not really the OS doing this, it's just the nature of having multiple file systems so your application is the one that needs to be written to understand some parallelism or give you the opportunity to move structures to different file systems.
    I agree about the application side, but some related perf issues are OS oriented. Application start-up time, for example. For an app that requires multiple DLLs, if they are spread out among multiple equal-speed LUNs, the OS will issue parallel I/O requests; otherwise, if they're all on the same drive, the requests are serialized.

    Quote Originally Posted by stevecs View Post
    RAID-3 is a full stripe raid, it works well for video (large files) and some COW (copy on write) and log based file systems. It does not do well with general user data at all as there is no means to write a partial stripe so if you do that you will take a hit.
    It can help for read-heavy apps with certain access patterns. By having all drives active on every read with a QD of 1, read throughput is maximized compared to issuing separate requests to each drive and waiting for them to finish. Of course, this assumes that caching is effective and that the per-device internal striping doesn't interfere.

    Quote Originally Posted by stevecs View Post
    RAID-4 (and variants) is nothing more than a raid 0 + separate parity disk (or disks). This can have problems with bottlenecking with parity updates and was one reason why raid-5 (distributed parity) became more popular so as to not have the same spindle(s) be a bottleneck.
    Bottlenecking is only a problem with QD > 1; otherwise, in both RAID-4 and -5, you're always writing one or more data drives and a parity drive. One potential advantage of RAID-4 vs. RAID-5 on SSD is that you could use SLC for the heavily-written parity drive, which should help maximize array life.

    Quote Originally Posted by stevecs View Post
    All raids have all spindles working all the time so I don't understand that comment really unless you're referring to IBM's distributed sparing idea (RAID 1E, 5E, et al) where even the spare disk is part of the raid set.
    With a RAID-0 or RAID-5 array, with QD =1, if my app issues a request that's a single strip in length or less, only a single drive is active. Only with higher QD or larger request sizes will multiple drives be active. I think this is why many desktop users don't experience performance improvements when they move to RAID: at any instant, they are only using a single drive.
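    A tiny sketch of that mapping; the 4-drive array and 64KiB strip here are hypothetical, just to show that a request within one strip touches exactly one member:

        # Which member drives a request touches in a striped array (QD = 1 view).
        # Hypothetical layout: 4 drives, 64 KiB strip (RAID-0; RAID-5 adds rotating parity).
        DRIVES, STRIP = 4, 64 * 1024

        def members_touched(offset_bytes, length_bytes):
            first = offset_bytes // STRIP
            last = (offset_bytes + length_bytes - 1) // STRIP
            return {chunk % DRIVES for chunk in range(first, last + 1)}

        print(members_touched(200 * 1024, 4 * 1024))    # 4 KiB request   -> one drive busy
        print(members_touched(200 * 1024, 256 * 1024))  # 256 KiB request -> all four drives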

    Quote Originally Posted by stevecs View Post
    The biggest problem here is the lack of vendor information explicitly describing the underlying mechanisms of the drive, which makes it very hard / a lot of legwork to reverse-engineer what is going on and then build up from there. I think it should be standard information on the spec sheet, just like the drive's block size.
    +1. The Intel SSDs have ten internal channels, but Intel apparently still considers the details of how those channels work to be proprietary. I have some ideas on how to reverse engineer it, but as you said, it's time consuming.

  19. #19
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Quote Originally Posted by AceNZ View Post
    I agree about write cache, but I was actually talking about read cache. In some environments, I've seen measurably better perf with read cache disabled. SQL Server is notorious in that way: since it has its own rich cache and manages its own read-ahead, etc., the odds are very small that a remote cache will somehow contain needed data that isn't already cached by SQL Server itself. If the cache were 100% transparent from a performance perspective, it wouldn't matter. Unfortunately, that's not the case for many implementations.
    Yes -- if your app is already expanding requests to fit its needs, it is effectively doing the drive read-ahead and volume read-ahead itself from a subsystem standpoint, so keeping those on in the subsystem can add extra milliseconds of latency to each request to fetch data that may never be needed.

    Quote Originally Posted by AceNZ View Post
    I guess I'm not 100% sure of this, but I'm pretty sure that in Windows at least, the cluster size also determines the minimum physical I/O read size, since clusters are the unit of on-disk addressability, not blocks. Apps that are sloppy about how they read files can benefit from larger clusters, because the data they need for the next N reads can already be in RAM. OS read-aheads tend to only get one cluster ahead, and may use an extra I/O, to avoid delaying the originally requested block.
    Get Microsoft's DiskMon application; this is not true for reads -- cluster sizes have no relation to read requests. Even with write requests you do not fill up an entire cluster, only the amount that is needed (the minimum being the block size of the subsystem), though you have a higher chance of filling the cluster (i.e. if it's a directory node or the MFT, since small files are stored there rather than in their own clusters). I posted the link and a small spreadsheet to calculate from it here: http://www.xtremesystems.org/forums/...7&postcount=54


    Quote Originally Posted by AceNZ View Post
    I agree about the application side, but some related perf issues are OS oriented. Application start-up time, for example. For an app that requires multiple DLLs, if they are spread out among multiple equal-speed LUNs, the OS will issue parallel I/O requests; otherwise, if they're all on the same drive, the requests are serialized.
    I think we're saying the same thing, though this is not LUN-based but file-system based. (Technically it's not really parallel either: each file system has its own queue for looking up which blocks are in use, and each block device has its own request queue.) So yes, if you have two LUNs/disks that make up one file system (dynamic disks or another volume manager), both block devices can take read/write requests and run with them when they get them. What I'm saying is that with a single file system across /both/ disks (i.e. C: as a dynamic volume spanning both LUNs), the file-system lookup that happens before the block read/write request is serialized. For reads (where the OS may cache the MFT or lookup information) the block device is generally the slower part, so this still provides some benefit. But on the systems we are building these days (RAM-based block devices, banks of SSDs), this is a bottleneck even at low QD; on heavily used systems (high QD) it shows up even on traditional media, since the OS can't keep everything in memory and the need to write out dirty pages forces a refresh of that data, which is where separating items onto their own file systems comes in handy.
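
    If you want to see the device-level parallelism half of this, a minimal sketch is below (it creates two temp files just so it runs anywhere; the effect only means anything when the files really live on different physical volumes, and the OS page cache will hide it for small files):
    Code:
    # Sketch: read two files through a thread pool so both requests can be in
    # flight at once. Temp files on one volume here purely so the example runs.
    import concurrent.futures, os, tempfile, time

    def make_test_file(size=16 * 1024 * 1024):
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            f.write(os.urandom(size))
        return path

    def read_file(path):
        with open(path, "rb") as f:
            return len(f.read())

    paths = [make_test_file(), make_test_file()]
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        sizes = list(pool.map(read_file, paths))
    print(sizes, f"read in {time.perf_counter() - start:.3f}s")
    for p in paths:
        os.remove(p)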



    Quote Originally Posted by AceNZ View Post
    It can help for read-heavy apps with certain access patterns. By having all drives active on every read with a QD of 1, read throughput is maximized compared to issuing separate requests to each drive and waiting for them to finish. Of course, this assumes that caching is effective and that the per-device internal striping doesn't interfere.
    Same thing here: all drives won't be used on reads even with these RAID setups. The only level that would have all drives active on all reads (and writes) would be RAID-2, but due to its design it's very costly (it works best when the number of drives matches the host OS's page size so you operate at that level). No one makes a RAID-2 implementation anymore; the last I saw was, I believe, for a Unisys mainframe environment ages ago (80's?).

    Quote Originally Posted by AceNZ View Post
    Bottlenecking is only a problem with QD > 1; otherwise, in both RAID-4 and -5, you're always writing one or more data drives and a parity drive. One potential advantage of RAID-4 vs. RAID-5 on SSD is that you could use SLC for the heavily-written parity drive, which should help maximize array life.
    Yes, if your QD is 1 you don't have a bottleneck, only perhaps a latency concern. However, with QD > 1 and RAID-4 you have a dedicated parity drive, and writes take longer than reads to perform; the stripe (at least the drives involved in the write) also needs to be read to recalculate the parity, so you still have multiple requests per write (some bad implementations re-read the entire stripe width and recompute parity for the whole stripe). Using SLC SSDs here may help to some degree: assuming garbage collection is keeping up and you don't have a lot of delays freeing up cells for all the writes, they are faster, so you don't have the disparity of the extra 1-2 ms of write settle time. However, you will run into another problem in that this is probably the worst workload you can give an SSD and it will cut its life; it would be nearly 100% small writes (minus the data-scrub read checks every week or so).
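
    To spell out the read-modify-write path (values are arbitrary; this is just the parity arithmetic, not any controller's implementation), a single small write to one strip costs four I/Os:
    Code:
    # Sketch of a RAID-4/5 small-write update: read old data, read old parity,
    # write new data, write new parity = old_parity XOR old_data XOR new_data.

    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    old_data   = bytes([0x11] * 8)   # current contents of the target strip
    old_parity = bytes([0xA5] * 8)   # current parity strip
    new_data   = bytes([0x42] * 8)   # data being written

    new_parity = xor_bytes(xor_bytes(old_parity, old_data), new_data)

    io_ops = ["read old data", "read old parity", "write new data", "write new parity"]
    print(len(io_ops), "I/Os for one small write:", io_ops)
    print("new parity:", new_parity.hex())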

    Quote Originally Posted by AceNZ View Post
    With a RAID-0 or RAID-5 array, with QD =1, if my app issues a request that's a single strip in length or less, only a single drive is active. Only with higher QD or larger request sizes will multiple drives be active. I think this is why many desktop users don't experience performance improvements when they move to RAID: at any instant, they are only using a single drive.
    Assuming your request doesn't span a stripe boundary, yes (and even then it's at most two drives). And you're right about why home users often don't 'feel' the benefits when they move to RAID: their workloads are very small.

    |.Server/Storage System.............|.Gaming/Work System..............................|.Sundry...... ............|
    |.Supermicro X8DTH-6f...............|.Asus Z9PE-D8 WS.................................|.HP LP3065 30"LCD Monitor.|
    |.(2) Xeon X5690....................|.2xE5-2643 v2....................................|.Minolta magicolor 7450...|
    |.(192GB) Samsung PC10600 ECC.......|.2xEVGA nVidia GTX670 4GB........................|.Nikon coolscan 9000......|
    |.800W Redundant PSU................|.(8x8GB) Kingston DDR3-1600 ECC..................|.Quantum LTO-4HH..........|
    |.NEC Slimline DVD RW DL............|.Corsair AX1200..................................|........ .................|
    |.(..6) LSI 9200-8e HBAs............|.Lite-On iHBS112.................................|.Dell D820 Laptop.........|
    |.(..8) ST9300653SS (300GB) (RAID0).|.PA120.3, Apogee, MCW N&S bridge.................|...2.33Ghz; 8GB Ram;......|
    |.(112) ST2000DL003 (2TB) (RAIDZ2)..|.(1) Areca ARC1880ix-8 512MiB Cache..............|...DVDRW; 128GB SSD.......|
    |.(..2) ST9146803SS (146GB) (RAID-1)|.(8) Intel SSD 520 240GB (RAID6).................|...Ubuntu 12.04 64bit.....|
    |.Ubuntu 12.04 64bit Server.........|.Windows 7 x64 Pro...............................|............... ..........|

  20. #20
    Xtreme Member
    Join Date
    Mar 2008
    Posts
    220
    Quote Originally Posted by Computurd View Post
    the 1231 is sata 3gb.s. anyone investing money in a 3gb/s card right now is mentally retarded.
    For SSDs, sure. For mechanical SATA disks... 1.5Gbps is generally fine, no?
    Core i7 920 watercooled @ 4.2ghz, 12GB Corsair Dominator GT C7, Gigabyte X58 Extreme,
    Cosmos S, 2x EVGA GTX 470 watercooled, Blu-Ray, DVD RW
    All to drive a 47" 1920x1080 monitor :-)
    My Site: http://www.servethehome.com

    Current Project: The Big WHS (40 Drives) Build Progress

  21. #21
    Xtreme Addict
    Join Date
    Nov 2003
    Location
    NYC
    Posts
    1,592
    I've seen 120->90MB/sec on my 150GB VR, and with some short stroking you could possibly see more with some of the newer drives. For anything involving cache it could have a pretty severe impact.
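
    Rough rule-of-thumb numbers (not a spec-accurate model): SATA uses 8b/10b encoding, so the payload ceiling is roughly a tenth of the line rate in bytes, before protocol overhead. A ~120 MB/s mechanical drive fits under 1.5Gbps for sustained transfers, but bursts out of the drive's cache can exceed it:
    Code:
    # Approximate SATA payload ceilings after 8b/10b encoding (overhead ignored).

    def sata_payload_mb_s(line_rate_gbps):
        return line_rate_gbps * 1e9 / 10 / 1e6  # MB/s, rough ceiling

    for rate in (1.5, 3.0, 6.0):
        print(f"SATA {rate} Gbps -> ~{sata_payload_mb_s(rate):.0f} MB/s payload ceiling")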

  22. #22
    Xtreme Member
    Join Date
    Aug 2009
    Location
    Nelson, New Zealand
    Posts
    367
    @stevecs -- thanks for your help and feedback; it has been very useful.

  23. #23
    Xtreme Member
    Join Date
    Mar 2008
    Posts
    220
    stevecs: Do you have any pics of the 65+ drive array enclosure? Which expanders are you using with those Arecas? I've been having a bear of a time working with 30+ drives, which will most likely end up sitting in 2x Norco 4Us with an expander for each once I update the HP SAS expander firmware.
    Core i7 920 watercooled @ 4.2ghz, 12GB Corsair Dominator GT C7, Gigabyte X58 Extreme,
    Cosmos S, 2x EVGA GTX 470 watercooled, Blu-Ray, DVD RW
    All to drive a 47" 1920x1080 monitor :-)
    My Site: http://www.servethehome.com

    Current Project: The Big WHS (40 Drives) Build Progress

  24. #24
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    No pictures of the build, at least nothing that resembles the current setup. It's actually 80 drives, though I'm only using 64 plus hot spares currently. It is (5) AIC RSC-3EG2-80R-SA1S-0C-R chassis (http://www.aicipc.com/ProductSKU.aspx?ref=RSC-3EG2-0). I have 2 (soon to be 3) ARC-1680ix cards running them. Currently one card runs 3 chassis (daisy-chained) and the other runs 2; eventually one card will handle the last 16-bay chassis plus the OS drives, and the other two cards will have 2 chassis each. Each chassis (16 drives) is two 8-drive RAID-6 sets (a total of 8 RAID-6 volumes), which are then striped (LVM) into a large media pool out of which I create whatever logical volumes I need. This lets me get the performance of all the drives. The extra chassis will let me grow by another 2 RAID sets (10 total / 80 drives), since I don't think 2TB SAS drives with the BER ratings I'm looking for will be available for at least 18 months, so I'll have to make do with 1TB drives.
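
    The geometry works out like this (the 1TB drive size is from above; the arithmetic is just a sanity check, before LVM and file-system overhead):
    Code:
    # Worked numbers for the layout described above.

    chassis          = 5
    bays_per_chassis = 16
    set_size         = 8    # drives per RAID-6 set
    parity_per_set   = 2    # RAID-6
    raid6_sets       = 8    # currently populated
    drive_tb         = 1

    total_bays  = chassis * bays_per_chassis
    drives_used = raid6_sets * set_size
    usable_tb   = raid6_sets * (set_size - parity_per_set) * drive_tb

    print(f"{total_bays} bays, {drives_used} drives in active RAID sets")
    print(f"usable capacity before LVM/filesystem overhead: {usable_tb} TB")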

    Physically the chassis have only a single data path (i.e. not dual-ported), which is OK since no RAID cards support that anyway, and even though I /may/ switch to BTRFS and standard HBAs in the future, that's several years out, so I don't really need the multi-pathing now. It's then just a simple matter of using an SFF-8087 (backplane on the 3EG0) to external SFF-8088 adapter (two per chassis) to daisy-chain them together (I'm using the Supermicro ones, but any would do). You are limited to 7 chassis per controller (Areca), though, as the card has a built-in expander. Not that big a deal; I would rather be able to fit more than 4 cards per system to get higher performance.

    I've never used the Norcos; with the 3EG2s at only ~$1200-$1300 they're cheap enough, plus they use 120x38mm fans, so you can find quieter ones than the defaults (which is advisable if, like me, you're using the case as just a drive case and don't need the high airflow).

    The only problem is running out of rack space; the new tape library I'm looking at will take up 6U, and I really should have picked up a larger rack (everything is in a 25U rack currently).

    |.Server/Storage System.............|.Gaming/Work System..............................|.Sundry...... ............|
    |.Supermicro X8DTH-6f...............|.Asus Z9PE-D8 WS.................................|.HP LP3065 30"LCD Monitor.|
    |.(2) Xeon X5690....................|.2xE5-2643 v2....................................|.Minolta magicolor 7450...|
    |.(192GB) Samsung PC10600 ECC.......|.2xEVGA nVidia GTX670 4GB........................|.Nikon coolscan 9000......|
    |.800W Redundant PSU................|.(8x8GB) Kingston DDR3-1600 ECC..................|.Quantum LTO-4HH..........|
    |.NEC Slimline DVD RW DL............|.Corsair AX1200..................................|........ .................|
    |.(..6) LSI 9200-8e HBAs............|.Lite-On iHBS112.................................|.Dell D820 Laptop.........|
    |.(..8) ST9300653SS (300GB) (RAID0).|.PA120.3, Apogee, MCW N&S bridge.................|...2.33Ghz; 8GB Ram;......|
    |.(112) ST2000DL003 (2TB) (RAIDZ2)..|.(1) Areca ARC1880ix-8 512MiB Cache..............|...DVDRW; 128GB SSD.......|
    |.(..2) ST9146803SS (146GB) (RAID-1)|.(8) Intel SSD 520 240GB (RAID6).................|...Ubuntu 12.04 64bit.....|
    |.Ubuntu 12.04 64bit Server.........|.Windows 7 x64 Pro...............................|............... ..........|
