
Thread: Not getting more than 65000 IOPS with xtreme setup

  1. #26
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Ok, saw your comment on # procs and verified it here on a quad core system as well, so there's a bug with that. I sent an e-mail out to the author; it never really cropped up before (much lower IOPS, so nothing was really being saturated anyway). Though in your case, even w/ 256 commands it's taking only ~3.5-4% CPU, which is not bad at all. The write tests you can ignore/stop. It's interesting to note that they are very low, but then again I would have expected that due to the nature of SSDs. More of a data point, but with databases you would probably do an initial load and most traffic would be read on these anyway, right?

    As for your read iops, xdd has historically been pretty good at giving readings close to theoretical 'worst-case' performance (ie, 100% random workloads without caching), which is generally lower than a lot of the numbers presented on the web (most don't go for worst-case scenarios). Anyway, without going through the bm_flash code to see what it is doing, the delta of ~20K iops could be due to caching or workload differences.

    Can you try xdd against your 2x5GB ram disks w/ lvm? With bm_flash you got ~190K; I want to see if we are running into another issue with the testing software itself.

    Also you may want to try a run w/ deadline as the scheduler (a fast run with either bm_flash or xdd) to see how that compares to noop (unless you've already tried that).
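    For reference, switching the elevator on a live system is just a sysfs write; a minimal sketch, assuming the two array members show up as sdb and sdc (adjust to your device names):

    Code:
    # current scheduler is shown in brackets
    cat /sys/block/sdb/queue/scheduler
    # switch both array members to deadline for the test run
    echo deadline > /sys/block/sdb/queue/scheduler
    echo deadline > /sys/block/sdc/queue/scheduler
    # and back to noop afterwards
    echo noop > /sys/block/sdb/queue/scheduler
    echo noop > /sys/block/sdc/queue/scheduler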

    What slots do you have your cards in on your system? And do you have anything else in any other slots? (Ideally it would be just the cards, with the box headless.) From the supermicro site it looks like J5 & J6 are the ones you want to use, as they are tied directly to the MCH at full speed.

    Also just found this doing some searching: http://www.usenix.org/events/lsf08/t...cardi_SATA.pdf
    Last edited by stevecs; 03-10-2009 at 01:53 PM.


  2. #27
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    Quote Originally Posted by stevecs View Post
    though if you did let it run it should show scalability based on queue depth which may be interesting if there is a leveling off at some point which we can map back to a bottleneck.
    Ok, that's indeed an interesting thing to test. I ran the test for various queue depths with 10 passes for each depth, except for the 128 case, which I copied from an earlier test. These are the results:

    Code:
                           T  Q   Bytes          Ops        Time      Rate        IOPS        Latency    %CPU   OP_Type     ReqSize 
    TARGET   Average     0  8   171798691840   41943040   1080.618  158.982     38813.93    0.0000     0.04   read        4096
    TARGET   Average     0  16  171798691840   41943040   863.128   199.042     48594.24    0.0000     0.12   read        4096
    TARGET   Average     0  32  171798691840   41943040   795.252   216.031     52741.85    0.0000     0.28   read        4096
    TARGET   Average     0  64  171798691840   41943040   773.569   222.086     54220.14    0.0000     0.61   read        4096
    TARGET   Average     0 128   34359738368    8388608   152.927   224.681     54853.83    0.0000     1.33   read        4096
    TARGET   Average     0 256  171798691840   41943040   765.943   224.297     54760.01    0.0000     3.31   read        4096
    So 32 threads already more or less saturate the controller and beyond 64 threads there really is no gain anymore.
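    For reference, the sweep itself is just a loop around xdd; roughly like this (the target path is a placeholder, and the flags mirror the script used for these runs):

    Code:
    for QD in 8 16 32 64 128 256 ; do
        xdd -verbose -op read -target /ssd/xddtest.dat -blocksize 512 -reqsize 8 \
            -mbytes 16384 -passes 10 -dio -seek random -queuedepth $QD
    done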

  3. #28
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    When you're running xdd w/ >32 threads, what is your output from vmstat 5 while that's running? Likewise when you do an iostat -m -x 15 (looking for averages; in vmstat, for example, to see if you have more context switches than interrupts, or what % is spent in i/o wait). iostat can be interesting for service time & utilization. I would also like to see an xdd read test against your ram disk, to make sure we're not hitting a top limit in xdd itself.
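    Something as simple as this captures both while the benchmark runs (a sketch; the log file names are arbitrary):

    Code:
    # start the collectors in the background, run xdd in another shell
    vmstat 5 > vmstat.log &
    iostat -m -x 15 > iostat.log &
    # ...run the xdd test...
    # then stop the collectors
    kill %1 %2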

    Oh, and I noticed that the request size was wrong in the script: it's set to 8, which is actually that number times your block size of 512 bytes, so it's testing 4096 bytes (1 page), not the 8K that you wanted, which would be a request size of 16. Shouldn't make much of a difference in numbers based on your earlier tests, but something to be aware of.
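    Spelled out (the target path is just a placeholder, everything else as in your script):

    Code:
    # -reqsize is in units of -blocksize (512 bytes here):
    #   -reqsize  8  ->  8 x 512 = 4096 bytes (4 KiB, what the script tests now)
    #   -reqsize 16  -> 16 x 512 = 8192 bytes (8 KiB, what you wanted)
    xdd -op read -target /ssd/xddtest.dat -blocksize 512 -reqsize 16 \
        -mbytes 16384 -passes 10 -dio -seek random -queuedepth 64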
    Last edited by stevecs; 03-10-2009 at 03:55 PM.


  4. #29
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    Quote Originally Posted by stevecs View Post
    When you're running xdd w/ >32 threads, what is your output from vmstat 5 while that's running? Likewise when you do an iostat -m -x 15 (looking for averages; in vmstat, for example, to see if you have more context switches than interrupts, or what % is spent in i/o wait). iostat can be interesting for service time & utilization. I would also like to see an xdd read test against your ram disk, to make sure we're not hitting a top limit in xdd itself.
    I did take a look at iostat earlier. I didn't pay a lot of attention to context switches there, but I did notice that the tps reported by iostat was nearly equal to the IOPS reported by bm-flash, but differed from the IOPS reported by xdd. E.g. if bm-flash reports 60k, then that's basically what iostat is reporting too, but when iostat says 60k, xdd says ~54k.

    I'll do some more testing tomorrow for sure. It's now 1:15 AM in Amsterdam.

    edit: sleep is so overrated :P so I did a quick test (8 threads, 8KiB request size) with XDD to look at iostat -m -x 15:

    Code:
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.23    0.00    4.70   24.37    0.00   70.70
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
    sda               0.00     0.53    0.00    4.07     0.00     0.02     9.05     0.00    0.00   0.00   0.00
    sda1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sda2              0.00     0.53    0.00    4.07     0.00     0.02     9.05     0.00    0.00   0.00   0.00
    sdb            1607.73     0.00 16649.67    0.00   116.30     0.00    14.31     5.71    0.34   0.06  99.71
    sdc            1603.60     0.00 16636.60    0.00   116.16     0.00    14.30     2.43    0.15   0.06  92.77
    dm-0              0.00     0.00    0.00    4.60     0.00     0.02     8.00     0.00    0.00   0.00   0.00
    dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    md0               0.00     0.00 39750.33    0.00   245.17     0.00    12.63     0.00    0.00   0.00   0.00
    and a couple of lines from vmstat 5:

    Code:
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
     3  8   1572 16365168     48 15838276    0    0 14989  3616  131   95  0  3 85 12
     0  8   1572 16364920     48 15838300    0    0 239958     0 29942 63498  0  5 70 25
     0  8   1572 16364912     48 15838300    0    0 239763     0 29972 63240  0  5 70 25
     2  8   1572 16364912     48 15838300    0    0 235338     0 29507 61952  0  5 71 24
     2  8   1572 16364912     48 15838300    0    0 239374     0 29844 63689  0  5 70 26
     1  8   1572 16364912     48 15838300    0    0 239586     0 29948 64244  0  5 69 26
    Quote Originally Posted by stevecs View Post
    Oh, and I noticed that the request size was wrong in the script: it's set to 8, which is actually that number times your block size of 512 bytes, so it's testing 4096 bytes (1 page), not the 8K that you wanted, which would be a request size of 16.
    I did notice that, but thanks for mentioning it. Is the block size in xdd btw only a unit for calculations, or does it actually have significance for the way load is put on the device? I.e. is block size 512 bytes and request size 16 the same as block size 1024 and request size 8, or do those differ?
    Last edited by henk53; 03-10-2009 at 04:19 PM.

  5. #30
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    That's a high context switch count compared to interrupts, but (vmstat) you are in an i/o wait state 25% of the time while you still have idle cpu cycles. Another interesting item w/ the xdd output (don't know if you noticed): even though your iops stayed constant after, say, a queue depth of 32, the cpu time kept increasing. In regards to that (or more accurately the # of processors detected, which will allow us to use multiple targets on the same array and split the load up better to get more threads), I got a response back from Tom Ruwart (the author of xdd) and he actually fixed the # processor issue today. He should be sending out the corrected version when he gets back to a network.
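    The idea, once the fix is in, would be something along these lines (going from memory on xdd's multi-target syntax, so treat it only as a sketch; the file names are placeholders):

    Code:
    # split the load across two target files on the same array so each
    # target gets its own queue/thread pool
    xdd -verbose -op read -targets 2 /ssd/xddtest1.dat /ssd/xddtest2.dat \
        -blocksize 512 -reqsize 16 -mbytes 16384 -passes 5 \
        -dio -seek random -queuedepth 64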

    As for block size calcs, yes, that's mainly for unit calculations; ultimately it doesn't really have much to do w/ the subsystem. (For example, enterprise drives don't always have 512 byte blocks: for system disks yes, but for sans or large arrays it's usually 520, and I've even seen 524 bytes/block.)

    If you get a chance (I know it's late there), run a test of xdd against your ram disks just for comparison. Hopefully we'll get the updated xdd shortly. Not that it really matters in the end (ie, it doesn't really matter whether we use xdd or bm-flash, just that xdd seems to have more granular controls for testing); there is no real industry standard benchmark. I personally like xdd as it is closer in line (at least w/ traditional media) with worst-case iops. In reality it's never really /that/ bad, but it gives a lower limit as to how the system would perform, so when I give numbers to clients I'm safe.

    Not to add items (there's a lot outstanding to test already), but if space (capacity) is not an issue, you /may/ also increase iops by using two raid-10's. With raid0 or even raid-5 you are limited in that you have only one copy of the data; with raid-10 you have two copies, so your raid controller should send requests to the least busy drive of each mirror pair. Remember when you do this that your data stripe width is 1/2 the drives.


  6. #31
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    Quote Originally Posted by stevecs View Post
    The write tests you can ignore/stop. It's interesting to note that they are very low, but then again I would have expected that due to the nature of SSDs. More of a data point, but with databases you would probably do an initial load and most traffic would be read on these anyway, right?
    That's indeed the nature of most DB loads, I think. We do have a process running that continuously inserts about 12 rows per second, but that only concerns one large table. Most everything else is ~90% read, ~10% write.

    Quote Originally Posted by stevecs View Post
    Can you try xdd against your 2x5GB ram disks w/ lvm? With bm_flash you got ~190K; I want to see if we are running into another issue with the testing software itself.
    I'm going to try this soon. Currently the test server is occupied by a co-worker doing some other tests on it, but I will resume my testing soon. It would also be interesting to see what the xdd results are for a single Areca card; 'hopefully' those numbers will be lower than what xdd gives for the two Areca cards striped. I'll surely try the deadline scheduler too.

    Quote Originally Posted by stevecs View Post
    What slots do you have your cards in on your system? And do you have anything else in any other slots? (Ideally it would be just the cards, with the box headless.) From the supermicro site it looks like J5 & J6 are the ones you want to use, as they are tied directly to the MCH at full speed.
    There is nothing in any of the other slots; it's a pure headless server. The slots we put them in are indeed the ones connected directly to the MCH. Initially we put them in the slots connected to the south bridge and got performance numbers that were quite a bit lower. We also tried a hybrid approach with one card in a 'north bridge' slot and one card in a south bridge slot, but both cards in the north bridge slots gave the best results. It thus definitely pays off at this level to put your cards in the right slots.
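    For anyone wanting to double-check this, the negotiated link width can be read straight from lspci; a quick sketch (run as root):

    Code:
    # compare the advertised (LnkCap) vs. negotiated (LnkSta) PCIe link
    # width/speed per device, then look at the RAID controller entries
    lspci -vv | grep -E "^[0-9a-f]|LnkCap:|LnkSta:"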

    I was a bit surprised, though, that not a lot of server motherboards are available that put emphasis on fast PCIe slots. The fastest one can get seems to be 2x PCIe x8, and very few to almost none of the manufacturers actually advertise this point. In contrast, desktop motherboards these days can be had with multiple fast PCIe x16 slots. This is quite different from the days when desktop computers came with slower 32-bit/33MHz PCI slots while servers had the 64-bit/66MHz versions.

  7. #32
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Yeah, I've run into that problem for all platforms/vendors. I've even seen various 'stupid admin' designs where they put, say, a serial card in a 64bit/133mhz slot but then put the SAN hba in a 32bit/33mhz one. The board I've got my eyes on now is the Supermicro X8DAH+ as it has two tylersburg-36D chipsets on it (the only one I've found so far w/ two); for I/O it's going to be a killer if they don't castrate something else. You don't really need more than PCIe x8 v1 speeds as of yet, as the cards that are available can't handle more than that anyway (IOP34x are maxed out). What you're doing is generally what is done (putting multiple cards in a system and lvming/striping across them). Here at the datacenter we don't have any SSD's really deployed to any of the clients; the largest bank of drives we have is about 500-600 (not including sans), but even that large one is not for a single db instance and is hooked up via 4gbit fc, so there are a lot of built-in bottlenecks (it doesn't help matters that they bought a lower-end sun box; they should have had at least a T5440 from the sparc line).


  8. #33
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    Quote Originally Posted by stevecs View Post
    The board I've got my eyes on now is the Supermicro X8DAH+ as it has two tylersburg-36D chipsets on it (the only one I've found so far w/ two). For I/O it's going to be a killer if they don't castrate something else.
    It sounds interesting, thanks for the tip.

    Quote Originally Posted by stevecs View Post
    You don't really need more than PCIe x8 v1 speeds as of yet, as the cards that are available can't handle more than that anyway (IOP34x are maxed out).
    Indeed, which is something else I don't understand. There don't seem to be any faster RAID cards coming any time soon; at least, no announcements have been made by Intel, Areca or Adaptec. Clearly many of us are hitting a wall with the speed offered by the current generation of IOPs, but there doesn't seem to be any improvement coming in the short term.


    Quote Originally Posted by stevecs View Post
    Here at the datacenter we don't have any SSD's really deployed to any of the clients; the largest bank of drives we have is about 500-600 (not including sans)
    Wow, that's something nice indeed to play with :P

    I did have the opportunity to run some more tests today. I started with tests for an 8KiB request size, and the numbers turned out to be significantly lower. Just to be sure that something else hadn't changed in the system (as I mentioned, my co-worker did some tests too), I re-ran all the 4KiB tests, and for every number of threads (queue depth) the same results as before were reported. To be really sure I then re-ran all the 8KiB tests again (taking about 1 hour), but these gave exactly the same results as the earlier run.

    Here they are. This is again for 8 disks per raid controller, 2 controllers, LVM striped, 8KiB array stripe size, NOOP scheduler, averaged over 10 passes, and thus with an 8KiB request size this time. I specifically checked that none of the passes were out of range, and none were; every pass reported nearly the exact same number of IOPS.

    Code:
                           T  Q   Bytes          Ops        Time      Rate        IOPS        Latency    %CPU   OP_Type     ReqSize 
    TARGET   Average     0   4  171798691840   20971520   1020.919  168.278     20541.81    0.0000     0.01   read        8192
    TARGET   Average     0   8  171798691840   20971520   702.962   244.393     29833.08    0.0000     0.03   read        8192
    TARGET   Average     0  16  171798691840   20971520   594.507   288.977     35275.51    0.0000     0.08   read        8192
    TARGET   Average     0  32  171798691840   20971520   560.851   306.318     37392.34    0.0000     0.18   read        8192
    TARGET   Average     0  64  171798691840   20971520   548.917   312.978     38205.30    0.0000     0.38   read        8192
    TARGET   Average     0 128  171798691840   20971520   545.725   314.808     38428.76    0.0000     0.82   read        8192
    TARGET   Average     0 256  171798691840   20971520   545.344   315.028     38455.59    0.0000     2.29   read        8192
    As can be seen, in this case we already almost max out at 16 threads. Going beyond that only very marginally increases the IOPS.

    For completeness I also tested with block size 2KiB, although that size isn't really important for my live load:

    Code:
                           T  Q   Bytes          Ops        Time       Rate        IOPS        Latency    %CPU   OP_Type     ReqSize 
    TARGET   Average     0   4  171798691840   83886080   2856.671    60.139     29364.98    0.0000     0.01   read        2048
    TARGET   Average     0   8  171798691840   83886080   1813.242    94.747     46263.03    0.0000     0.05   read        2048
    TARGET   Average     0  16  171798691840   83886080   1395.675   123.094     60104.29    0.0000     0.16   read        2048
    TARGET   Average     0  32  171798691840   83886080   1258.457   136.515     66657.87    0.0000     0.40   read        2048
    TARGET   Average     0  64  171798691840   83886080   1216.341   141.242     68965.91    0.0000     0.90   read        2048
    TARGET   Average     0 128  171798691840   83886080   1205.534   142.508     69584.15    0.0000     1.91   read        2048
    TARGET   Average     0 256  171798691840   83886080   1217.721   141.082     68887.78    0.0000     3.90   read        2048
    In this case IOPS increase up to 32 threads, then very slightly decrease after 128. With 128 threads I did notice an awkward imbalance between the two base devices in the numbers reported by iostat:

    Code:
    Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
    sda               0.00     0.00    0.00    0.13     0.00     0.00     8.00     0.00    4.00   2.00   0.03
    sda1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sda2              0.00     0.00    0.00    0.13     0.00     0.00     8.00     0.00    4.00   2.00   0.03
    sdb               0.00     0.00 35684.80    0.00    68.09     0.00     3.91    66.00    1.85   0.03 100.00
    sdc               0.00     0.00 35833.27    0.00    68.39     0.00     3.91     4.30    0.12   0.03  99.07
    dm-0              0.00     0.00    0.00    0.13     0.00     0.00     8.00     0.00    4.00   2.00   0.03
    dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    md0               0.00     0.00 73157.07    0.00   139.68     0.00     3.91     0.00    0.00   0.00   0.00
    avgqu-sz and await for sdb are way more than 10 times higher than for sdc. With 256 threads I saw the same thing. With 64 threads the difference was there too, but a little smaller (almost exactly a factor of 10):

    Code:
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.68    0.00   17.21   69.68    0.00   12.42
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
    sda               0.00     0.00    0.00    0.20     0.00     0.00     8.00     0.00    2.67   1.33   0.03
    sda1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sda2              0.00     0.00    0.00    0.20     0.00     0.00     8.00     0.00    2.67   1.33   0.03
    sdb               0.00     0.00 35177.27    0.00    67.12     0.00     3.91    47.15    1.34   0.03 100.00
    sdc               0.00     0.00 35127.20    0.00    67.04     0.00     3.91     4.17    0.12   0.03  98.75
    dm-0              0.00     0.00    0.00    0.20     0.00     0.00     8.00     0.00    2.67   1.33   0.03
    dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    md0               0.00     0.00 71920.27    0.00   137.31     0.00     3.91     0.00    0.00   0.00   0.00
    During the 64 threads run, vmstat showed this:

    Code:
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
     3 40   1572 12645792     52 15683252    0    0  9715  1252   20   17  0  3 85 12
     7 63   1572 12645776     52 15683252    0    0 138445     0 50310 276128  1 17 12 70
     4 46   1572 12645776     52 15683252    0    0 138288     0 50262 276661  0 17 13 69
     1 53   1572 12645768     52 15683252    0    0 138387     0 50337 274764  1 17 13 69
     2 40   1572 12645768     52 15683252    0    0 135370     0 49442 268276  2 17 13 68
     2 55   1572 12645760     52 15683252    0    0 138110     0 50192 275198  1 18 11 71
     3 45   1572 12645760     52 15683252    0    0 138575     0 50218 275155  1 17 12 70
     5 53   1572 12645760     52 15683252    0    0 135070     1 49743 264514  1 17 14 69
    Last edited by henk53; 03-11-2009 at 01:25 PM.

  9. #34
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Got pulled into a project at work, been working /your/ hours for the past bit. :P

    Your iostat there is very interesting with the average queue size. What do you see when running w/ bm_flash? I wonder if there is something up w/ xdd.

    Did you run xdd against a couple of ram disk partitions for comparison? Also, what file system do you have on there, or are you testing against the raw device? If it's xfs, what is your ag count & size?
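    For reference, either of these will show it (the mount point and device are placeholders):

    Code:
    # xfs: allocation group count & size for a mounted filesystem
    xfs_info /mountpoint
    # jfs: dump the superblock info, which should include the ag size
    jfs_tune -l /dev/mapper/vg-lv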


  10. #35
    Xtreme Addict
    Join Date
    Jun 2006
    Posts
    1,820
    Guys, I lost you 10 posts ago, but when you finally figure it out, do let us know in plain English what the issue was (or where the bottleneck is )

  11. #36
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    Quote Originally Posted by stevecs View Post
    Got pulled into a project at work, been working /your/ hours for the past bit. :P
    :P Got pulled into several other things too. Meanwhile my co-worker has reconfigured the machine: he installed a second 1231 and rewired the drives, so the machine now physically has 6 drives attached to each 1231 and 4 drives attached to the 1680.

    Quote Originally Posted by stevecs View Post
    Your iostat there is very interesting with the average queue size. What do you see when running w/ bm_flash? I wonder if there is something up w/ xdd.
    I'll look at the average queue size for the dual-card config with bm_flash when I start testing the striped configs again. It would indeed be interesting to see what the results are there.

    Currently my co-worker has configured a RAID 6 array, so I've been running some tests against that. Although it's a little difficult to compare a single RAID 6 array to two LVM striped RAID 0 arrays, I did notice some new things.

    For starters, for an xdd test run on a single RAID card, the number of IOPS as reported by iostat is now nearly exactly the same as what xdd reports. For various block sizes (request sizes) it appeared that the dual RAID 0 config was quite a bit faster for block sizes 2KiB and 4 KiB, but was about the same for block size 8KiB.

    Here is the output (I edited the column names a little to correspond with bm-flash):

    Code:
    1x6xRAID6		noop									
    		Threads	Bytes		Ops		Time	Rate	IOPS		Latency	%CPU	OP_Type	BlockSize
    TARGET	Average	4	51539607552	50331648	1324.04	38.93	38013.58	0	0.02	read	1024
    TARGET	Average	8	51539607552	50331648	968.65	53.21	51960.7		0	0.06	read	1024
    TARGET	Average	16	51539607552	50331648	929.48	55.45	54150.42	0	0.13	read	1024
    TARGET	Average	32	51539607552	50331648	939.8	54.84	53555.54	0	0.26	read	1024
    TARGET	Average	64	51539607552	50331648	944.16	54.59	53308.57	0	0.53	read	1024
    TARGET	Average	128	51539607552	50331648	949.7	54.27	52997.57	0	1.12	read	1024
    TARGET	Average	256	51539607552	50331648	949.84	54.26	52989.66	0	3.13	read	1024
    TARGET	Average	128	51539607552	25165824	496.72	103.76	50663.57	0	1.06	read	2048
    TARGET	Average	4	51539607552	12582912	430.33	119.77	29239.91	0	0.01	read	4096
    TARGET	Average	8	51539607552	12582912	308.72	166.94	40757.88	0	0.04	read	4096
    TARGET	Average	16	51539607552	12582912	267.53	192.65	47033		0	0.11	read	4096
    TARGET	Average	32	51539607552	12582912	268.61	191.87	46843.96	0	0.22	read	4096
    TARGET	Average	64	51539607552	12582912	269.98	190.9	46607.29	0	0.46	read	4096
    TARGET	Average	128	51539607552	12582912	271.54	189.81	46339.54	0	0.96	read	4096
    TARGET	Average	256	51539607552	12582912	271.48	189.85	46349.2		0	2.75	read	4096
    TARGET	Average	4	51539607552	6291456		275.42	187.13	22843.46	0	0.01	read	8192
    TARGET	Average	8	51539607552	6291456		206.27	249.86	30500.83	0	0.03	read	8192
    TARGET	Average	16	51539607552	6291456		177.18	290.88	35508.34	0	0.08	read	8192
    TARGET	Average	32	51539607552	6291456		165.4	311.61	38037.68	0	0.18	read	8192
    TARGET	Average	64	51539607552	6291456		162.51	317.14	38713.44	0	0.37	read	8192
    TARGET	Average	128	51539607552	6291456		163.14	315.92	38564.38	0	0.8	read	8192
    TARGET	Average	256	51539607552	6291456		163.14	315.92	38564.66	0	2.37	read	8192
    TARGET	Average	4	51539607552	3145728		186.27	276.69	16887.81	0	0.01	read	16384
    TARGET	Average	8	51539607552	3145728		154.47	333.66	20365.24	0	0.02	read	16384
    TARGET	Average	16	51539607552	3145728		141.71	363.7	22198.17	0	0.05	read	16384
    TARGET	Average	32	51539607552	3145728		136.34	378.02	23072.76	0	0.1	read	16384
    TARGET	Average	64	51539607552	3145728		134.09	384.36	23459.45	0	0.22	read	16384
    TARGET	Average	128	51539607552	3145728		133.36	386.46	23587.77	0	0.46	read	16384
    TARGET	Average	256	51539607552	3145728		133.36	386.47	23588.3		0	1.53	read	16384
    TARGET	Average	4	51539607552	1572864		135.71	379.77	11589.64	0	0.01	read	32768
    TARGET	Average	8	51539607552	1572864		125.95	409.22	12488.43	0	0.01	read	32768
    TARGET	Average	16	51539607552	1572864		122.1	422.12	12882.15	0	0.03	read	32768
    TARGET	Average	32	51539607552	1572864		120.48	427.77	13054.61	0	0.06	read	32768
    TARGET	Average	64	51539607552	1572864		119.95	429.66	13112.26	0	0.13	read	32768
    TARGET	Average	128	51539607552	1572864		120.01	429.47	13106.26	0	0.27	read	32768
    TARGET	Average	256	51539607552	1572864		120.03	429.4	13104.1		0	0.94	read	32768
    As can be seen, for 8KiB the IOPS range is from 22k to 38K here, while for the dual 8xRAID 0 (16 disks total) the range was 20k to 38k, which is nearly the same. However for 2KiB it's 29k to 46k here, while the earlier dual setup gave me 38k to 54k. Puzzling...

    Quote Originally Posted by stevecs View Post
    Did you run xdd against a couple of ram disk partitions for comparison? Also, what file system do you have on there, or are you testing against the raw device? If it's xfs, what is your ag count & size?
    Still haven't done the ram disk test, but I hope to be able to do that soon. There's currently some write test running and it doesn't seem to be ending anytime soon. The file system used for all tests is JFS.

  12. #37
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    That is strange. Keep an eye on your queue depth to the base arrays (when you're doing plaid) and see if it looks the same w/ bm_flash as what you saw w/ xdd (one base array taking 10x the queue). If it's lop-sided like that, it could mean that something is mis-aligned (allocation groups et al; w/ jfs you don't really have a way to force AG alignment, so what is the ag size (jfs_tune -l <device>)?), or something w/ lvm. If it only happens w/ xdd then we can narrow it down to the tool. I don't really want to put oracle's orion toolset into the mix here as it looks like a much lower-level problem. When I get back from the office I'll check what's going on w/ the patched xdd as well.
    Last edited by stevecs; 03-13-2009 at 06:38 AM. Reason: clarity (needed another mt. dew)


  13. #38
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    Quote Originally Posted by stevecs View Post
    That is strange.
    I'm encountering some more strange behavior. I've been ignoring writes a little until now, but started testing them again today. A single test pass was still running after some 5(!) hours; the total amount of data to be written would have been 16 GB. iostat showed me this:

    Code:
    Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
    sda             276.67         0.00         0.27          0          0
    sdb               0.00         0.00         0.00          0          0
    sdc               0.00         0.00         0.00          0          0
    sdc1              0.00         0.00         0.00          0          0
    sdc2              0.00         0.00         0.00          0          0
    sdd               0.00         0.00         0.00          0          0
    dm-0              0.00         0.00         0.00          0          0
    dm-1              0.00         0.00         0.00          0          0
    dm-2              0.00         0.00         0.00          0          0
    I can imagine that the onboard cache on the Areca controller sometimes needs some time to empty itself, but speeds were consistently this low for the entire duration.

    I canceled the test and instead tested with 16MB and 128MB of data to be written per pass.

    The first pass of the following test was bizarrely slow again, but the second pass 'suddenly' picked up speed:

    Code:
                         T  Q       Bytes      Ops    Time          Rate      IOPS     Latency    %CPU  OP_Type    ReqSize    
    TARGET   PASS0001    0 32     134217728    32768    99.731     1.346     328.56    0.0030     0.02   write        4096
    TARGET   PASS0002    0 32     134217728    32768     3.193    42.037     10262.91  0.0001     0.09   write        4096
    TARGET   PASS0003    0 32     134217728    32768     1.700    78.930     19269.97  0.0001     0.16   write        4096
    TARGET   PASS0004    0 32     134217728    32768     1.347    99.642     24326.67  0.0000     0.25   write        4096
    TARGET   PASS0005    0 32     134217728    32768     3.774    35.562     8682.02   0.0001     0.09   write        4096
    TARGET   Average     0 32     671088640   163840   108.036     6.212     1516.53   0.0007     0.02   write        4096
             Combined    1 32     671088640   163840   109.000     6.157     1503.12   0.0007     0.01   write        4096
    Maybe the controller happened to have a backlog of writes to be flushed to the actual disks, and it just happened that after the first pass the cache was available again to accept new data at a decent speed, but I don't think the cache should work that way, does it?

    I'm running the script you gave me with one loop added for the block size, and now the array seems to be writing at a rather normal speed again, although a lot of the time the figures differ greatly between passes. E.g.:

    Code:
                           T  Q      Bytes         Ops      Time      Rate      IOPS       Latency    %CPU   OP_Type     ReqSize     
    TARGET   PASS0001    0 128     134217728     8192     3.236    41.474     2531.35    0.0004     5.74   write       16384
    TARGET   PASS0002    0 128     134217728     8192     2.080    64.541     3939.27    0.0003     0.80   write       16384
    TARGET   PASS0003    0 128     134217728     8192     0.429   312.631     19081.47   0.0001     0.81   write       16384
    TARGET   PASS0004    0 128     134217728     8192     0.402   333.493     20354.77   0.0000     0.89   write       16384
    TARGET   PASS0005    0 128     134217728     8192     0.372   360.663     22013.10   0.0000     1.27   write       16384
    TARGET   Average     0 128     671088640    40960     6.455   103.968     6345.68    0.0002     0.80   write       16384
             Combined    1 128     671088640    40960     6.455   103.968     6345.68    0.0002     0.49   write       16384
    This looks a lot like the 'gaps' I've seen in the results given by bm-flash. Namely, bm-flash doesn't test a fixed total amount of megabytes, but tests for some time, and when that time elapses it moves on to the next test. I can't really wrap my head around this. I tried the 8KiB request size a couple of hundred times, and it was consistently fast. I won't dump the hundreds of lines of results of that here, but this is a snapshot from it:

    Code:
    TARGET   PASS0023    0 32     134217728    16384     0.597   224.723     27431.99    0.0000     0.25   write        8192 
    TARGET   PASS0024    0 32     134217728    16384     0.602   223.133     27237.94    0.0000     0.36   write        8192 
    TARGET   PASS0025    0 32     134217728    16384     0.604   222.281     27133.87    0.0000     0.21   write        8192 
    TARGET   PASS0026    0 32     134217728    16384     0.597   224.898     27453.31    0.0000     0.24   write        8192 
    TARGET   PASS0027    0 32     134217728    16384     0.599   224.087     27354.35    0.0000     0.31   write        8192 
    TARGET   PASS0028    0 32     134217728    16384     0.608   220.632     26932.66    0.0000     0.25   write        8192 
    TARGET   PASS0029    0 32     134217728    16384     0.598   224.550     27410.87    0.0000     0.38   write        8192 
    TARGET   PASS0030    0 32     134217728    16384     0.588   228.214     27858.21    0.0000     0.27   write        8192 
    TARGET   PASS0031    0 32     134217728    16384     0.600   223.534     27286.88    0.0000     0.23   write        8192 
    TARGET   PASS0032    0 32     134217728    16384     0.598   224.319     27382.65    0.0000     0.33   write        8192 
    TARGET   PASS0033    0 32     134217728    16384     0.611   219.798     26830.78    0.0000     0.28   write        8192 
    TARGET   PASS0034    0 32     134217728    16384     0.606   221.305     27014.77    0.0000     0.24   write        8192 
    TARGET   PASS0035    0 32     134217728    16384     0.607   221.042     26982.61    0.0000     0.45   write        8192 
    TARGET   PASS0036    0 32     134217728    16384     0.593   226.310     27625.74    0.0000     0.38   write        8192 
    TARGET   PASS0037    0 32     134217728    16384     0.587   228.826     27932.83    0.0000     0.30   write        8192 
    TARGET   PASS0038    0 32     134217728    16384     0.601   223.478     27280.02    0.0000     0.25   write        8192 
    TARGET   PASS0039    0 32     134217728    16384     0.595   225.523     27529.70    0.0000     0.25   write        8192 
    TARGET   PASS0040    0 32     134217728    16384     0.597   224.821     27443.98    0.0000     0.32   write        8192 
    TARGET   PASS0041    0 32     134217728    16384     0.597   224.741     27434.24    0.0000     0.32   write        8192 
    TARGET   PASS0042    0 32     134217728    16384     0.605   221.809     27076.29    0.0000     0.21   write        8192 
    TARGET   PASS0043    0 32     134217728    16384     0.603   222.421     27151.00    0.0000     0.38   write        8192 
    TARGET   PASS0044    0 32     134217728    16384     0.602   222.932     27213.32    0.0000     0.29   write        8192 
    TARGET   PASS0045    0 32     134217728    16384     0.606   221.631     27054.61    0.0000     0.29   write        8192 
    TARGET   PASS0046    0 32     134217728    16384     0.591   227.098     27721.89    0.0000     0.45   write        8192 
    TARGET   PASS0047    0 32     134217728    16384     0.599   223.944     27336.92    0.0000     0.39   write        8192
    Then I switched to the 2KiB request size again, and things slowed down again:

    Code:
    TARGET   PASS0001    0 32     134217728    65536     7.288    18.415     8991.82    0.0001     0.13   write        2048 
    TARGET   PASS0002    0 32     134217728    65536     7.002    19.169     9359.76    0.0001     0.06   write        2048 
    TARGET   PASS0003    0 32     134217728    65536     2.404    55.823     27257.08    0.0000     0.25   write        2048 
    TARGET   PASS0004    0 32     134217728    65536    31.388     4.276     2087.93    0.0005     0.02   write        2048 
    TARGET   PASS0005    0 32     134217728    65536    76.843     1.747     852.86    0.0012     0.01   write        2048 
    TARGET   PASS0006    0 32     134217728    65536    75.700     1.773     865.73    0.0012     0.01   write        2048 
    TARGET   PASS0007    0 32     134217728    65536    42.965     3.124     1525.34    0.0007     0.01   write        2048
    Thereafter I switched to 8KiB again; it first started slow and then was fast again:

    Code:
     
    TARGET   PASS0001    0 32     134217728    16384    20.080     6.684     815.95    0.0012     0.03   write        8192 
    TARGET   PASS0002    0 32     134217728    16384     3.438    39.035     4765.01    0.0002     0.18   write        8192 
    TARGET   PASS0003    0 32     134217728    16384     0.662   202.631     24735.23    0.0000     0.37   write        8192 
    TARGET   PASS0004    0 32     134217728    16384     0.626   214.401     26171.94    0.0000     0.24   write        8192 
    TARGET   PASS0005    0 32     134217728    16384     0.603   222.636     27177.26    0.0000     0.42   write        8192 
    TARGET   PASS0006    0 32     134217728    16384     0.608   220.797     26952.78    0.0000     0.33   write        8192 
    TARGET   PASS0007    0 32     134217728    16384     0.603   222.688     27183.57    0.0000     0.23   write        8192 
    TARGET   PASS0008    0 32     134217728    16384     0.603   222.475     27157.57    0.0000     0.27   write        8192 
    TARGET   PASS0009    0 32     134217728    16384     0.616   218.035     26615.59    0.0000     0.21   write        8192
    It almost looks like alternating writes between block sizes causes some serious (internal) 'thrashing' effect to take place. I'm not sure yet though...

    When I start the test for writing a total of 1024 MB, it also starts slow. This creeps me out a little: how would the RAID controller 'know' in advance that 1024 MB is going to be written? I take it that xdd writes this 1024 MB totally at random within the 64 GB file on which it operates. This is the result for writing 1024 MB in total, which is way different from writing 128 MB a couple of hundred times right after each other:

    Code:
    TARGET   PASS0001    0 32    1073741824   131072   352.669     3.045     371.66    0.0027     0.00   write        8192 
    TARGET   PASS0002    0 32    1073741824   131072    19.656    54.627     6668.38    0.0001     0.06   write        8192 
    TARGET   PASS0003    0 32    1073741824   131072    20.078    53.478     6528.10    0.0002     0.05   write        8192 
    TARGET   PASS0004    0 32    1073741824   131072    26.033    41.246     5034.89    0.0002     0.03   write        8192 
    TARGET   PASS0005    0 32    1073741824   131072    21.058    50.990     6224.40    0.0002     0.04   write        8192 
    TARGET   PASS0006    0 32    1073741824   131072    21.431    50.103     6116.08    0.0002     0.04   write        8192 
    TARGET   PASS0007    0 32    1073741824   131072    21.174    50.712     6190.37    0.0002     0.04   write        8192 
    TARGET   PASS0008    0 32    1073741824   131072    30.574    35.119     4287.02    0.0002     0.03   write        8192 
    TARGET   PASS0009    0 32    1073741824   131072    21.966    48.882     5966.98    0.0002     0.03   write        8192 
    TARGET   PASS0010    0 32    1073741824   131072    19.553    54.916     6703.58    0.0001     0.05   write        8192 
    TARGET   Average     0 32   10737418240   1310720   542.097    19.807     2417.87    0.0004     0.01   write        8192 
             Combined    1 32   10737418240   1310720   554.000    19.382     2365.92    0.0004     0.01   write        8192
    (p.s. I hope it doesn't bother anyone that I'm 'dumping' so many figures here)
    Last edited by henk53; 03-13-2009 at 07:59 AM.

  14. #39
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    I wonder if we're seeing some kind of profiling or wear leveling of the ssd's themselves with that result. The controller doesn't really have a 'backlog'; it does have cache, yes, and will have some items that it needs to write out to the disks, but the cache is not that large in and of itself, nor would it take anywhere near that long unless the drive itself told it to hold off (taking too long to write, cell updates, et al). To rule it out completely (ie, see if it's the SSDs), just turn off the write cache or remove the ram from the controller (you have to reboot for it to take effect); you would then be seeing the real speeds of the drives.

    Also we can put in a delay between passes w/ xdd, something large enough to handle any outstanding i/o, like 60 secs or whatever (-delay <seconds>); it happens between passes and should clear out any delayed writes. Also (xdd should be doing this automatically, as jfs does not have barrier support) you could try -nobarrier w/ xdd as well to let all threads have free rein.
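    So tacked onto the write runs it would look roughly like this (the target path is a placeholder, the rest mirrors what you've been running):

    Code:
    # 60 s pause between passes to let the controller drain, and no barriers
    xdd -verbose -op write -target /ssd/xddtest.dat -blocksize 512 -reqsize 16 \
        -mbytes 128 -passes 10 -dio -seek random -queuedepth 32 \
        -delay 60 -nobarrier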

    I've looked at the jfs layout specs (http://jfs.sourceforge.net/project/pub/jfslayout.pdf ) and don't really see anything there that would (except by chance) end up on one disk as opposed to another, with the exception of the built-in journal. You may want to put the journal (128MB max) on another physical disk, not on the ssd array; it could be another stand-alone ssd or whatever.

    Code:
    mkfs.jfs -J journal_dev /dev/external_journal             # creates a journal on device /dev/external_journal
    mkfs.jfs -J device=/dev/external_journal /dev/jfs_device  # attaches the external journal to the existing file system on /dev/jfs_device
    # or, if you're creating the array filesystem at the same time as the journal:
    mkfs.jfs -j /dev/external_journal /dev/jfs_device
    Did you ever send an e-mail over to areca to ask about any other types of limits in the driver?
    Last edited by stevecs; 03-13-2009 at 09:01 AM.


  15. #40
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    Quote Originally Posted by stevecs View Post
    Did you run xdd against a couple ram disk partitions for comparison?
    I now have some results from xdd against the ram disks. I created a very basic LVM striped device, using:

    Code:
    pvcreate /dev/ram0 /dev/ram1
    vgcreate lvram /dev/ram0 /dev/ram1
    lvcreate -i2 -I128 -L5G -n ramdisk lvram
    mkfs.jfs /dev/lvram/ramdisk
    mount /dev/lvram/ramdisk /ramdiskstr/ -o noatime,nodiratime
    I let xdd work on a 4GB file, testing an amount of data equal to ~16GB per pass:

    Code:
    dd if=/dev/zero of=$XDDTARGET bs=1M count=4096
    sync ; sleep 5
    $XDD -verbose -op read -target ${XDDTARGET} -blocksize 512 -reqsize $RQ -mbytes 16384 -passes 5 -dio -seek random -seek range 4000000 -queuedepth $QD
    The results are this:

    Code:
                            T  Q   Bytes         Ops        Time     Rate        IOPS         Latency    %CPU   OP_Type     ReqSize    
    TARGET   PASS0001    0 128   17179869184   2097152    17.714   969.824     118386.69    0.0000     2.73   read        8192
    TARGET   PASS0002    0 128   17179869184   2097152    18.182   944.893     115343.37    0.0000     2.16   read        8192
    TARGET   PASS0003    0 128   17179869184   2097152    18.651   921.124     112441.89    0.0000     2.21   read        8192
    TARGET   PASS0004    0 128   17179869184   2097152    18.490   929.159     113422.79    0.0000     2.20   read        8192
    TARGET   PASS0005    0 128   17179869184   2097152    18.171   945.431     115409.08    0.0000     2.28   read        8192
    TARGET   Average     0 128   85899345920   10485760   87.133   985.837     120341.44    0.0000     2.14   read        8192
    While iostat -mx dm-3 ram0 ram1 15 showed this:

    Code:
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.62    0.00   14.41    0.00    0.00   84.97
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
    ram0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    ram1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    dm-3              0.00     0.00 115517.87    0.13   902.48     0.00    16.00     0.38    0.00   0.00  37.97
    I do wonder why there are no traffic reports for ram0 and ram1, but I guess this is because of the way these devices are implemented.

    For comparison, the same xdd test on a single (non-lvm) ram disk with an ext2 fs (a JFS-formatted ram disk couldn't be mounted on my system):

    Code:
    mke2fs -m 0 /dev/ram3
    mount /dev/ram3 /ramdisksing/
    gave the following results:

    Code:
                            T  Q   Bytes         Ops        Time     Rate        IOPS         Latency    %CPU   OP_Type     ReqSize 
    TARGET   PASS0001    0 128   17179869184   2097152     17.537    979.620     119582.57    0.0000     3.06   read        8192
    TARGET   PASS0002    0 128   17179869184   2097152     18.119    948.153     115741.34    0.0000     2.26   read        8192
    TARGET   PASS0003    0 128   17179869184   2097152     18.365    935.447     114190.29    0.0000     2.39   read        8192
    TARGET   PASS0004    0 128   17179869184   2097152     17.933    957.981     116941.10    0.0000     2.27   read        8192
    TARGET   PASS0005    0 128   17179869184   2097152     18.004    954.229     116483.08    0.0000     2.21   read        8192
    TARGET   Average     0 128   85899345920   10485760    85.710   1002.204     122339.31    0.0000     2.20   read        8192
    On earlier tests with striped and single ram disks, bm-flash reported 190k resp. 200k. My colleague set up the RAM disks back then, so I have to ask how exactly he did that. In xdd, however, there seems to be no difference in performance between the two. Since /dev/ram3 didn't show up in iostat, there's nothing to report from there.

    I'll look at the performance of two RAID6 arrays on two 1231s next, and will try to put the journal on a completely different disk (there is a 1680 in the machine with 4 SSDs; I'll try to put the journal there. Alternatively there are also 2 SAS HDDs on that 1680 that could hold it).

  16. #41
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Ok, the xdd against ram is good as it shows it can calculate more than the iops we were seeing. I do not like that the averages of xdd are incorrect from your posting (iops, the average is higher than the individual passes, which unless there is some type of new math employed, is wrong. )

    As for ram* not showing up in iostat and a JFS ramdisk not being mountable: yes, there are bugs in both functions under Linux. I've noticed that before but didn't care enough at the time to send an e-mail to the lists.

    Move the journal to another SSD or spindle; the max size is 128M so it doesn't need to be large (the default when in-line is 32M, which is what you have now). If you want to rule that out completely you could try putting it on a 64M ramX disk (not good for production purposes obviously, but it gets it out of the way right now and doesn't introduce another delay point, as we've tested that RAM is faster than your goal). A sketch is below.
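
    If it helps, a minimal sketch of moving the journal out of the way, assuming I'm remembering the jfsutils syntax right (mkfs.jfs -J journal_dev to format an external journal, jfs_tune -J device= to attach it); the device names and mount point here are placeholders, not your actual setup:

    Code:
    umount /ssd                                   # filesystem must be offline
    mkfs.jfs -J journal_dev /dev/ram5             # format a spare 64M ram disk as an external JFS journal
    jfs_tune -J device=/dev/ram5 /dev/lvssd/data  # point the existing filesystem at the new journal
    mount -o noatime,nodiratime /dev/lvssd/data /ssd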

    I probably wouldn't play too much with RAID-6 here, as that would be worse than RAID-5 for IOPS even if you're doing just 10% writes. Seriously, I would look at doing either a bunch of RAID-1's and LVM'ing them at the OS level, or RAID-10 on the arecas (use an even number of disks; you don't want RAID-1E). And remember your offset: with RAID-10 your 'data disks' are half the drives in your array, so with 2x6 that would be 3 drives each; with an 8KiB stripe size that is a 24KiB data stripe width, so the LVM default of 192KiB is fine.


  17. #42
    Registered User
    Join Date
    Feb 2006
    Location
    Germany (near Ramstein)
    Posts
    421
    How do you combine 2 Arecas? (Please give a step by step.)

  18. #43
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    @FEAR. It's hard to give a step-by-step for all situations; as you've probably seen even in this thread, there are a lot of variables and they all have an effect on the setup of a system. As a high-level, simple process, something like this for Linux:

    • Create an array on each areca controller
      Simple method here would be to have areca arrays of the same size
    • Your OS should see XX physical disks
      where XX is the number of controllers in your system (max 4 for areca). You can do multiple arrays on a single controller, but I won't go into that; it's the same basic principle though.
    • Under linux you have two basic options
      • mdadm (meta disk driver)
        • don't create any partition tables on the areca arrays (ie. /dev/sda only, no /dev/sda1 or whatever)
        • mdadm --create --help (displays the create help menu)
        • mdadm --create /dev/md0 --level=0 --raid-devices=XX --chunk=XX </dev/arecaraid0> </dev/arecaraid1> </dev/arecaraidX>
          set your chunk size to be a multiple of your areca data stripe width (drives in the array minus parity disks, TIMES your stripe size) if possible (best performance for writes), or at least equal to your areca stripe size.
        • put your file system of choice on your new meta-device. EXT3 caps at 8TiB (forget what others on the net have said; if you're going beyond this, or think you may, don't use EXT3, use JFS or XFS). With XFS, be careful if you have any type of power outages or system freezes, as you have a higher chance of losing data.
        • mount and use your filesystem
        • If you set your chunk size to your data stripe width you will have some pain growing your file system if you add drives to the areca array (your data stripe width would change, and you can't change that at the mdadm level without a reformat). However, you could change the size of your drives (ie, 500GB to 1TB, or 1TB to 2TB), keeping the same number of drives but changing their sizes, in which case your data stripe width remains constant. If you use a chunk size equal to your areca stripe size you can grow the array more easily, but you lose some write performance (assuming parity RAIDs here). To grow you would:
          • add/replace drives to your areca controller
          • If you added more drives, expand your raidset
          • then modify your volumeset; you have the option to change the volumeset size, which keeps your current data in place but makes the 'disk' larger
          • reboot so that linux sees the new areca volume sizes
          • mdadm --grow /dev/md0 (grows your metadisk array at the OS level)
          • Then grow the file system (for JFS: mount -o remount,resize </mntpoint>; for XFS: xfs_growfs </mntpoint>); you do this with the file system mounted
          • you're done
      • LVM (Logical volume manager)
        • don't create any partition tables on the areca arrays (ie. /dev/sda only, no /dev/sda1 or whatever)
        • calculate your starting offset. LVM by default uses 192KiB for metadata at the beginning of each physical volume (your areca arrays here). You want to find an offset that is the smallest multiple of your areca data stripe width (data drives * stripe size) that is greater than or equal to 192KiB (the LVM metadata). For example, a RAID-6 array of 10 drives with a 64KiB stripe size has 10-2 = 8 data drives; 8*64 = 512KiB, which is > 192KiB, so you have to pad your LVM starting point out to 512KiB.
        • pvcreate --metadatasize 511K --metadatacopies=2 <arecaraid0> <arecaraid1> <arecaraidX>
        • pvscan -v (shows volumes that were added)
        • pvs -o+pe_start (shows the starting offset of the volumes; this should be 512KiB in the above example. If not, re-create with a different offset; LVM 'pads' to a 64KiB boundary so it's a little fuzzy)
        • vgcreate --physicalextentsize X <volumegroupname> </dev/arecaraid0> </dev/arecaraid1> </dev/arecaraidX> (pick an extent size so that it's reasonable anticipating future growth. You're not limited to 65535 w/ LVM2, default size is 4MiB. I set mine to 1GiB in size as I plan to grow to 100TiB or so with the current array, and will probably move to 4GiB when I rebuild at that point)
        • lvcreate --stripes X --stripesize X --extents X --name <lvname> <vgname> </dev/arecaraid0> </dev/arecaraid1> </dev/arecaraidX> (--stripes here is the number of physical volumes to stripe across (raid cards/areca arrays); --stripesize needs to be 2^n and ideally should be the data stripe width of your areca array, assuming that is 2^n, otherwise it should be your base array stripe size; --extents is how big you want your logical volume to be. You list the areca physical volumes to tell LVM which underlying devices it should pull extents from for your stripe; this matters more if you don't want to stripe across all physical disks, for example, but it's good practice to enumerate what you want to happen explicitly)
        • put your file system of choice on your new meta-device. EXT3 caps at 8TiB (forget what others on the net have said; if you're going beyond this, or think you may, don't use EXT3, use JFS or XFS). With XFS, be careful if you have any type of power outages or system freezes, as you have a higher chance of losing data.
        • mount and use your filesystem
        • If you set your stripe size to your data stripe width you will have some pain growing your file system if you add drives to the areca array (your data stripe width would change, and you can't change that at the LVM level without a reformat). However, you could change the size of your drives (ie, 500GB to 1TB, or 1TB to 2TB), keeping the same number of drives but changing their sizes, in which case your data stripe width remains constant. If you use a stripe size equal to your areca stripe size you can grow the array more easily, but you lose some write performance (assuming parity RAIDs here). To grow you would:
          • add/replace drives to your areca controller
          • If you added more drives, expand your raidset
          • then modify your volumeset; you have the option to change the volumeset size, which keeps your current data in place but makes the 'disk' larger
          • reboot so that linux sees the new areca volume sizes
          • pvresize -v -d </dev/arecaraidX> (do this for each array that you expanded on the arecas; this tells LVM to re-check the physical disk)
          • pvscan -v
          • vgdisplay <vgname>
          • lvresize --extents X </dev/logicalvolume> (this expands your logical volume by X extents)
          • Then grow the file system (for JFS: mount -o remount,resize </mntpoint>; for XFS: xfs_growfs </mntpoint>); you do this with the file system mounted
          • you're done


    Ok, a little long-winded I guess, but that's an overview. A condensed worked example of the LVM path follows below.
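
    To make the LVM branch concrete, a condensed sketch with made-up values: two areca RAID-6 arrays of 10 drives each with a 64KiB stripe size (so a data stripe width of 8*64 = 512KiB), showing up as /dev/sdb and /dev/sdc. The device, volume and mount point names are placeholders, not anyone's real setup:

    Code:
    pvcreate --metadatasize 511K --metadatacopies=2 /dev/sdb /dev/sdc
    pvs -o+pe_start                               # data area should start at 512KiB
    vgcreate --physicalextentsize 1G vg_areca /dev/sdb /dev/sdc
    lvcreate --stripes 2 --stripesize 512 --extents 100%FREE \
             --name lv_data vg_areca /dev/sdb /dev/sdc
    mkfs.jfs -q /dev/vg_areca/lv_data             # or XFS; EXT3 caps at 8TiB
    mount -o noatime,nodiratime /dev/vg_areca/lv_data /data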
    Last edited by stevecs; 04-20-2009 at 04:11 AM.


  19. #44
    Registered User
    Join Date
    Feb 2006
    Location
    Germany (near Ramstein)
    Posts
    421
    Phew - English is not my element

    Thx

  20. #45
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    @stevecs, that's a handy explanation. People are probably going to reference that forum post for some time.

    Last week, while working from a remote location, I locked myself out of the machine I'm testing when I gave the reboot command. It had something to do with the enumeration of the PCIe devices and the machine dropping into maintenance mode after a reboot. It took some time to find a fix for this (using udev).

    Anyway, my co-worker did some interesting further testing. He created a 2x6xRAID6@1231 config with stripe sizes for the array and LVM of 8KiB/32KiB and 128KiB/512KiB respectively. The results are rather interesting, as this is a more apples-to-apples comparison of a single controller vs. dual controllers (the previous comparison I posted was 1x6xRAID6@1231 vs 2x8xRAID0@1231+1680). Without further ado, these are the results:

    2x6xRAID6 2x1231ML stripe size 8k, lvm2 stripe size 32k
    Code:
                         T   Q   Bytes         Ops        Time       Rate       IOPS        Latency    %CPU   OP_Type     BlockSize
    TARGET   Average     0   4   51539607552   50331648   1222.791   42.149     41161.29    0.0000     0.02   read        1024 
    TARGET   Average     0   8   51539607552   50331648   798.801    64.521     63009.03    0.0000     0.09   read        1024 
    TARGET   Average     0  16   51539607552   50331648   777.554    66.284     64730.74    0.0000     0.23   read        1024
    TARGET   Average     0  32   51539607552   50331648   783.667    65.767     64225.83    0.0000     0.46   read        1024 
    TARGET   Average     0  64   51539607552   50331648   790.628    65.188     63660.31    0.0000     0.93   read        1024
    TARGET   Average     0 128   51539607552   50331648   800.083    64.418     62908.07    0.0000     1.90   read        1024 
    TARGET   Average     0 256   51539607552   50331648   809.726    63.651     62158.89    0.0000     3.94   read        1024 
    
    TARGET   Average     0   4   51539607552   25165824   667.203    77.247     37718.41    0.0000     0.02   read        2048 
    TARGET   Average     0   8   51539607552   25165824   423.184   121.790     59467.75    0.0000     0.08   read        2048 
    TARGET   Average     0  16   51539607552   25165824   395.697   130.250     63598.68    0.0000     0.23   read        2048
    TARGET   Average     0  32   51539607552   25165824   397.812   129.558     63260.62    0.0000     0.46   read        2048 
    TARGET   Average     0  64   51539607552   25165824   401.566   128.347     62669.24    0.0000     0.93   read        2048 
    TARGET   Average     0 128   51539607552   25165824   404.788   127.325     62170.33    0.0000     1.91   read        2048 
    TARGET   Average     0 256   51539607552   25165824   408.990   126.017     61531.60    0.0000     3.94   read        2048 
    
    TARGET   Average     0   4   51539607552   12582912   385.204   133.798     32665.58    0.0000     0.02   read        4096 
    TARGET   Average     0   8   51539607552   12582912   240.319   214.463     52359.18    0.0000     0.07   read        4096
    TARGET   Average     0  16   51539607552   12582912   202.105   255.014     62259.27    0.0000     0.22   read        4096
    TARGET   Average     0  32   51539607552   12582912   203.047   253.831     61970.52    0.0000     0.46   read        4096
    TARGET   Average     0  64   51539607552   12582912   204.619   251.881     61494.38    0.0000     0.92   read        4096
    TARGET   Average     0 128   51539607552   12582912   207.224   248.714     60721.23    0.0000     1.90   read        4096 
    TARGET   Average     0 256   51539607552   12582912   208.365   247.353     60388.91    0.0000     3.95   read        4096 
     
    TARGET   Average     0   4   51539607552   6291456   237.350   217.146      26507.08    0.0000     0.01   read        8192 
    TARGET   Average     0   8   51539607552   6291456   150.599   342.230      41776.13    0.0000     0.06   read        8192 
    TARGET   Average     0  16   51539607552   6291456   113.680   453.374      55343.46    0.0000     0.19   read        8192 
    TARGET   Average     0  32   51539607552   6291456   109.713   469.768      57344.75    0.0000     0.45   read        8192 
    TARGET   Average     0  64   51539607552   6291456   110.428   466.725      56973.29    0.0000     0.92   read        8192
    TARGET   Average     0 128   51539607552   6291456   111.486   462.295      56432.49    0.0000     1.88   read        8192 
    TARGET   Average     0 256   51539607552   6291456   112.282   459.018      56032.53    0.0000     3.94   read        8192
    
    TARGET   Average     0   4   51539607552   3145728   147.691   348.969      21299.38    0.0000     0.01   read       16384
    TARGET   Average     0   8   51539607552   3145728   101.245   509.059      31070.51    0.0000     0.04   read       16384
    TARGET   Average     0  16   51539607552   3145728    80.464   640.532      39094.97    0.0000     0.13   read       16384 
    TARGET   Average     0  32   51539607552   3145728    71.336   722.494      44097.56    0.0000     0.33   read       16384 
    TARGET   Average     0  64   51539607552   3145728    67.403   764.651      46670.57    0.0000     0.78   read       16384
    TARGET   Average     0 128   51539607552   3145728    65.747   783.905      47845.74    0.0000     1.72   read       16384
    TARGET   Average     0 256   51539607552   3145728    65.457   787.383      48058.06    0.0000     3.80   read       16384 
    
    TARGET   Average     0   4   51539607552   1572864    96.325   535.060      16328.73    0.0001     0.01   read       32768
    TARGET   Average     0   8   51539607552   1572864    74.608   690.802      21081.60    0.0000     0.04   read       32768
    TARGET   Average     0  16   51539607552   1572864    64.893   794.225      24237.83    0.0000     0.09   read       32768 
    TARGET   Average     0  32   51539607552   1572864    60.607   850.391      25951.87    0.0000     0.20   read       32768 
    TARGET   Average     0  64   51539607552   1572864    58.769   876.980      26763.31    0.0000     0.45   read       32768
    TARGET   Average     0 128   51539607552   1572864    58.051   887.829      27094.39    0.0000     0.95   read       32768
    TARGET   Average     0 256   51539607552   1572864    57.957   889.272      27138.42    0.0000     2.54   read       32768
    2x6xRAID6 2x1231ML stripe size 128k, lvm2 stripe size 512k
    Code:
    TARGET   Average     0   4   51539607552   50331648   1220.183   42.239     41249.26    0.0000     0.02   read        1024
    TARGET   Average     0   8   51539607552   50331648   792.691    65.019     63494.66    0.0000     0.09   read        1024
    TARGET   Average     0  16   51539607552   50331648   769.496    66.978     65408.55    0.0000     0.23   read        1024
    TARGET   Average     0  32   51539607552   50331648   778.130    66.235     64682.83    0.0000     0.46   read        1024
    TARGET   Average     0  64   51539607552   50331648   786.188    65.556     64019.84    0.0000     0.93   read        1024
    TARGET   Average     0 128   51539607552   50331648   794.469    64.873     63352.55    0.0000     1.91   read        1024
    TARGET   Average     0 256   51539607552   50331648   801.456    64.307     62800.25    0.0000     3.97   read        1024
    
    TARGET   Average     0   4   51539607552   25165824   661.605    77.901     38037.56    0.0000     0.02   read        2048
    TARGET   Average     0   8   51539607552   25165824   413.202   124.732     60904.37    0.0000     0.09   read        2048
    TARGET   Average     0  16   51539607552   25165824   387.614   132.966     64924.99    0.0000     0.23   read        2048
    TARGET   Average     0  32   51539607552   25165824   390.090   132.122     64512.78    0.0000     0.46   read        2048
    TARGET   Average     0  64   51539607552   25165824   393.227   131.068     63998.13    0.0000     0.94   read        2048
    TARGET   Average     0 128   51539607552   25165824   397.461   129.672     63316.47    0.0000     1.90   read        2048
    TARGET   Average     0 256   51539607552   25165824   401.310   128.428     62709.14    0.0000     3.97   read        2048
    
    TARGET   Average     0   4   51539607552   12582912   384.675   133.982     32710.49    0.0000     0.02   read        4096
    TARGET   Average     0   8   51539607552   12582912   230.493   223.606     54591.28    0.0000     0.07   read        4096
    TARGET   Average     0  16   51539607552   12582912   194.767   264.622     64605.03    0.0000     0.23   read        4096
    TARGET   Average     0  32   51539607552   12582912   195.312   263.884     64424.74    0.0000     0.46   read        4096
    TARGET   Average     0  64   51539607552   12582912   197.903   260.428     63581.05    0.0000     0.94   read        4096
    TARGET   Average     0 128   51539607552   12582912   199.588   258.229     63044.28    0.0000     1.93   read        4096
    TARGET   Average     0 256   51539607552   12582912   201.143   256.234     62557.02    0.0000     4.00   read        4096
    
    TARGET   Average     0   4   51539607552   6291456    246.272   209.279     25546.79    0.0000     0.01   read        8192
    TARGET   Average     0   8   51539607552   6291456    144.297   357.178     43600.85    0.0000     0.05   read        8192
    TARGET   Average     0  16   51539607552   6291456    102.134   504.626     61599.88    0.0000     0.21   read        8192
    TARGET   Average     0  32   51539607552   6291456    100.701   511.809     62476.71    0.0000     0.46   read        8192
    TARGET   Average     0  64   51539607552   6291456    101.461   507.975     62008.70    0.0000     0.94   read        8192
    TARGET   Average     0 128   51539607552   6291456    102.220   504.200     61547.91    0.0000     1.93   read        8192
    TARGET   Average     0 256   51539607552   6291456    103.033   500.223     61062.36    0.0000     4.06   read        8192
    
    TARGET   Average     0   4   51539607552   3145728    175.645   293.431     17909.60    0.0001     0.01   read       16384
    TARGET   Average     0   8   51539607552   3145728    102.537   502.644     30678.98    0.0000     0.04   read       16384
    TARGET   Average     0  16   51539607552   3145728     68.935   747.656     45633.33    0.0000     0.13   read       16384
    TARGET   Average     0  32   51539607552   3145728     55.238   933.050     56948.82    0.0000     0.41   read       16384
    TARGET   Average     0  64   51539607552   3145728     53.488   963.582     58812.37    0.0000     0.95   read       16384
    TARGET   Average     0 128   51539607552   3145728     53.993   954.559     58261.68    0.0000     1.93   read       16384
    TARGET   Average     0 256   51539607552   3145728     53.813   957.749     58456.38    0.0000     4.07   read       16384
    
    TARGET   Average     0   4   51539607552   1572864    137.397   375.113     11447.55    0.0001     0.01   read       32768
    TARGET   Average     0   8   51539607552   1572864     81.315   633.823     19342.73    0.0001     0.02   read       32768
    TARGET   Average     0  16   51539607552   1572864     55.840   922.982     28167.16    0.0000     0.08   read       32768
    TARGET   Average     0  32   51539607552   1572864     45.307  1137.554     34715.39    0.0000     0.21   read       32768
    TARGET   Average     0  64   51539607552   1572864     40.859  1261.413     38495.26    0.0000     0.50   read       32768
    TARGET   Average     0 128   51539607552   1572864     39.018  1320.911     40311.00    0.0000     1.15   read       32768
    TARGET   Average     0 256   51539607552   1572864     38.446  1340.564     40910.76    0.0000     3.16   read       32768
    On top of that, a test run with mdadm was done. This is (for us) just a theoretical test: since our system administrator doesn't like mdadm, we're unable to use it in production. Nevertheless, it shows good results and might be interesting for anyone who is able/allowed to use mdadm:

    2x6xRAID6 2x1231ML stripe size 128k, mdadm stripe width 512k (mdadm -Cv /dev/md0 -l0 -n2 -c512 /dev/sda /dev/sdb)
    Code:
    TARGET   Average     0   4   51539607552   50331648  1191.929    43.240     42227.04    0.0000     0.02   read        1024
    TARGET   Average     0   8   51539607552   50331648   750.436    68.680     67069.86    0.0000     0.08   read        1024
    TARGET   Average     0  16   51539607552   50331648   692.892    74.383     72639.92    0.0000     0.24   read        1024
    TARGET   Average     0  32   51539607552   50331648   699.656    73.664     71937.70    0.0000     0.47   read        1024
    TARGET   Average     0  64   51539607552   50331648   706.650    72.935     71225.74    0.0000     0.96   read        1024
    TARGET   Average     0 128   51539607552   50331648   716.332    71.949     70263.06    0.0000     1.97   read        1024
    TARGET   Average     0 256   51539607552   50331648   729.305    70.669     69013.15    0.0000     4.08   read        1024
    
    TARGET   Average     0   4   51539607552   25165824   648.417    79.485     38811.20    0.0000     0.02   read        2048
    TARGET   Average     0   8   51539607552   25165824   397.888   129.533     63248.49    0.0000     0.08   read        2048
    TARGET   Average     0  16   51539607552   25165824   348.611   147.843     72188.94    0.0000     0.24   read        2048
    TARGET   Average     0  32   51539607552   25165824   350.670   146.975     71765.04    0.0000     0.47   read        2048
    TARGET   Average     0  64   51539607552   25165824   353.287   145.886     71233.27    0.0000     0.96   read        2048
    TARGET   Average     0 128   51539607552   25165824   356.934   144.395     70505.46    0.0000     1.96   read        2048
    TARGET   Average     0 256   51539607552   25165824   360.691   142.891     69771.16    0.0000     4.11   read        2048
    
    TARGET   Average     0   4   51539607552   12582912   378.426   136.195     33250.62    0.0000     0.02   read        4096
    TARGET   Average     0   8   51539607552   12582912   225.067   228.997     55907.47    0.0000     0.06   read        4096
    TARGET   Average     0  16   51539607552   12582912   175.755   293.247     71593.52    0.0000     0.23   read        4096
    TARGET   Average     0  32   51539607552   12582912   176.256   292.413     71389.95    0.0000     0.47   read        4096
    TARGET   Average     0  64   51539607552   12582912   177.682   290.067     70817.02    0.0000     0.96   read        4096
    TARGET   Average     0 128   51539607552   12582912   179.946   286.417     69926.05    0.0000     1.97   read        4096
    TARGET   Average     0 256   51539607552   12582912   181.056   284.661     69497.43    0.0000     4.12   read        4096
    
    TARGET   Average     0   4   51539607552   6291456   243.852   211.356      25800.30    0.0000     0.01   read        8192
    TARGET   Average     0   8   51539607552   6291456   142.302   362.184      44211.91    0.0000     0.05   read        8192
    TARGET   Average     0  16   51539607552   6291456    97.072   530.942      64812.25    0.0000     0.18   read        8192
    TARGET   Average     0  32   51539607552   6291456    90.704   568.215      69362.24    0.0000     0.47   read        8192
    TARGET   Average     0  64   51539607552   6291456    91.424   563.745      68816.51    0.0000     0.96   read        8192
    TARGET   Average     0 128   51539607552   6291456    92.256   558.656      68195.34    0.0000     1.98   read        8192
    TARGET   Average     0 256   51539607552   6291456    93.116   553.500      67565.95    0.0000     4.17   read        8192
    
    TARGET   Average     0   4   51539607552   3145728   174.584   295.214      18018.44    0.0001     0.01   read       16384
    TARGET   Average     0   8   51539607552   3145728   101.933   505.625      30860.88    0.0000     0.03   read       16384
    TARGET   Average     0  16   51539607552   3145728    68.285   754.769      46067.44    0.0000     0.11   read       16384
    TARGET   Average     0  32   51539607552   3145728    54.069   953.211      58179.40    0.0000     0.35   read       16384
    TARGET   Average     0  64   51539607552   3145728    48.850  1055.068      64396.25    0.0000     0.91   read       16384
    TARGET   Average     0 128   51539607552   3145728    48.718  1057.917      64570.11    0.0000     2.00   read       16384
    TARGET   Average     0 256   51539607552   3145728    48.720  1057.880      64567.86    0.0000     4.23   read       16384
    
    TARGET   Average     0   4   51539607552   1572864   137.836   373.921      11411.16    0.0001     0.01   read       32768
    TARGET   Average     0   8   51539607552   1572864    81.174   634.930      19376.53    0.0001     0.02   read       32768
    TARGET   Average     0  16   51539607552   1572864    55.613   926.757      28282.36    0.0000     0.07   read       32768
    TARGET   Average     0  32   51539607552   1572864    44.878  1148.451      35047.94    0.0000     0.19   read       32768
    TARGET   Average     0  64   51539607552   1572864    40.480  1273.218      38855.54    0.0000     0.46   read       32768
    TARGET   Average     0 128   51539607552   1572864    38.650  1333.480      40694.57    0.0000     1.04   read       32768
    TARGET   Average     0 256   51539607552   1572864    38.149  1351.000      41229.25    0.0000     2.97   read       32768
    Another thing to note is that with the various read-ahead settings disabled, sequential performance dropped significantly:

    During the first run (the 8KiB/32KiB test shown above), read ahead was disabled:

    Code:
    dd if=/ssd/S0 of=/dev/zero bs=8k
    8388608+0 records in
    8388608+0 records out
    68719476736 bytes (69 GB) copied, 151.19 s, 455 MB/s
    During the two later runs (the 128KiB/512KiB tests shown above), read ahead was enabled:

    Code:
    dd if=/ssd/S1 of=/dev/zero  bs=8k
    8388608+0 records in
    8388608+0 records out
    68719476736 bytes (69 GB) copied, 68.9137 s, 997 MB/s
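    For reference, toggling the OS-level read-ahead between runs can be done with blockdev (values are in 512-byte sectors; dm-3 is just an example device here, and the areca's own read-ahead cache is a separate setting in its firmware/CLI):

    Code:
    blockdev --getra /dev/dm-3        # show current read-ahead
    blockdev --setra 0 /dev/dm-3      # disable read-ahead (random I/O runs)
    blockdev --setra 4096 /dev/dm-3   # re-enable; 4096 sectors = 2MiB (streaming runs)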
    Of course it's a little hard to compare directly, as two settings were changed between the above tests (stripe size and read ahead), but the numbers are nevertheless impressive. So it might be fair to say that we were a little misled by bm-flash before. There clearly is a performance improvement when using a second controller, but bm-flash doesn't show it; xdd does. This can be explained by the fact that bm-flash always executes each sub-test for exactly 10 seconds and then moves on to the next sub-test, whereas xdd always tests a set amount of MB. If we were really professional testers, we probably would have looked for a third test tool to get more confirmation, but the machine we're testing has to be put into production some day soon. It's a pity IOMeter doesn't run reliably on Linux and that it seems to need a win32 environment for its UI.

    We have a couple more things to test still, like the journal on a separate disk etc. Oh, we also did a very quick test with an LVM stripe involving 3 controllers: 2x6xRAID6@1231 + 1x4xRAID0@1680, using an array stripe size of 8KiB and an LVM stripe size of 32KiB. The 1231's had 4GB each, while the 1680 had 256MB or 512MB, so this config is not really symmetric of course. It didn't give very good results at all, but we only tested it for a few minutes. Perhaps a 3x5 setup with identical RAID levels would have done better.

    Another interesting thing to note is that during the dual-1231 tests, the average queue size in iostat was now completely symmetric:

    Code:
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               1.27    0.00   16.93   31.20    0.00   50.60
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
    sda               0.00     0.00 32037.33    0.00    30.81     0.00     1.97     3.22    0.10   0.03  97.68
    sdb               0.00     0.00 32003.53    0.00    30.78     0.00     1.97     3.23    0.10   0.03  97.41
    sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sdc1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sdc2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    dm-3              0.00     0.00 63062.20    0.00    61.58     0.00     2.00     6.28    0.10   0.02  99.92
    This snapshot was taken during a 1KiB request size / 256 threads xdd test. Another interesting observation here is that the r/s figures of sda and sdb add up almost exactly to that of dm-3. For larger request sizes, however, they didn't add up. I don't have the snapshot at hand here, but I remember something like sda and sdb each still doing 32k, while dm-3 reported ~52k or so.

  21. #46
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Interesting results, and a potential improvement from using larger stripe sizes on the areca with flash, though it's hard to really tell with the caching turned on. If that also holds with caching off, it may be due to the cell/block size on the SSDs being large.

    Assuming we are still not running into a controller or driver bottleneck, and if space is less of a concern than performance, try RAID-10 or a bunch of RAID-1's LVM'ed & striped; a rough sketch is below. See here for an older post of mine that shows some of the benefit:
    http://www.xtremesystems.org/forums/...6&postcount=82
    especially if you are heavy on reads. The duplication of data allows the controller to order requests to drives better to handle overlapped requests.
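
    Something like this, assuming each areca exports several 2-drive RAID-1 volumes and the OS sees them as /dev/sdb through /dev/sde (the names and the 64KiB LVM stripe size are placeholder choices, adjust to taste):

    Code:
    pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde
    vgcreate vg_r1 /dev/sdb /dev/sdc /dev/sdd /dev/sde
    # mirrors have no parity disks, so there is no alignment padding to worry
    # about; just pick an LVM stripe size that fits the workload's request sizes
    lvcreate --stripes 4 --stripesize 64 --extents 100%FREE --name lv_fast vg_r1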

    I like that the dual controller test you have now is showing symmetry in the queues. That's good, though I don't understand why this would not have carried through when using 3 cards, even with different amounts of cache, which would only be in play for write tests (assuming you turned off the read-ahead cache). Unless there was a tuning issue where it was mis-aligned, or other delays going through the southbridge chip (interrupts, bandwidth, et al).

    As for streaming performance decreasing without read cache, that is to be expected and is the correct behavior. Streaming I/O is at the other end of the tuning scale; a system tuned for streaming won't do well for random I/O and vice versa.

    Also, I shot you a PM here. I have a new beta of xdd that should fix the CPU detection routines, with some work done on the averaging problem (most likely in the timer granularity with very high IOPS). It's too big to attach, but I can e-mail it to you (~2MiB).


  22. #47
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    Quote Originally Posted by stevecs View Post
    Assuming we are still not running into a controller or driver bottleneck
    Well, talking of which, my co-worker did notice something that seems like a driver bottleneck:

    Code:
    modinfo jfs
    filename:       /lib/modules/2.6.26-1-amd64/kernel/fs/jfs/jfs.ko
    license:        GPL
    author:         Steve Best/Dave Kleikamp/Barry Arndt, IBM
    description:    The Journaled Filesystem (JFS)
    depends:        nls_base
    vermagic:       2.6.26-1-amd64 SMP mod_unload modversions 
    parm:           nTxBlock:Number of transaction blocks (max:65536) (int)
    parm:           nTxLock:Number of transaction locks (max:65536) (int)
    parm:           commit_threads:Number of commit threads (int)
    This 65536 number is awfully suspicious. If you look at the source code of one of the module's files: http://cvs.cens.ucla.edu/viewcvs/vie....c?rev=1.1.1.6

    There's a fragment there that checks whether the value of nTxLock is larger than 65536 and, if so, sets it straight back. We might get lucky if we simply removed that check and recompiled the module, but we're not sure what can of worms we'd be opening then.
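
    In theory we could at least raise the parameters up to that cap without recompiling, something like the following (the modprobe.d path is the usual convention rather than anything specific to this box, and whether 65536 really is our limiter is exactly what we're unsure about):

    Code:
    modprobe -r jfs                                  # only works with no JFS filesystems mounted
    modprobe jfs nTxLock=65536 nTxBlock=65536
    # make it persistent across reboots:
    echo "options jfs nTxLock=65536 nTxBlock=65536" > /etc/modprobe.d/jfs.conf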

  23. #48
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Yes it is. However, when you did your ram disk tests previously (post #40), were you using the JFS file system on the ram disks? If so, you were getting > 100,000 IOPS, so that would rule the file system out. But it would still leave the areca driver in as a limiter.

    You could set up an array on the areca (say 6 drives, or whatever it takes to get 65K) and then put additional drives off a DIFFERENT controller (on-board or another NON-areca card). That would use a different driver, and if it's a driver issue you should see an increase in performance beyond 65K.


  24. #49
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    Quote Originally Posted by stevecs View Post
    Yes it is, however when you did your ram disk tests previously (post #40) were you using the JFS file system on the ram disks?
    No, in fact I wasn't. I tried to use JFS, but after I put that fs on the RAM disk it could not be mounted anymore. I tried several times, but it just wouldn't work. Eventually I put a simple ext2 fs on the RAM disk and tested with that. My co-worker (Dennis) tried again, but also couldn't get a JFS RAM disk to mount.

  25. #50
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Ok, so we still have the file system (JFS) and possibly the areca driver outstanding. I've been searching for a while here to find references to specific, reproducible tests in that range (100K IOPS), but nothing is generally published, just marketing fluff, which does not do anyone any good.

    At this point, we can do a test then with what we have.
    - Format your array w/ EXT2, as that was tested w/ the RAM disk and we know that that file system does work at > 100K IOPS. It won't be ideal, but it will test whether the limit is in the areca driver or not. If you are still limited to ~65K, it's something in the driver, as that's the only part that is different.

    - Next, to test JFS (as that is probably what you want to use in production, we should vet it): put the number of drives you needed to reach ~65K IOPS on a single areca card (was it 6 or 8 drives?). Then put additional drives on your on-board SATA channels (or on a NON-areca card if you have one). LVM-stripe or mdadm-RAID them together, it doesn't really make a difference, and format w/ JFS; a rough sketch is below. If you get > 65K then you know the file system can handle it (single file system, multiple AGs). If not, then we have two issues to look at here and we would need to send Shaggy (Dave Kleikamp) an e-mail about it.
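
    Something like this, with placeholder device names (/dev/sda being the areca array, /dev/sdc and /dev/sdd the drives on the on-board or other non-areca controller; chunk size and mount point are arbitrary for the test):

    Code:
    mdadm --create /dev/md0 --level=0 --raid-devices=3 --chunk=64 /dev/sda /dev/sdc /dev/sdd
    mkfs.jfs -q /dev/md0
    mount -o noatime,nodiratime /dev/md0 /mnt/jfstest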

