
Thread: Not getting more than 65000 IOPS with xtreme setup

    Quote Originally Posted by stevecs:
    Ok, first, for small (and random) I/O it is surprising that you picked such a large stripe size. I would have assumed you would use 8K or 16K, since you said your request size was fixed at 8K?
    It basically is. The main (actually the only) application that will be using the storage is PostgreSQL. PostgreSQL defaults to an 8KB block size. This can be changed, but that requires recompiling the software.

    We chose 128KB since that's what everybody recommends, including the supplier (ssdisk.eu). Nevertheless, I plan to test with a RAID array stripe size of 4KB and an LVM stripe size of 8KB.
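
    The LVM side of that test is just a matter of re-creating the logical volume with a different stripe size, something along these lines (the VG/LV names and size below are only examples, not our actual setup):

    Code:
    # stripe across 2 PVs with an 8KB stripe size; -i = number of stripes, -I = stripe size in KB
    lvcreate -i 2 -I 8 -L 100G -n pgdata_test datavg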

    Your metadatasize though is strange: if you have a stripe size of 128K, why is your metadata size 250? I could see 255 (i.e., buffer fill to 256, which is 2x128) or even 127 (single stripe), but not 250.
    That's indeed a story of its own. The default metadatasize is 192, so I assume this is the minimum. Since a multiple of 128K was supposed to be the most efficient, the value should be 256. However, as it appears, pvcreate does some automatic rounding up: when you set it to 256, you end up with 320. Using 250 is thus a little trick; when you set it to 250, you end up with exactly 256.

    It appeared others found the same trick:

    LVM likes to allocate 192k for its header information, and 192k is not a multiple of 128k. So if you are creating file systems as logical volumes, and you want those volumes to be properly aligned, you have to tell LVM that it should reserve slightly more space for its meta-data, so that the physical extents that it allocates for its logical volumes are properly aligned. Unfortunately, the way this is done is slightly baroque:

    # pvcreate --metadatasize 250k /dev/sdb2
    Physical volume "/dev/sdb2" successfully created

    Why 250k and not 256k? I can't tell you; sometimes the LVM tools aren't terribly intuitive. However, you can test to make sure that physical extents start at the proper offset by using:
    See: http://thunk.org/tytso/blog/2009/02/...se-block-size/
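
    The check the blog refers to is, as far as I know, just pvs with the pe_start field added; a quick sketch using the same example device as the quote above, where the "1st PE" column should read 256.00K if the trick worked:

    Code:
    # show where the first physical extent starts; should be 256.00K after the trick above
    pvs /dev/sdb2 -o+pe_start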

    This is much better, though not as high as I would expect (how many banks do you have populated in the system?)
    The system is currently equipped with 32GB (which is 16GB less than we used for earlier testing), consisting of 8 modules of 4GB each.

    Code:
    #assuming /dev/ram0 doesn't already exist
    mknod -m 660 /dev/ram0 b 1 0
    mknod -m 660 /dev/ram1 b 1 1
    
    # size the disk the way you want it (8GiB below)
    dd if=/dev/zero of=/dev/ram0 bs=1M count=8K
    dd if=/dev/zero of=/dev/ram1 bs=1M count=8K
    I tried this, but it doesn't really work: that way /dev/ramX is limited to 64MB. I finally did get it to work by setting the RAM disk size in GRUB again (to 5GB) and simply using the existing /dev/ram0, /dev/ram1, etc.
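
    For reference, "setting it in GRUB" means passing the ramdisk_size kernel parameter, which takes a value in KB (so 5GB is 5242880); the kernel image name and root device below are just placeholders, not our actual entry:

    Code:
    # /boot/grub/menu.lst kernel line (image name and root= are placeholders)
    kernel /vmlinuz-2.6.x root=/dev/sda1 ro ramdisk_size=5242880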

    Creating an LVM stripe out of the two RAM disks gave me a slightly lower performance compared to just hitting a single RAM disk.
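
    For those wanting to replicate it, striping two RAM disks with LVM boils down to something like this (the VG/LV names and size are just examples):

    Code:
    # turn both RAM disks into PVs and stripe a logical volume across them
    pvcreate /dev/ram0 /dev/ram1
    vgcreate ramvg /dev/ram0 /dev/ram1
    lvcreate -i 2 -I 128 -L 8G -n ramlv ramvg   # -i = stripes, -I = stripe size in KB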

    I also tried to stripe a RAM disk with an Areca, just for the experiment, and it gave me more than 64k, namely 80k. I'll summarize these findings below:

    Code:
    config                 | IOPS (bs=512B, 10 threads)
    -----------------------+---------------------------
    Single RAM disk        | 200k
    LVM striped RAM disk   | 190k
    2 individual 5xRAID0   | 100k (2x50k)
    LVM striped RAM+SSD    | 80k
    1x 12xRAID0            | 63k
    1x 4xRAID0             | 63k
    2x LVM striped 8xRAID0 | 60k
    2x LVM striped 4xRAID0 | 60k
    This table is for the 512B block size. Larger block sizes give slightly lower IOPS, but the relative performance between the configurations is the same.

    The single RAM disk shows that the system itself (kernel-wise) is able to get more than ~60k IOPS. The test that puts load on two individual devices shows that the hardware itself is capable of at least 100k IOPS (our target). This dual test includes the PCIe buses.

    We also see that a single Areca controller can do at least 63k IOPS. This is either from its cache or from the SSDs, but the controller itself can obviously handle it. Striping an Areca controller with a RAM disk gives 80k, and striping two RAM disks gives 190k, so LVM doesn't limit the IOPS either.

    Still, neither increasing the number of disks on a single controller (e.g. from 4 to 12) nor combining two controllers increases the number of IOPS.

    It could of course well be that this is really due to the probability cause mentioned above, but one would think that with 40 test threads all doing completely random I/O, the chance that all SSDs receive a roughly equal amount of load would be fairly high.
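
    For anyone who wants to reproduce the numbers: the benchmark tool isn't shown in this post, but a roughly equivalent random-read test with fio would look something like this (the device path is a placeholder):

    Code:
    # 40 threads of 512-byte random reads directly against the raw device
    fio --name=randread --filename=/dev/sdX --rw=randread --bs=512 \
        --direct=1 --ioengine=libaio --iodepth=1 --numjobs=40 \
        --runtime=60 --time_based --group_reporting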

    HDD Read Ahead: (assuming you mean HDD Read Ahead Cache? or volume data read ahead cache?)
    It's the HDD Read Ahead Cache. I tried setting it to Disabled, but the results are about the same. The following is for 1x 12xRAID0:

    Code:
    Filling 4G before testing  ...   4096 MB done in 3 seconds (1365 MB/sec).
    
    Read Tests:
    
    Block |   1 thread    |  10 threads   |  40 threads
     Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW
          |               |               |
     512B | 25633   12.5M | 64028   31.2M | 63074   30.7M
       1K | 28228   27.5M | 63843   62.3M | 63060   61.5M
       2K | 29437   57.4M | 63737  124.4M | 62826  122.7M
       4K | 31409  122.6M | 63218  246.9M | 62516  244.2M
       8K | 30017  234.5M | 62355  487.1M | 61603  481.2M
      16K | 25898  404.6M | 61443  960.0M | 61167  955.7M
      32K | 19703  615.7M | 51208 1600.2M | 50890 1590.3M
    One setting that I'm desperate to try is the Queue Depth. Thus far I'm unable to locate where to set it. It's mentioned in the Areca firmware release notes:

    Code:
     Change Log For V1.44 Firmware
    
     2007-4-26
    
         * Add enclosure add/remove event
         * Add Battery Status for SAS (Battery is not monitored)
    
     2007-5-8
    
         * Add Queue Depth Setting for ARC1680 (default 16)
    Does anyone know where this setting is supposed to be set on either the 1680 or 1231?

    edit: I did find the setting "Hdd Queue Depth Setting" on our 1680IX. It was set to 32 (the max). No such setting can be found on the 1231 though.
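
    On a related note (and not the same knob as the Areca firmware setting), the queue depth that the Linux SCSI layer uses per device can be read via sysfs; sdX below is a placeholder:

    Code:
    # per-device queue depth as seen by the Linux SCSI layer
    cat /sys/block/sdX/device/queue_depth
    # the block layer's request queue size is a separate knob
    cat /sys/block/sdX/queue/nr_requests
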
    Last edited by henk53; 03-05-2009 at 10:00 AM.
