
Originally Posted by
stevecs
Ok, first for small (and random I/O) this is surprising why you picked such a large stripe size. I would have assumed you would use 8K or 16K since you said your request size was fixed at 8K?
It basically is. The main (actually the only) application that will be using the storage is PostgreSQL. PostgreSQL defaults to an 8KB block size. This can be changed, but that requires recompiling the software.
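For reference, one way to confirm the compiled-in block size without recompiling anything (a sketch; the data directory path is just an example and will differ per install):
Code:
# pg_controldata ships with PostgreSQL and reports the block size
# the server binary was built with ("Database block size: 8192")
pg_controldata /var/lib/pgsql/data | grep -i 'block size'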
We chose 128KB since that's what everybody recommends, including the supplier (ssdisk.eu). Nevertheless, I plan to also test with a RAID array stripe size of 4KB and an LVM stripe size of 8KB.
Your metadatasize though is strange, if you have a stripe size of 128K why is your meta data size 250? I could see 255 (ie, buffer fill to 256 which is 2x128) or even 127 (single stripe) but not 250.
That's indeed a story of its own. The default metadatasize is 192k, so I assume that's the minimum. Since a multiple of 128k was supposed to be the most efficient, the value should be 256k. However, pvcreate apparently does some automatic rounding up: ask for 256 and you end up with 320. Setting it to 250 is thus a little trick: ask for 250 and you end up with exactly 256.
It appears others have found the same trick:
LVM likes to allocate 192k for its header information, and 192k is not a multiple of 128k. So if you are creating file systems as logical volumes, and you want those volumes to be properly aligned, you have to tell LVM that it should reserve slightly more space for its meta-data, so that the physical extents that it allocates for its logical volumes are properly aligned. Unfortunately, the way this is done is slightly baroque:
# pvcreate --metadatasize 250k /dev/sdb2
Physical volume "/dev/sdb2" successfully created
Why 250k and not 256k? I can’t tell you — sometimes the LVM tools aren’t terribly intuitive. However, you can test to make sure that physical extents start at the proper offset by using:
See: http://thunk.org/tytso/blog/2009/02/...se-block-size/
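The quote trails off before showing the verification command; presumably it is the pvs physical-extent check, which would look roughly like this (the device name is just the one from the quoted example):
Code:
# show where the first physical extent starts; for 128KB-aligned
# volumes this should come out as a multiple of 128 (here: 256.00K)
pvs --units k -o +pe_start /dev/sdb2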
This is much better though not as high as I would expect (how many banks do you have populated in the system?)
The system is currently equipped with 32GB (which is 16GB less than we used for earlier testing), consisting of 8 modules of 4GB each.
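For what it's worth, a quick way to list which banks are populated (a sketch; requires root and assumes dmidecode is installed):
Code:
# each DIMM slot is listed with its size and locator; empty banks
# report "No Module Installed"
dmidecode --type memory | grep -E 'Size|Locator'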
Code:
#assuming /dev/ram0 doesn't already exist
mknod -m 660 /dev/ram0 b 1 0
mknod -m 660 /dev/ram1 b 1 1
# size the disk the way you want it (8GiB below)
dd if=/dev/zero of=/dev/ram0 bs=1M count=8K
dd if=/dev/zero of=/dev/ram1 bs=1M count=8K
I tried this, but it doesn't really work: this way /dev/ramX is limited to 64MB (the kernel's default RAM disk size). I finally got it to work by setting the RAM disk size in GRUB again (to 5GB) and simply using the existing /dev/ram0, /dev/ram1, etc.
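A sketch of what that looks like in GRUB legacy, assuming the ramdisk_size kernel parameter (which takes a value in KiB) is the knob being set; the kernel image and root device below are placeholders:
Code:
# /boot/grub/menu.lst: append ramdisk_size to the kernel line.
# 5 GiB = 5 * 1024 * 1024 KiB = 5242880
kernel /vmlinuz-2.6.x root=/dev/sda1 ramdisk_size=5242880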
Creating an LVM stripe out of the two RAM disks gave me slightly lower performance than just hitting a single RAM disk.
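Roughly how such a stripe would be put together (a sketch; the volume group and volume names are made up, and I'm assuming the 8KB LVM stripe size mentioned earlier):
Code:
pvcreate /dev/ram0 /dev/ram1
vgcreate vg_ram /dev/ram0 /dev/ram1
# -i 2: stripe across both PVs; -I 8: 8KB stripe size
lvcreate -i 2 -I 8 -L 4G -n lv_ram vg_ram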
I also tried to stripe a RAM disk with an Areca, just for the experiment, and it gave me more than 64k, namely 80k. I'll summarize these findings below:
Code:
config                | IOPS (bs=512b, 10 threads)
----------------------+---------------------------
single RAM disk       | 200k
LVM striped RAM disks | 190k
2 individual 5xRAID0  | 100k (2x50k)
LVM striped RAM+SSD   |  80k
1 12xRAID0            |  63k
1 4xRAID0             |  63k
2 LVM striped 8xRAID0 |  60k
2 LVM striped 4xRAID0 |  60k
This table is for the 512B block size. The larger block sizes give slightly fewer IOPS, but the relative performance between the configurations is the same.
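I haven't named the benchmark tool here; for anyone wanting to reproduce this kind of test, a roughly equivalent fio job for the 512B/10-thread case might look like this (a sketch; the device path and runtime are placeholders):
Code:
# random 512B reads, 10 concurrent jobs, direct I/O to bypass the
# page cache; point --filename at the device under test
fio --name=randread --filename=/dev/ram0 --direct=1 --rw=randread \
    --bs=512 --numjobs=10 --runtime=30 --time_based --group_reporting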
The single RAM disk shows that the system itself (kernel-wise) is able to do more than ~60k IOPS. The test putting load on two individual devices shows that the hardware itself is capable of at least 100k IOPS (our target). This dual test includes the PCIe buses.
We also see that a single Areca controller can do at least 63k IOPS. This is either from its cache or from the SSDs, but the controller itself can obviously handle it. Striping an Areca controller with a RAM disk gives 80k, and striping two RAM disks gives 190k, so LVM doesn't limit the IOPS either.
Still, neither increasing the number of disks on a single controller (e.g. from 4 to 12) nor combining two controllers increases the number of IOPS.
It could of course be due to the probability cause mentioned above, but one would think that with 40 test threads all doing completely random I/O, the load would spread fairly evenly across the SSDs: with 40 outstanding requests over 12 drives, the chance that any given drive sits idle is roughly (11/12)^40, about 3%, so on average well under one drive is idle at any moment.
HDD Read Ahead: (assuming you mean HDD Read Ahead Cache? or volume data read ahead cache?)
It's the HDD Read Ahead Cache.
I tried setting it to Disabled, but the results are about the same. The following is for 1x12xRAID0:
Code:
Filling 4G before testing ... 4096 MB done in 3 seconds (1365 MB/sec).

Read Tests:
Block |    1 thread   |   10 threads   |   40 threads
Size  |  IOPS     BW  |  IOPS     BW   |  IOPS     BW
------+---------------+----------------+----------------
512B  | 25633   12.5M | 64028   31.2M  | 63074   30.7M
1K    | 28228   27.5M | 63843   62.3M  | 63060   61.5M
2K    | 29437   57.4M | 63737  124.4M  | 62826  122.7M
4K    | 31409  122.6M | 63218  246.9M  | 62516  244.2M
8K    | 30017  234.5M | 62355  487.1M  | 61603  481.2M
16K   | 25898  404.6M | 61443  960.0M  | 61167  955.7M
32K   | 19703  615.7M | 51208 1600.2M  | 50890 1590.3M
One setting that I'm desperate to try is the Queue Depth. Thus far I've been unable to locate where to set it. It's mentioned in the Areca firmware release notes:
Code:
Change Log For V1.44 Firmware
2007-4-26
* Add enclosure add/remove event
* Add Battery Status for SAS (Battery is not monitored)
2007-5-8
* Add Queue Depth Setting for ARC1680 (default 16)
Does anyone know where this setting is supposed to be set on either the 1680 or 1231?
Edit: I did find the setting "Hdd Queue Depth Setting" on our 1680IX. It was set to 32 (the max). No such setting can be found on the 1231, though.