
Thread: Not getting more than 65000 IOPS with xtreme setup

  1. #1
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43

    Not getting more than 65000 IOPS with xtreme setup

    Hi,

    Together with my co-workers I'm trying to create a storage solution capable of doing more than 100k IOPS. See http://jdevelopment.nl/hardware/one-dvd-per-second/ for our attempts thus far.

    The problem we're currently facing is that we're unable to get more than ~65k IOPS for a single device. What we have been trying to do is combine two devices, each built from a high-performance RAID controller (the ARC-1680IX-12 with 4GB RAM) with 8 SSDs (Mtron Pro 7535) attached, into one device using either LVM or mdadm striping (RAID 0).

    In all of our tests the total number of IOPS we get is never higher than ~65k. However, if we put a simultaneous load on the two controllers configured as individual devices, we get a combined total of almost exactly 100k IOPS (for block sizes up to 16K, 20 testing threads total, 10 threads per individual device).

    Because 65k is roughly 2^16 (65,536), this number feels like a maximum setting somewhere rather than a hardware limitation. I'm not sure where to look for such a setting, though. We're running Debian Linux Lenny. So far I've read up a little on I/O schedulers and kernel tuning parameters, but any help would be greatly appreciated.

  2. #2
    SLC
    Join Date
    Oct 2004
    Location
    Ottawa, Canada
    Posts
    2,953
    very nice project. I look forward to more results.

    I am not sure about the 65k limit... Napalm was messing around with MFT and was able to hit 300k+ IOPS with RAM:
    http://www.xtremesystems.org/forums/...&postcount=558

  3. #3
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    Quote Originally Posted by One_Hertz View Post
    very nice project. I look forward to more results.
    Thanks! Next on our list is testing with a 1231ML, hopefully with two of those too.

    I am not sure about the 65k limit... Napalm was messing around with MFT and was able to hit 300k+ IOPS with RAM:
    http://www.xtremesystems.org/forums/...&postcount=558
    That's quite interesting indeed. We are working on Linux, though, so it's a little hard to compare any setting to that, but it does mean it might be worth doing some testing on Windows. I'm not sure how to create a software RAID on Windows (I have very little experience with Windows), but I'm sure it can't be that hard.

  4. #4
    Xtreme Addict
    Join Date
    Jun 2006
    Posts
    1,813
    Quote Originally Posted by One_Hertz View Post
    I am not sure about the 65k limit... Napalm was messing around with MFT and was able to hit 300k+ IOPS with RAM:
    http://www.xtremesystems.org/forums/...&postcount=558
    You realize the card cannot achieve that kind of speed even if it were a double ARC1231?
    PCIe x8 is capped at 2GB/s, and the realistic limit is around 1.5GB/s.
    I don't know how he got the numbers, but they aren't real.

    @OP: I think you won't be able to physically achieve more than that, and it has little to do with the card; it's the Mtrons.
    In the best-case scenario the random I/O would be split evenly across the drives in order, but in reality that will never happen ;( So you're dealing with mathematical probability here: with 8 drives, presuming the stripe size is the same as the block size used for the random I/O, the effective count is about 4 drives, i.e. after 5 I/Os you'll most probably hit one of the previous drives again and thus cap the IOPS.

    IIRC, a single Mtron does ~17,000 random IOPS, so that makes perfect sense: 4 x 17,000 = ~65,000 IOPS.
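That balls-in-bins intuition can be checked numerically. A quick sketch (the drive count and the uniform-random assumption are mine, not measured):

```shell
# Expected number of distinct drives touched by n uniform random stripe-sized
# I/Os across k drives is k * (1 - (1 - 1/k)^n)  (balls-in-bins expectation).
awk 'BEGIN {
  k = 8
  for (n = 1; n <= 16; n++)
    printf "%2d outstanding I/Os -> %.2f distinct drives busy\n", n, k * (1 - (1 - 1/k)^n)
}'
```

With 10 requests in flight this works out to roughly 6 of the 8 drives busy on average, so the effective parallelism is indeed well below the drive count.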
    Last edited by alfaunits; 03-04-2009 at 02:09 PM.
    P5E64_Evo/QX9650, 4x X25-E SSD - gimme speed..
    Quote Originally Posted by MR_SmartAss View Post
    Lately there has been a lot of BS(Dave_Graham where are you?)

  5. #5
    SLC
    Join Date
    Oct 2004
    Location
    Ottawa, Canada
    Posts
    2,953
    Quote Originally Posted by alfaunits View Post
    You realize the card cannot achieve that kind of speed even if it were a double ARC1231?
    PCIe x8 is capped at 2GB/s, and the realistic limit is around 1.5GB/s.
    I don't know how he got the numbers, but they aren't real.

    @OP: I think you won't be able to physically achieve more than that, and it has little to do with the card; it's the Mtrons.
    In the best-case scenario the random I/O would be split evenly across the drives in order, but in reality that will never happen ;( So you're dealing with mathematical probability here: with 8 drives, presuming the stripe size is the same as the block size used for the random I/O, the effective count is about 4 drives, i.e. after 5 I/Os you'll most probably hit one of the previous drives again and thus cap the IOPS.

    IIRC, a single Mtron does ~17,000 random IOPS, so that makes perfect sense: 4 x 17,000 = ~65,000 IOPS.
    I did say it was from RAM, not storage devices...

    @ the second part: I might have read that wrong, but I think they tried with 12 drives too and got the same 6Xk?

  6. #6
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Very interesting study, though a couple of comments and observations:

    First, per your article, you were using two ARC-1680IX-12's and LVM2 to stripe at the OS level. What was your stripe size, both for LVM2 and for your arrays? The LVM2 stripe size should be a multiple of your sub-array stripe size (probably equal to the stripe width of the base array; that way each controller gets enough data to write full stripes, which works for large transfers but not for 8K database blocks, so some testing would need to be done. At the least it should be a multiple of your stripe size). Also, are you testing with a partition table on the array or raw (i.e., alignment issues)? Normally I test with no partition table (the raw RAID volume gets added to LVM, and then a file system goes right on the LV).

    In your write-up you mentioned that you got ~3200MB/s out of two cards? That to me screams RAM testing, not disk testing. I could see around 1800MB/s with 2 controllers, but not double that, unless I've misread what you were doing there.

    As for IOPS, from your results it really looks like a RAID controller limit (the IOP34x series) more than anything, but there are a couple of things that can be tested.

    - create a RAM disk (system RAM) and run your tests against that (with LVM2, and put a file system on it). This would completely rule out the drive/controller subsystem while still testing your OS/filesystem. If you're still hitting 65K then it's something in kernel space (though honestly I've never heard of an issue like that before).
    - if possible, try cards that DO NOT have a built-in expander (the -IX line from Areca does, so you're going through that chip each time). The ARC-1680 doesn't have the expander, which may improve performance a bit (and may also help resolve some of the incompatibility issues you mentioned in the article)

    |.Server/Storage System.............|.Gaming/Work System.............................|.Sundry...................|
    |.Supermicro X8DTH-6f...............|.Asus Z9PE-D8 WS................................|.HP LP3065 30" LCD Monitor|
    |.(2) Xeon X5690....................|.2x E5-2643 v2..................................|.Minolta magicolor 7450...|
    |.(192GB) Samsung PC10600 ECC.......|.2x EVGA nVidia GTX670 4GB......................|.Nikon Coolscan 9000......|
    |.800W Redundant PSU................|.(8x8GB) Kingston DDR3-1600 ECC.................|.Quantum LTO-4HH..........|
    |.NEC Slimline DVD RW DL............|.Corsair AX1200.................................|..........................|
    |.(6) LSI 9200-8e HBAs..............|.Lite-On iHBS112................................|.Dell D820 Laptop.........|
    |.(8) ST9300653SS (300GB) (RAID0)...|.PA120.3, Apogee, MCW N&S bridge................|..2.33GHz; 8GB RAM;.......|
    |.(112) ST2000DL003 (2TB) (RAIDZ2)..|.(1) Areca ARC1880ix-8 512MiB Cache.............|..DVDRW; 128GB SSD........|
    |.(2) ST9146803SS (146GB) (RAID-1)..|.(8) Intel SSD 520 240GB (RAID6)................|..Ubuntu 12.04 64bit......|
    |.Ubuntu 12.04 64bit Server.........|.Windows 7 x64 Pro..............................|..........................|

  7. #7
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    Quote Originally Posted by alfaunits View Post
    @OP: I think you won't be able to physically achieve more than that, and it has little to do with the card, but the Mtrons.
    In best case scenario, the ranom I/O would be split across each drive in order, but in reality it will never happen ;(
    That's true, but logic tells me that eventually, on average, both devices would get a similar amount of random I/O. I may be able to monitor how much I/O each device is getting during a load test, but I would be very surprised if it turned out that one device had to process 80% of the I/O while the other did only 20%.
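For what it's worth, that split can be watched on Linux without extra tools. A sketch (the device names are examples; field 4 of /proc/diskstats, the one after the device name, is reads completed):

```shell
# Two snapshots, one second apart, of completed reads per array device;
# a steady 80/20 skew between sdb and sdc would show up immediately.
for snap in 1 2; do
  awk '$3 == "sdb" || $3 == "sdc" { printf "%s: %s reads completed\n", $3, $4 }' /proc/diskstats
  sleep 1
done
```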

    IIRC, a single Mtron does ~17000 random IOPs, so that makes perfect sense for 4x17000 = ~65000 IOPs.
    Well, we were using a total of 16 drives, not 8. Each controller had 8 Mtrons; we striped the two controllers together, for a total of 16 disks.

    Quote Originally Posted by stevecs View Post
    Very interesting study though a couple comments and observations:
    First per your article you were using two ARC-1680IX-12's and using LVM2 to stripe at the OS level. What was your stripe size both for the LVM2 and for your array.
    The stripe size for the array (set when creating the array in the areca bios) was 128k. For LVM we used the following command:

    Code:
    pvcreate --metadatasize 250k /dev/sdb /dev/sdc
    vgcreate ssd /dev/sdb /dev/sdc
    lvcreate -i2 -I128 -L447G -n ssd-striped ssd
    Getting all the alignment settings right proved to be rather tricky. I'm not 100% sure that we got everything absolutely right, to be honest.

    In your write up you mentioned that you got ~3200MB/s out of two cards? That to me screams ram testing not disk testing. I could see with 2 controllers around 1800MB but not double that, unless I've mis-read what you were doing there.
    The 3.3GB/s was reported by bm-flash, but only at very large block sizes (>256K) and with multiple threads (>10). This is indeed most likely the cache; with smaller block sizes we weren't getting this speed.

    - create a ram disk (system ram) and run your tests against that (with lvm2 & put a file system on it).
    Okay, what would you suggest as the best way to create a ram disk? I created one using GRUB (5GB), and got a lot more IOPS:

    Code:
    Filling 4G before testing  ...   4096 MB done in 2 seconds (2048 MB/sec).
    
    Read Tests:
    
    Block |   1 thread    |  10 threads   |  40 threads
    Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW
         |               |               |
    512B |207091  101.1M |202807   99.0M |204409   99.8M
      1K |197413  192.7M |194549  189.9M |195600  191.0M
      2K |184750  360.8M |180739  353.0M ^C
    I have to find out how to create two RAM disks so LVM can be used on them.

    - if possible, try cards that DO NOT have a built-in expander (the -IX line from areca does, so you're going through that chip each time). The ARC-1680 doesn't have the expander which may improve performance a bit (also may help resolving some incompatibility issues you mentioned in the article)
    That's quite interesting to hear indeed. We just received the 1231ML and are going to test with it tomorrow. It's currently well past midnight here in Amsterdam, so I have to call it a day.

  8. #8
    Xtreme Addict
    Join Date
    Jun 2006
    Posts
    1,813
    Quote Originally Posted by One_Hertz View Post
    I did say it was from RAM not storage devices...
    That picture shows an Areca volume, from what I can see.
    And even then, 8GB/s for a memory copy is not something I would trust. Maybe it is possible, but I doubt it.

    the second part: I might have read that wrong but I think they tried with 12 drives too and got the same 6Xk?
    Yeah, I think probability kills them there as well. I don't remember it well enough to give the exact formula and calculate it, but I think it has more to do with that than anything else.
    I.e. not the card, not the rest of the computer, just the drives.

  9. #9
    Xtreme Addict
    Join Date
    Jun 2006
    Posts
    1,813
    Quote Originally Posted by henk53 View Post
    That's true, but logic tells me that eventually on average both devices would get a similar amount of random I/O. I might be able to monitor somewhere how much I/O each device would be getting during a load test, but I would be very surprised really if it appears that one device had to process 80% of the I/O while the other had to do only 20%.
    Again, if you RAID them in the OS, the probability that you will hit the same physical array is 50%, so it won't allow the total random throughput to be 2x, but rather 150%, which is what you're seeing.

    Well, we were using a total of 16 drives, not 8. Each controller had 8 Mtrons; we striped two controllers, for a total of 16 disks.
    Yes, that's how I calculated it.

  10. #10
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Quote Originally Posted by henk53
    The stripe size for the array (set when creating the array in the areca bios) was 128k. For LVM we used the following command:

    Code:
    pvcreate --metadatasize 250k /dev/sdb /dev/sdc
    vgcreate ssd /dev/sdb /dev/sdc
    lvcreate -i2 -I128 -L447G -n ssd-striped ssd
    Getting all the alignment settings right proved to be rather tricky. I'm not 100% sure that we got everything absolutely right, to be honest.
    Ok, first, for small (and random) I/O it's surprising that you picked such a large stripe size. I would have assumed you would use 8K or 16K, since you said your request size was fixed at 8K? This doesn't make much of a difference for reads, but writes can take more of a hit. Unless the SSDs behave better with larger stripe sizes (more efficient cell writing or something? I haven't played with SSDs enough to really dig to that level).

    But your command string tells me that you are using raw devices (/dev/sdb, /dev/sdc), which is good (no partition tables). Your metadatasize is strange, though: if you have a stripe size of 128K, why is your metadata size 250? I could see 255 (i.e., buffer fill to 256, which is 2x128) or even 127 (a single stripe), but not 250. Your LVM stripe setting of 128 (-I128) is a bit larger than I would have started with, given your request size of 8K; you probably picked it because it was the stripe size of the underlying array. That has some merit, but if 8K really is your request size I would set both to 8K (or to 128K if you picked that for an SSD-write reason). Alternatively, if you're doing a lot of writes, set the LVM stripe to the stripe size times the number of data disks in the base array. That way you increase the chance of writing a full stripe to the array, rather than tying up controller cache waiting to fill one out and incurring a read/modify/write partial-stripe penalty.


    Quote Originally Posted by henk53
    Okay, what would you suggest as the best way to create a ram disk? I created one using GRUB (5GB), and got a lot more IOPS:

    I have to find out how to create two RAM disks so LVM can be used on those.
    This is much better, though not as high as I would expect (how many memory banks do you have populated in the system? do you have memory mirroring turned on or spare rows enabled?). Anyway, it does prove at least that the kernel I/O subsystem is not likely the problem here at all. To do more testing you can create RAM disks by:

    Code:
    #assuming /dev/ram0 doesn't already exist
    mknod -m 660 /dev/ram0 b 1 0
    mknod -m 660 /dev/ram1 b 1 1
    
    # size the disk the way you want it (8GiB below)
    dd if=/dev/zero of=/dev/ram0 bs=1M count=8K
    dd if=/dev/zero of=/dev/ram1 bs=1M count=8K
    then treat /dev/ram0 and /dev/ram1 as physical disks to work on (adding them to LVM, putting a file system on them, et al.).

    This should give you the max value you can reach on your hardware. You'll be lower than this once you go through the I/O hubs, the card controllers, and any expanders.

    Another item from reading your specs on the arecas I see that you have some settings set to auto that you may want to modify:

    HDD Read Ahead: (I assume you mean the HDD Read Ahead Cache? or the volume data read-ahead cache?) Either way, for random workloads you want to turn this down or off, especially if you're going to load the system with highly random traffic. You /may/ get by with the HDD read-ahead cache enabled and volume data read-ahead set to conservative, which gives a little volume buffering (helps with RAID parity checks, for example) without too much of a penalty on random workloads.

    To test your theory about the choppy writes, you can try disabling or removing the cache on the controller to see if that evens things out (or try smaller caches). Areca had some problems in the past with large caches that I thought were fixed, but you may be running into something there (as you surmised: filling the cache and then taking the time to write it out to the drives).


  11. #11
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    Quote Originally Posted by stevecs View Post
    Ok, first for small (and random I/O) this is surprising why you picked such a large stripe size. I would have assumed you would use 8K or 16K since you said your request size was fixed at 8K?
    It basically is. The main (actually the only) application that will use the storage is PostgreSQL, which defaults to an 8KB block size. This can be changed, but that requires recompiling the software.

    We chose 128KB since that's what everybody recommends, including the supplier (ssdisk.eu). Nevertheless, I plan to test with a RAID array stripe size of 4KB and an LVM stripe size of 8KB.

    Your metadatasize though is strange, if you have a stripe size of 128K why is your meta data size 250? I could see 255 (ie, buffer fill to 256 which is 2x128) or even 127 (single stripe) but not 250.
    That's indeed a story of its own. The default metadatasize is 192, so I assume that's the minimum. Since a multiple of 128K was supposed to be the most efficient, the value should be 256. However, pvcreate does some automatic rounding up: if you set it to 256, you end up with 320. Using 250 is thus a little trick: set it to 250, and you end up with 256.

    It appeared others found the same trick:

    LVM likes to allocate 192k for its header information, and 192k is not a multiple of 128k. So if you are creating file systems as logical volumes, and you want those volumes to be properly aligned, you have to tell LVM that it should reserve slightly more space for its meta-data, so that the physical extents that it allocates for its logical volumes are properly aligned. Unfortunately, the way this is done is slightly baroque:

    # pvcreate --metadatasize 250k /dev/sdb2
    Physical volume "/dev/sdb2" successfully created

    Why 250k and not 256k? I can't tell you -- sometimes the LVM tools aren't terribly intuitive. However, you can test to make sure that physical extents start at the proper offset by using:
    See: http://thunk.org/tytso/blog/2009/02/...se-block-size/
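The alignment arithmetic behind the 250k trick, spelled out as a sketch (the pe_start values are the ones discussed above):

```shell
# LVM's default 192K metadata area is not a multiple of the 128K stripe, so
# the first extent would start mid-stripe; 256K is the next aligned offset,
# while the 320K you get from asking for 256 is misaligned again.
for pe_start in 192 256 320; do
  if [ $(( pe_start % 128 )) -eq 0 ]; then
    echo "${pe_start}K aligned to the 128K stripe"
  else
    echo "${pe_start}K NOT aligned"
  fi
done
```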

    This is much better though not as high as I would expect (how many banks do you have populated in the system?
    The system is currently equipped with 32GB (16GB less than we used for earlier testing), consisting of 8 modules of 4GB each.

    Code:
    #assuming /dev/ram0 doesn't already exist
    mknod -m 660 /dev/ram0 b 1 0
    mknod -m 660 /dev/ram1 b 1 1
    
    # size the disk the way you want it (8GiB below)
    dd if=/dev/zero of=/dev/ram0 bs=1M count=8K
    dd if=/dev/zero of=/dev/ram1 bs=1M count=8K
    I tried this, but it doesn't really work: this way /dev/ramX is limited to 64MB. I finally got it to work by setting the RAM disk size in GRUB again (to 5GB) and simply using the existing /dev/ram0, /dev/ram1, etc.
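For reference, the boot-parameter route looks something like this (a sketch of a GRUB-legacy kernel line; the kernel image path and root device are examples, not taken from the thread):

```shell
# /boot/grub/menu.lst (GRUB legacy, as shipped with Lenny) -- a sketch.
# ramdisk_size is in KiB, so a 5 GiB RAM disk is 5 * 1024 * 1024 = 5242880:
#
#   kernel /vmlinuz-2.6.26-1-amd64 root=/dev/sda1 ro ramdisk_size=5242880
#
# This sizes every /dev/ramN at boot, so /dev/ram0 and /dev/ram1 can both be
# handed to LVM afterwards.
```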

    Creating an LVM stripe out of the two RAM disks gave me a slightly lower performance compared to just hitting a single RAM disk.

    I also tried striping a RAM disk with an Areca, just for the experiment, and that gave me more than 64K, namely 80K. I'll summarize these findings below:

    Code:
    config               | IOPS (bs=512b, 10 threads)
    ------------------------------------
    Single RAM disk      | 200k
    LVM striped RAM disk | 190k
    2 individual 5xRAID0 | 100k (2x50k)
    lvm striped RAM+SSD  | 80k
    1 12xRAID0           | 63k
    1 4xRAID0            | 63k
    2 lvm striped 8xRAID0| 60K
    2 lvm striped 4xRAID0| 60K
    This table is for the 512B block size. The larger block sizes give slightly fewer IOPS, but the relative performance between the configurations is the same.

    The single RAM disk shows that the system itself (kernel-wise) is able to do more than ~60K IOPS. The test putting load on two individual devices shows that the hardware itself is capable of at least 100K IOPS (our target). This dual test includes the PCIe buses.

    We also see that a single Areca controller can do at least 63K IOPS. Whether that comes from its cache or from the SSDs, the controller itself can obviously handle it. Striping an Areca controller with a RAM disk gives 80K, and striping two RAM disks gives 190K, so LVM doesn't limit the IOPS either.

    Still, either increasing the number of disks on a single controller (e.g. from 4 to 12) or combining two controllers does not increase the number of IOPS.

    It could of course well be that this is really due to the probability cause mentioned above, but one would think that with 40 testing threads all doing completely random I/O, the chance that all SSDs receive an equal amount of load would be fairly high.

    HDD Read Ahead: (assuming you mean HDD Read Ahead Cache? or volume data read ahead cache?)
    It's the HDD Read Ahead Cache. I tried setting it to disabled, but the results are about the same. The following is for 1x 12xRAID0:

    Code:
    Filling 4G before testing  ...   4096 MB done in 3 seconds (1365 MB/sec).
    
    Read Tests:
    
    Block |   1 thread    |  10 threads   |  40 threads
     Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW
          |               |               |
     512B | 25633   12.5M | 64028   31.2M | 63074   30.7M
       1K | 28228   27.5M | 63843   62.3M | 63060   61.5M
       2K | 29437   57.4M | 63737  124.4M | 62826  122.7M
       4K | 31409  122.6M | 63218  246.9M | 62516  244.2M
       8K | 30017  234.5M | 62355  487.1M | 61603  481.2M
      16K | 25898  404.6M | 61443  960.0M | 61167  955.7M
      32K | 19703  615.7M | 51208 1600.2M | 50890 1590.3M
    One setting that I'm desperate to try is Queue Depth. Thus far I've been unable to locate where to set it. It's mentioned in the Areca firmware release notes:

    Code:
     Change Log For V1.44 Firmware
    
     2007-4-26
    
         * Add enclosure add/remove event
         * Add Battery Status for SAS (Battery is not monitored)
    
     2007-5-8
    
         * Add Queue Depth Setting for ARC1680 (default 16)
    Does anyone know where this setting is supposed to be set on either the 1680 or 1231?

    edit: I did find the setting "Hdd Queue Depth Setting" on our 1680IX. It was set to 32 (the max). No such setting can be found on the 1231, though.
    Last edited by henk53; 03-05-2009 at 10:00 AM.

  12. #12
    Xtreme Addict
    Join Date
    Jun 2006
    Posts
    1,813
    Quote Originally Posted by henk53 View Post
    We choose 128KB since that's what everybody recommends, including the supplier (ssdisk.eu). Nevertheless I plan to test with a RAID array stripe size of 4KB and an LVM stripe size of 8KB.
    No, that would give you the reverse of what you want, since the LVM array would then ALWAYS read/write to both Areca arrays, and that is the one thing you WANT to avoid.
    So choose a stripe size that is equal to the buffer size, i.e. 8K, and an LVM stripe of 16K.

    If you want to test whether the system can achieve more IOPS at the hardware level, create a random test yourself. Namely, for a 12-disk Mtron array with an 8K stripe on a single controller: write software that creates 12 threads, each reading random locations offset by its thread number. So each thread would always read from the same drive, but you would get reads on all 12 drives.
    That would tell you whether the limit is in the controller (if it can't go over 65K) or in the CPU/chipset/kernel (if it caps at some other amount, say the ~200K you have already observed).
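A rough shell version of that test, under my own assumptions (/dev/sdb stands for the array device, and the offsets only span the first ~64K stripe groups; bump REQUESTS for a real run):

```shell
# 12 readers against one striped volume; reader t only touches stripe indexes
# congruent to t mod 12, so each reader stays on "its" member drive.
DEV=/dev/sdb          # the 12-disk RAID0 volume (example name)
STRIPE_KB=8           # array stripe size in KiB
DRIVES=12
REQUESTS=100          # random reads per thread; increase for a real run
for t in $(seq 0 $((DRIVES - 1))); do
  ( for r in $(seq 1 $REQUESTS); do
      # 16 random bits -> a stripe-group index in 0..65535
      n=$(od -An -N2 -tu2 /dev/urandom | tr -d ' ')
      dd if="$DEV" of=/dev/null bs=${STRIPE_KB}k count=1 \
         skip=$(( n * DRIVES + t )) iflag=direct 2>/dev/null
    done ) &
done
wait
```

The skip arithmetic is what confines each reader: (n * 12 + t) mod 12 is always t, so stripe number n*12+t always lands on member drive t of the stripe set.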
    Also note that Areca uses sector 0x208 as the starting sector for its arrays! When you need to align partitions, this must be taken into account.

  13. #13
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    @henk53 - If you are using 8KiB request sizes from PostgreSQL then I would at least try 8KiB as the stripe size. I wouldn't use 4KiB, as that is half your request size, so you would be splitting every request across two drives (halving your subsystem's available IOPS, which is the opposite of what you want).

    As for the LVM metadatasize rounding: yes, I'm aware of that and of the 192KiB default, but it /should not/ increase to 320KiB if you are UNDER 256 (255KiB). Basically LVM earmarks 3 blocks of 64KiB each by default for metadata. If you set it to 256 you are in the next block (remember this is 0-offset), so it will 'pad' that to 320 (256+64) as your new starting point for your data. So 255 should also work. The main point, however, is that when you check the offset with pvs (pvs -o+pe_start) you should see your new starting point; as long as it's aligned, it doesn't really matter how you get there.

    Now the question (which I have NOT benchmarked myself) is whether this should be set to a multiple of your STRIPE size or a multiple of your data STRIPE WIDTH (i.e., assuming 5 drives in RAID-5 with a 128K stripe size, your stripe width for the array would be 128K * 4, minus the 1 drive for parity, so 512KiB). Using your data stripe width as your LVM stripe size /should/ increase performance, since you would have less stripe-width fragmentation and would cut down on the read/modify/write penalties per sub-array. Remember to set your metadatasize to your stripe-width boundary, not to your stripe size. It would prove very interesting if you could test that.
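The stripe-width arithmetic in that example, spelled out as a sketch:

```shell
# RAID-5, 5 drives, 128K stripe size: one drive's worth of each stripe is
# parity, so the data stripe width is stripe_size * (drives - 1).
stripe_kib=128
drives=5
width_kib=$(( stripe_kib * (drives - 1) ))
echo "data stripe width = ${width_kib}KiB"   # 512KiB
```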

    The numbers you get from your RAM tests are very good and strongly point to the RAID card being a bottleneck. Why you don't get scaling with multiple cards is, I think, down to how your I/O is being distributed and the low-level updates (array & LVM striping), OR the Areca driver. Another item to check is how your interrupts are distributed across your CPUs for the cards: do you have each card going to a different CPU and core?

    As for queue depth, I have not found anything on the ARC16xx cards that sets it per drive, but you can set it per array (/sys/block/<device>/queue/nr_requests); the max is 256 for these cards. While you're there, you may want to set your scheduler to deadline or noop if you haven't done so already (/sys/block/<device>/queue/scheduler).
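Those two sysfs knobs, for the record (a sketch; sdb/sdc are example device names, and the writes need root):

```shell
# Deepen the block-layer request queue and switch to the deadline elevator
# for each array's device node; values take effect immediately.
for q in /sys/block/sd[b-c]/queue; do
  [ -d "$q" ] || continue          # skip if the device isn't present
  echo 256 > "$q/nr_requests"      # max for these cards per the post above
  echo deadline > "$q/scheduler"   # 'noop' is the other one worth trying
  cat "$q/scheduler"               # the active scheduler is shown in [brackets]
done
```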


  14. #14
    Xtreme Guru
    Join Date
    Dec 2002
    Posts
    4,068
    Quote Originally Posted by alfaunits View Post
    You realize the card cannot achieve that kind of speed even if it were a double ARC1231?
    PCIe x8 is capped at 2GB/s, and the realistic limit is around 1.5GB/s.
    I don't know how he got the numbers, but they aren't real.

    You realize that the results could be cached??

    The numbers are real; I didn't just pull them out of my butt.

    Quote Originally Posted by One_Hertz View Post
    I did say it was from RAM not storage devices...

    @ the second part: I might have read that wrong but I think they tried with 12 drives too and got the same 6Xk?
    again, it's not the system RAM.. it's the controller's cache
    Last edited by NapalmV5; 03-07-2009 at 09:30 AM.

  15. #15
    Xtreme Addict
    Join Date
    Jun 2006
    Posts
    1,813
    Mate, even if the data comes from the controller cache, it cannot arrive faster than the PCI-e link allows, and that's 2GB/s. The cache is not in system RAM for a dedicated controller.
    And even then, those numbers would be impossible.
    P5E64_Evo/QX9650, 4x X25-E SSD - gimme speed..
    Quote Originally Posted by MR_SmartAss View Post
    Lately there has been a lot of BS(Dave_Graham where are you?)

  16. #16
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    Quote Originally Posted by stevecs View Post
    Now the question is (which I have NOT benchmarked myself) is should this be set to a multiple of your STRIPE size or a multiple of your data STRIPE WIDTH.
    Sorry for my ignorance, but is "STRIPE size" the array stripe size (the one I set in the Areca controller) and "STRIPE WIDTH" the LVM stripe size (the one I use for the pvcreate command) ?

    (ie. assuming 5 drives in raid-5 with a 128K stripe size, your stripe width for the array would be 128*4 (minus 1 drive for parity) so 512KiB.
    Okay, so let me see if I get this right. If I thus set an array stripe size of 8KiB, then when using 5 drives in raid-5, I would use an LVM stripe size of 8*4= 32KiB.

    The LVM command would then initially become something like this:

    Code:
    pvcreate --metadatasize 250 /dev/sdb /dev/sdc
    vgcreate ssd /dev/sdb /dev/sdc
    lvcreate -i2 -I32 -L447G -n ssd-striped ssd

    remember to set your metadatasize to your stripe width boundary not to your stripe size. It would prove to be very interesting if you could test that.
    Hmmm, if I use metadatasize = 250, it will be rounded up to 256, which is aligned with 32. Or do you mean something else here? I'm definitely willing to test this.

  17. #17
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    stripe size = the contiguous space assigned to each drive (what you set in the controller). Stripe width is the # of drives in your array TIMES the stripe size. A data stripe width counts only the data drives per array (i.e., a raid-5 of 5 drives would have a stripe width of 5 and a data stripe width of 4; a raid-6 of 6 drives would have a stripe width of 6 and a data stripe width of 4 (2 drives for parity)).

    Now for LVM you have to worry about your 192K initial metadata. So, as you calculated, if you have a raid-5 of 5 drives (4 data) your data stripe width would be 32KiB. Since 192KiB is divisible by 32 (6 stripes), you want to start your array AT 192K (so in this example you are aligned right out of the gate and don't need to set a metadatasize argument). But you want your LVM stripe size (between arrays) to be a multiple of your individual array's data stripe width (i.e., 32KiB). That way you are not hopping between arrays and only writing partial stripes.

    Now, for argument's sake, if you have a stripe size of 128KiB for your array (same 5-drive raid-5), that would be 512KiB for the data stripe width. So you want to use your metadatasize offset to START your partition at 512KiB (or a multiple thereof), and set your LVM striping (between your arrays) to 512KiB (the data stripe width of a single array) or a multiple thereof.

    so
    Code:
    pvcreate --metadatasize X /dev/sdb /dev/sdc
    # X = 511 or whatever it takes to start at 512KiB with padding
    vgcreate ssd -s Y /dev/sdb /dev/sdc
    # Y is your physical extent size, which you probably want to be a multiple
    # of your data stripe width; the default is 4MiB and that's fine here, as
    # our data stripe widths (32KiB for 8K stripes, 512KiB for 128K) divide it evenly
    lvcreate -i2 -IZ -L447G -n ssd-striped ssd
    # Z should be a multiple of your data stripe width (32KiB or 512KiB)
    make sense?
    Last edited by stevecs; 03-09-2009 at 04:24 PM.


  18. #18
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    Quote Originally Posted by stevecs View Post
    stripe size = the contiguous space assigned to each drive (what you set in the controller). stripe width is the # of drives in your array TIMES the stripe size. Now a data stripe width is the number of data drives per array (ie, raid-5 of 5 drives would have a stripe width of '5' or a data stripe width of 4, a raid-6 of 5 drives would have a stripe width of 6 and data stripe width of 4 (2 drives for parity).
    Ah, ok. That's a very clear explanation. Thanks a lot. Your earlier suggestion of using the NOOP scheduler turned out to be quite beneficial: with 12 disks on a single controller, the number of IOPS increased from ~63k to ~71k.

    Next to that I did some more testing; this time I built 2 RAID0 arrays, each consisting of 8 Mtrons on an Areca card. For the time being the first Areca card is a 1231ML and the other is a 1680ix. We ordered a second 1231ML, but it hasn't arrived yet.

    The array stripe size was set to 8KiB and the LVM stripe size to 64KiB. The metadatasize was kept at 256KiB, which is a multiple of the data width you mentioned (8 disks in RAID0 * 8KiB stripe size = 64KiB). I'm not really sure why I chose 64KiB for the LVM stripe size; I will re-test this soon with an 8KiB LVM stripe size. The scheduler used was the NOOP scheduler.

    I tested the setup in three ways: the two devices individually, the two devices concurrently and the two devices striped via LVM. The results are as follows:

    Code:
    config                | IOPS (bs=512b, 10 threads)
    ----------------------+---------------------------
    2 individual 8xRAID0  | 113k (2x56.5k)
    1 8xRAID0             | 71k
    2 LVM striped 8xRAID0 | 65k
    So with the smaller stripe size, the NOOP scheduler and the metadatasize aligned with the stripe WIDTH, I'm still seeing the same thing happening. Testing the devices concurrently gives me a very high number of IOPS, proving the hardware itself is quite capable. Testing an individual device gives a number of IOPS that is approximately 60% of the concurrent test. Finally, testing the LVM stripe gives a number of IOPS that is about 10% lower than what a single device gives me.

    Relatively speaking, these numbers are very much in line with what I've seen before. In theory striping should increase performance (and it actually does for sequential IO), but for random IO I just don't see it happening.

  19. #19
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Yeah, NOOP is probably the best when using intelligent raid controllers; deadline is also helpful in rare cases where you completely saturate the raid controller's bus and still want the system performing other tasks (but if you're waiting on I/O anyway as your sole bottleneck, it's kind of moot). Another item you want to make sure of, if you're really doing random I/O, is to disable read-ahead on the controller (or at the very least set it to conservative). You don't want the controller to hold up the drive bus trying to read in more sectors that will most likely never be requested in a random workload environment; by disabling read-ahead you tell the controller to retrieve just the data that is being requested, NOT any additional sectors.

    Another comment here is that you probably want to use 4K block sizes, not 512b: your memory page size is 4K, so that (or multiples thereof, like the 8K you're trying) would be the most efficient. Smaller sizes are generally less efficient in terms of CPU.

    If so, then what this is really pointing to is a driver issue, which may be capped by this or by interrupts. Do you have the IRQs for your raid cards on different CPUs? (What does your /proc/interrupts look like?) Also, what is your setting in /sys/block/<raid device>/queue/nr_requests ?
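
    These checks come down to a few sysfs reads; a quick sketch (the device names sdb/sdc are placeholders for the Areca block devices, substitute your own):

```shell
# Inspect the elevator and queue depth for each RAID block device
# (sdb/sdc are placeholders for the Areca volumes)
for dev in sdb sdc; do
  echo "$dev scheduler:   $(cat /sys/block/$dev/queue/scheduler)"
  echo "$dev nr_requests: $(cat /sys/block/$dev/queue/nr_requests)"
done

# Switch to the noop elevator and raise the queue depth (run as root)
echo noop > /sys/block/sdb/queue/scheduler
echo 512  > /sys/block/sdb/queue/nr_requests
```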


    edit: was interrupted; noticed that you did set your array stripe size to 8KiB. Your LVM stripe size for an 8KiB array stripe should be 64KiB when using 8-way raid-0 (you want to use the data stripe width of an individual array for your LVM stripe size).

    A more mundane check as well: you are putting your cards into different I/O hubs on your server board, right? (w/ 8KiB blocks this is ~512MiB/s in transfer rates)
    Last edited by stevecs; 03-10-2009 at 06:50 AM.


  20. #20
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    Quote Originally Posted by stevecs View Post
    Another item you want to make sure of if you're really doing random I/O is to disable read-aheads on the controller (or at the very least set it to conservative) you don't want the controller to hold up the drive bus trying to read more sectors in that will most likely never be requested in a random workload environment so by disabling read-ahead you tell the controller to just retrieve the data that is being requested NOT any additional sectors.
    Read-ahead is indeed enabled on the controller. I'll test again with it disabled.

    Another comment here is that you probably want to use 4K block sizes not 512b (your memory page size is 4K, so that would be the most efficient (or multiples thereof like your 8K you're trying). smaller sizes are generally less efficient in terms of cpu.
    Is this in reference to the heading in my result table, "IOPS (bs=512b, 10 threads)"?

    I'm not really explicitly testing with 512b; bm-flash tests with various block sizes and starts at 512b. I just took the first line of the results, so 512b is just a reading. I could just as well pick the 8KiB line as a reading.

    When you did the 8-way raid-0 (what stripe size?), did you align your array to an 8*stripe-size offset?
    For the 8-way raid 0 setup I set the array stripe size on each controller to 8KiB and the LVM stripe size first to 64KiB and then, in a second test, to 8KiB. The LVM metadatasize (which I assume is what you mean by "align your array to 8*stripe size offset") was set to 256.

    Since in this setup LVM stripes two devices, each consisting of 8 disks in RAID 0, the data WIDTH would then be 2*8*8KiB=128KiB, right? Since the metadatasize of LVM was set to 256KiB, this should be aligned for the 128KiB data width.
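    That alignment arithmetic can be sanity-checked mechanically; a small sketch (numbers taken from the setup described above):

```shell
# metadatasize (KiB) must be a multiple of the data stripe width:
# per sub-array 8*8KiB = 64KiB, across both arrays 2*8*8KiB = 128KiB
awk 'BEGIN {
  meta = 256
  sw = 8 * 8
  tw = 2 * 8 * 8
  printf "per-array width %dKiB: %s\n", sw, ((meta % sw == 0) ? "aligned" : "misaligned")
  printf "combined width %dKiB: %s\n", tw, ((meta % tw == 0) ? "aligned" : "misaligned")
}'
```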

    If so then what this is really pointing to is a driver issue that may be capped by this or interrupts. Do you have your IRQ's for your raid cards on different cpu's? (what's your /proc/interrupts look like?) also, what is your setting in /sys/block/<raid device>/queue/nr_requests ?
    /sys/block/<raid device>/queue/nr_requests for both RAID devices is set to 256.

    /proc/interrupts shows the following:

    Code:
     cat /proc/interrupts 
               CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
      0:         42          0          0          0          0          0          0          0   IO-APIC-edge      timer
      1:          1          0          1          2          1          1          1          1   IO-APIC-edge      i8042
      6:          0          1          0          0          0          1          0          0   IO-APIC-edge      floppy
      8:          0          1          0          0          0          0          0          0   IO-APIC-edge      rtc0
      9:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
     12:          0          0          1          0          0          1          0          2   IO-APIC-edge      i8042
     14:         13         11          9         12         13         13         10         13   IO-APIC-edge      ide0
     20:          5          5          8          4          6          5          6          5   IO-APIC-fasteoi   uhci_hcd:usb1
     21:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb2
     22:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
     23:         24         23         21         26         25         25         24         24   IO-APIC-fasteoi   ehci_hcd:usb4
     48:        457        442        460        446        453        439        460        446   IO-APIC-fasteoi   arcmsr
     52:         14         24         14         22         16         29         16         22   IO-APIC-fasteoi   arcmsr
    1267:          0          0          0          0          0          1          0          0   PCI-MSI-edge      eth0
    1268:          7          7          6          5         10          8          6          8   PCI-MSI-edge      eth0-rx3
    1269:         18         18         19         19          4          6          6          4   PCI-MSI-edge      eth0-rx2
    1270:          9         12          7          9         13         11         10         11   PCI-MSI-edge      eth0-rx1
    1271:         26         23         24         22         20         27         19         22   PCI-MSI-edge      eth0-rx0
    1272:         33         40         37         40         47         42         50         49   PCI-MSI-edge      eth0-tx0
    NMI:          0          0          0          0          0          0          0          0   Non-maskable interrupts
    LOC:       4012       3414       4110       7342       3026       3324       4046       3879   Local timer interrupts
    RES:        289        128        320        193        160         85        215        434   Rescheduling interrupts
    CAL:        326        477        471        443        412        458        471        442   function call interrupts
    TLB:        272        279        966        934        261        272        920        911   TLB shootdowns
    TRM:          0          0          0          0          0          0          0          0   Thermal event interrupts
    THR:          0          0          0          0          0          0          0          0   Threshold APIC interrupts
    SPU:          0          0          0          0          0          0          0          0   Spurious interrupts

  21. #21
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    This is btw what /proc/interrupts shows during a load test on the LVM striped array:

    Code:
     cat /proc/interrupts 
               CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
      0:         42          0          0          0          0          0          0          0   IO-APIC-edge      timer
      1:          1          0          1          2          1          1          1          1   IO-APIC-edge      i8042
      6:          0          1          0          0          0          1          0          0   IO-APIC-edge      floppy
      8:          0          1          0          0          0          0          0          0   IO-APIC-edge      rtc0
      9:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
     12:          0          0          1          0          0          1          0          2   IO-APIC-edge      i8042
     14:         13         11          9         12         13         13         10         13   IO-APIC-edge      ide0
     20:          5          5          8          4          6          5          6          5   IO-APIC-fasteoi   uhci_hcd:usb1
     21:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb2
     22:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
     23:         24         23         21         26         25         25         24         24   IO-APIC-fasteoi   ehci_hcd:usb4
     48:     336872     336984     336130     336383     337866     337167     337506     337323   IO-APIC-fasteoi   arcmsr
     52:     336991     336875     337740     337475     335995     336691     336368     336542   IO-APIC-fasteoi   arcmsr
    1267:          0          0          0          0          0          1          0          0   PCI-MSI-edge      eth0
    1268:         10          9          6          7         13         10         10         12   PCI-MSI-edge      eth0-rx3
    1269:         20         21         20         22          6          7          7          5   PCI-MSI-edge      eth0-rx2
    1270:         67         70         63         74         81         73         71         72   PCI-MSI-edge      eth0-rx1
    1271:         97         95        102         93         84        104         87         91   PCI-MSI-edge      eth0-rx0
    1272:        145        150        144        147        156        149        156        154   PCI-MSI-edge      eth0-tx0
    NMI:          0          0          0          0          0          0          0          0   Non-maskable interrupts
    LOC:      18164      17107      11112      20057       8536       8791       9079       8773   Local timer interrupts
    RES:     369121     327407     190796     206083     104982     104750      98572      92634   Rescheduling interrupts
    CAL:        328        479        473        445        414        460        473        442   function call interrupts
    TLB:        286        295       1028       1041        284        286       1014        994   TLB shootdowns
    TRM:          0          0          0          0          0          0          0          0   Thermal event interrupts
    THR:          0          0          0          0          0          0          0          0   Threshold APIC interrupts
    SPU:          0          0          0          0          0          0          0          0   Spurious interrupts
    It seems clear that interrupts are divided equally over the total number of CPUs in the system, especially for the Areca drivers (arcmsr).

  22. #22
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Your interrupts look good. Much better than what I saw when doing some testing here w/ 10GbE cards, which is why I brought it up.

    Read-ahead should be disabled on the drive (if you have the option) and on the controller when doing random testing. This will allow the drive to release back to the higher layers without hanging on, trying to send more data than was actually requested. (This is also true for the OS, but let's focus on the atomic pieces first.)

    Yes, I was referring to your bm-flash header line. I just grabbed it here and haven't used it before; did you just grab this out of the blue to test, or does it actually mimic your workload? And have you baselined it on other arrays to verify its accuracy?

    If your array is raid-0 / 8 drives with an 8KiB stripe size, that gives you a data stripe width of 64KiB; you offset your array w/ metadatasize to align to 256KiB, which, as you said above, means you are aligned. Your LVM stripe size should be set to a multiple of a SINGLE sub-array, so 64KiB. The LVM stripe size indicates how much data is to be contained on each physical volume AS IT SEES IT. In this case that physical volume is actually your array, which is comprised of multiple disks, so you want it set to your data stripe width. Having 256KiB here is ok, as it's a multiple of 64KiB, but that means you will send 4 times your stripe width to each sub-array. This could be good or bad (the only way to know is to test), but ideally, with a 100% random workload, it should be best to have it equal to your data stripe width (that way you have a higher chance of distributing your workload across both controllers).
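    A minimal sketch of that re-test (the device names and the -L size are carried over from the earlier commands and are assumptions):

```shell
# Re-create the striped LV with the LVM stripe size equal to one
# sub-array's data stripe width: 8 drives * 8KiB = 64KiB (-I64).
# /dev/sdb and /dev/sdc stand in for the two 8-drive RAID0 volumes.
pvcreate --metadatasize 250 /dev/sdb /dev/sdc  # rounds up to 256KiB, a multiple of 64KiB
vgcreate ssd /dev/sdb /dev/sdc
lvcreate -i2 -I64 -L447G -n ssd-striped ssd
```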

    You can try increasing your nr_requests (try 512 or 1024) to see if it has an effect. I don't believe the Areca cards can handle more than 256, though I've not tested on an array capable of such high IOPS to verify that without statistical error coming into play.

    Another tool to test with would be XDD, just to verify the numbers you're getting w/ bm-flash. (http://www.ioperformance.com/products.htm)

    Fast script to help run it. This creates a 64GiB file (S0) to run against, with 100% random read & write I/O w/ an 8KiB request size across the range of the file. You may want to increase that, and likewise the 16384 MiB to read/write, as you have a lot of RAM, though the -dio flag should do direct I/O anyway. Just as a comparison point.

    Code:
    #!/bin/bash
    ################################
    CONTROLLER="ARC1680ix"
    RAID="R0"
    DISKS="D8"
    DRIVE="st31000340ns"
    SS="SS008k"
    FS="jfs"
    USERNAME=ftpadmin
    
    
    TMP="/var/ftp/tmp"
    XDD="/usr/local/bin/xdd.linux"
    XDDTARGET="${TMP}/S0"
    
    # XDD Tests
    echo "deleting old $XDDTARGET..."
    rm $XDDTARGET
    echo "creating new $XDDTARGET..."
    dd if=/dev/zero of=$XDDTARGET bs=1M count=65536
    
    sync ; sleep 5
    for QD in 1 2 4 8 16 32 64 128 256;  do
      sync ; sleep 5
      $XDD -verbose -op read -target S0 -blocksize 512 -reqsize 8 -mbytes 16384 -passes 5 -dio -seek random -seek range 128000000 -queuedepth $QD > $TMP/xdd-$CONTROLLER-$RAID-$DISKS-$DRIVE-$SS-$FS-READ-QD$QD.txt
      sync ; sleep 5
      $XDD -verbose -op write -target S0 -blocksize 512 -reqsize 8 -mbytes 16384 -passes 5 -dio -seek random -seek range 128000000 -queuedepth $QD > $TMP/xdd-$CONTROLLER-$RAID-$DISKS-$DRIVE-$SS-$FS-WRITE-QD$QD.txt
    done
    
    echo "deleting old $XDDTARGET..."
    rm $XDDTARGET
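
    As an extra cross-check (an editor's suggestion, not part of the original script), roughly the same 8KiB random-read test can be expressed with fio; the file path and sizes mirror the xdd run above and are assumptions:

```shell
# 8KiB direct random reads against the pre-created test file
# (/ssd/S0 is assumed to be the same target the xdd script uses)
fio --name=randread --filename=/ssd/S0 --rw=randread --bs=8k \
    --direct=1 --ioengine=libaio --iodepth=32 --size=16g \
    --runtime=60 --time_based --group_reporting
```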

    Also have you opened up a query to areca yet on this to see if there is a driver limit at all?


  23. #23
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    Quote Originally Posted by stevecs View Post
    Yes I was referring to your bm-flash header line. I just grabbed it here and haven't used it before, did you just grab this out of the blue to test or does this actually mimic your workload?
    We more or less grabbed it out of the blue. There is some justification for using it, but it's thin. The thing is that we're obviously not professional testers or benchmarkers; in normal life I'm a Java (lead) developer. Most other tools we tried were quite cumbersome to set up, while bm-flash immediately gives results in understandable figures (IOPS and MB/s). We spent a lot of time getting IOMeter to work on Linux, but it just doesn't work. The so-called GRUNT thread always hangs. Always...


    and have you baselined it before on other arrays to verify it's accuracy?
    We did use it on another array (another server having a 1680 and 4 Mtron 7000's), but that doesn't really help to verify its accuracy.




    You can try increasing your nr_requests (try 512 or 1024) to see if it has an effect.
    We did indeed try that, but it didn't make any difference.

    fast script to help run it.
    Okay, I'm running the script and XDD (I had to change -target S0 to -target $XDDTARGET). Although it's still running, these are the first results:

    Code:
    IOIOIOIOIOIOIOIOIOIOI XDD version 6.5.013007.0001 IOIOIOIOIOIOIOIOIOIOIOI
    xdd - I/O Performance Inc. Copyright 1992-2007
    Starting time for this run, Tue Mar 10 16:42:09 2009
    
    ID for this run, 'No ID Specified'
    Maximum Process Priority, disabled
    Passes, 5
    Pass Delay in seconds, 0
    Maximum Error Threshold, 0
    Target Offset, 0
    I/O Synchronization, 0
    Total run-time limit in seconds, 0
    Output file name, stdout
    CSV output file name, 
    Error output file name, stderr
    Pass seek randomization, disabled
    File write synchronization, disabled
    Pass synchronization barriers, enabled
    Number of Targets, 1
    Number of I/O Threads, 1
    
    Computer Name, mrhpgdb2, User Name, henk.dewit
    OS release and version, Linux 2.6.26-1-amd64 #1 SMP Sat Jan 10 17:57:00 UTC 2009
    Machine hardware type, x86_64
    Number of processors on this system, 1
    Page size in bytes, 4096
    Number of physical pages, 8255301
    Megabytes of physical memory, 32247
    Seconds before starting, 0
            Target[0] Q[0], /ssd/S0
                    Target directory, "./"
                    Process ID, 18681
                    Thread ID, -169722992
                    Processor, all/any
                    Read/write ratio, 100.00,  0.00
                    Throttle in MB/sec,   0.00
                    Per-pass time limit in seconds, 0
                    Blocksize in bytes, 512
                    Request size, 8, blocks, 4096, bytes
                    Number of Requests, 4194304
                    Start offset, 0
                    Number of MegaBytes, 16384
                    Pass Offset in blocks, 0
                    I/O memory buffer is a normal memory buffer
                    I/O memory buffer alignment in bytes, 4096
                    Data pattern in buffer, '0x00'
                    Data buffer verification is disabled.
                    Direct I/O, enabled
                    Seek pattern, random
                    Seek range, 128000000
                    Preallocation, 0
                    Queue Depth, 1
                    Timestamping, disabled
                    Delete file, disabled
    
                            T  Q       Bytes      Ops    Time      Rate      IOPS   Latency     %CPU  OP_Type    ReqSize     
    TARGET   PASS0001    0  1   17179869184   4194304   557.624    30.809     7521.75    0.0001     0.00   read        4096 
    TARGET   PASS0002    0  1   17179869184   4194304   558.737    30.748     7506.76    0.0001     0.00   read        4096 
    TARGET   PASS0003    0  1   17179869184   4194304   559.249    30.720     7499.88    0.0001     0.00   read        4096 
    TARGET   PASS0004    0  1   17179869184   4194304   558.915    30.738     7504.37    0.0001     0.00   read        4096 
    TARGET   PASS0005    0  1   17179869184   4194304   559.025    30.732     7502.89    0.0001     0.00   read        4096 
    TARGET   Average       0  1   85899345920   20971520   2793.549    30.749     7507.12    0.0001     0.00   read        4096 
             Combined    1  1   85899345920   20971520   2793.549    30.749     7507.12    0.0001     0.00   read        4096
    Also have you opened up a query to areca yet on this to see if there is a driver limit at all?
    Not yet about this issue. We had a couple of other outstanding issues with them, like the fact that a hot spare is not picked up when a 1680 contains two individual RAID sets. They initially replied, asking for more information, but after that nothing but silence. I'll try to contact them about this issue too, though.
    Last edited by henk53; 03-10-2009 at 09:39 AM.

  24. #24
    Xtreme Addict
    Join Date
    Jul 2006
    Posts
    1,124
    Yeah, just found the arcmsr.h and it's limited to 256 commands max for the controller:
    #define ARCMSR_MAX_OUTSTANDING_CMD 256

    Sorry about the script; I modified it when I posted it to better match what you were trying to do, so there may have been a couple of typos. This first batch shows 1 thread at a time (Q=1). If you want to skip all that, just run it w/ the max of 256 outstanding threads (skip the for loop and set it to 256), as that will flood the array. Though if you do let it run, it should show scalability based on queue depth, which may be interesting if there is a leveling-off at some point that we can map back to a bottleneck.
    Last edited by stevecs; 03-10-2009 at 09:51 AM.


  25. #25
    Registered User
    Join Date
    Feb 2009
    Location
    Europe, Amsterdam
    Posts
    43
    Quote Originally Posted by stevecs View Post
    Yeah, just found the arcmsr.h and it's limited to 256 commands max for the controller:
    #define ARCMSR_MAX_OUTSTANDING_CMD 256
    I just took a look at the source too. Good find!


    if you want to skip all that and just run it w/ the max 256 outstanding threads (just don't do the for loop and set it to 256) as that will flood the array.
    For a quick test I shortened the loop to just 128 and 256 (and just two passes). Oh, and I also compiled the code this time; in the previous test I just executed the binary that was already in the /bin directory. These are the results for the 128-thread test:

    Code:
    IOIOIOIOIOIOIOIOIOIOI XDD version 6.5.013007.0001 IOIOIOIOIOIOIOIOIOIOIOI
    xdd - I/O Performance Inc. Copyright 1992-2007
    Starting time for this run, Tue Mar 10 18:03:26 2009
    
    ID for this run, 'No ID Specified'
    Maximum Process Priority, disabled
    Passes, 2
    Pass Delay in seconds, 0
    Maximum Error Threshold, 0
    Target Offset, 0
    I/O Synchronization, 0
    Total run-time limit in seconds, 0
    Output file name, stdout
    CSV output file name, 
    Error output file name, stderr
    Pass seek randomization, disabled
    File write synchronization, disabled
    Pass synchronization barriers, enabled
    Number of Targets, 1
    Number of I/O Threads, 128
    
    Computer Name, mrhpgdb2, User Name, henk.dewit
    OS release and version, Linux 2.6.26-1-amd64 #1 SMP Sat Jan 10 17:57:00 UTC 2009
    Machine hardware type, x86_64
    Number of processors on this system, 1
    Page size in bytes, 4096
    Number of physical pages, 8255301
    Megabytes of physical memory, 32247
    Seconds before starting, 0
            Target[0] Q[0], /ssd/S0
                    Target directory, "./"
                    Process ID, 20815
                    Thread ID, 1141754192
                    Processor, all/any
                    Read/write ratio, 100.00,  0.00
                    Throttle in MB/sec,   0.00
                    Per-pass time limit in seconds, 0
                    Blocksize in bytes, 512
                    Request size, 8, blocks, 4096, bytes
                    Number of Requests, 32768
                    Start offset, 0
                    Number of MegaBytes, 16384
                    Pass Offset in blocks, 0
                    I/O memory buffer is a normal memory buffer
                    I/O memory buffer alignment in bytes, 4096
                    Data pattern in buffer, '0x00'
                    Data buffer verification is disabled.
                    Direct I/O, enabled
                    Seek pattern, queued_interleaved
                    Seek range, 128000000
                    Preallocation, 0
                    Queue Depth, 128
                    Timestamping, disabled
                    Delete file, disabled
    
                         T  Q       Bytes      Ops    Time      Rate      IOPS   Latency     %CPU  OP_Type    ReqSize     
    TARGET   PASS0001    0 128   17179869184   4194304    76.375   224.942     54917.49    0.0000     1.42   read        4096 
    TARGET   PASS0002    0 128   17179869184   4194304    76.593   224.300     54760.70    0.0000     1.24   read        4096 
    TARGET   Average     0 128   34359738368   8388608   152.927   224.681     54853.83    0.0000     1.33   read        4096 
             Combined    1 128   34359738368   8388608   152.927   224.681     54853.83    0.0000     1.32   read        4096 
    Ending time for this run, Tue Mar 10 18:06:22 2009
    The tool thus gives the number 54k for IOPS, which is distinctly different from the number that bm-flash gives me. I'll try a test with the same number of threads that bm-flash uses (10 and 40, respectively). One other thing I wonder about is the fact that XDD reports the number of CPUs on the system as 1, while in fact there are 8 cores (2 physical CPUs).
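As a quick sanity check on the table above (pure arithmetic on the PASS0001 columns, nothing assumed beyond what xdd printed):

```python
# Cross-check xdd's reported numbers from the raw columns of PASS0001 above:
# IOPS = Ops / Time, and QueueDepth / IOPS gives the average time each
# request spends in flight (which xdd's Latency column rounds away to 0.0000).
ops, seconds, queue_depth = 4194304, 76.375, 128

iops = ops / seconds
print(round(iops, 1))                       # ~54917.2, close to the reported 54917.49
print(round(queue_depth / iops * 1000, 2))  # ~2.33 ms in flight per request
```

At ~2.33 ms of effective in-flight time per request, 128 outstanding requests mathematically cap out near 55k IOPS, which matches what both passes show.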

    edit:

    The write test finally completed:

    Code:
    IOIOIOIOIOIOIOIOIOIOI XDD version 6.5.013007.0001 IOIOIOIOIOIOIOIOIOIOIOI
    xdd - I/O Performance Inc. Copyright 1992-2007
    Starting time for this run, Tue Mar 10 18:06:28 2009
    
    ID for this run, 'No ID Specified'
    Maximum Process Priority, disabled
    Passes, 2
    Pass Delay in seconds, 0
    Maximum Error Threshold, 0
    Target Offset, 0
    I/O Synchronization, 0
    Total run-time limit in seconds, 0
    Output file name, stdout
    CSV output file name, 
    Error output file name, stderr
    Pass seek randomization, disabled
    File write synchronization, disabled
    Pass synchronization barriers, enabled
    Number of Targets, 1
    Number of I/O Threads, 128
    
    Computer Name, mrhpgdb2, User Name, henk.dewit
    OS release and version, Linux 2.6.26-1-amd64 #1 SMP Sat Jan 10 17:57:00 UTC 2009
    Machine hardware type, x86_64
    Number of processors on this system, 1
    Page size in bytes, 4096
    Number of physical pages, 8255301
    Megabytes of physical memory, 32247
    Seconds before starting, 0
            Target[0] Q[0], /ssd/S0
                    Target directory, "./"
                    Process ID, 20951
                    Thread ID, 1152395600
                    Processor, all/any
                    Read/write ratio,  0.00, 100.00
                    Throttle in MB/sec,   0.00
                    Per-pass time limit in seconds, 0
                    Blocksize in bytes, 512
                    Request size, 8, blocks, 4096, bytes
                    Number of Requests, 32768
                    Start offset, 0
                    Number of MegaBytes, 16384
                    Pass Offset in blocks, 0
                    I/O memory buffer is a normal memory buffer
                    I/O memory buffer alignment in bytes, 4096
                    Data pattern in buffer, '0x00'
                    Data buffer verification is disabled.
                    Direct I/O, enabled
                    Seek pattern, queued_interleaved
                    Seek range, 128000000
                    Preallocation, 0
                    Queue Depth, 128
                    Timestamping, disabled
                    Delete file, disabled
    
                           T  Q       Bytes      Ops       Time         Rate      IOPS       Latency    %CPU  OP_Type    ReqSize     
    TARGET   PASS0001    0 128   17179869184   4194304   2635.499     6.519     1591.46    0.0006     0.04   write        4096

    It might be clearer to paste the figures separately from the overview data:

    The write figures, 128 threads:
    Code:
                           T  Q       Bytes      Ops       Time         Rate      IOPS       Latency    %CPU  OP_Type    ReqSize     
    TARGET   PASS0001    0 128   17179869184   4194304   2635.499     6.519     1591.46    0.0006     0.04   write        4096 
    TARGET   PASS0002    0 128   17179869184   4194304   2911.876     5.900     1440.41    0.0007     0.04   write        4096 
    TARGET   Average     0 128   34359738368   8388608   5543.046     6.199     1513.36    0.0007     0.04   write        4096 
             Combined    1 128   34359738368   8388608   5547.000     6.194     1512.28    0.0007     0.04   write        4096
    The read figures for 256 threads:
    Code:
                           T  Q       Bytes      Ops       Time      Rate        IOPS        Latency    %CPU  OP_Type    ReqSize     
    TARGET   PASS0001    0 256   17179869184   4194304   337.441    50.912     12429.73    0.0001     0.86   read        4096 
    TARGET   PASS0002    0 256   17179869184   4194304    76.499   224.577     54828.35    0.0000     3.09   read        4096 
    TARGET   Average     0 256   34359738368   8388608   413.915    83.012     20266.49    0.0000     1.27   read        4096 
             Combined    1 256   34359738368   8388608   413.915    83.012     20266.49    0.0000     1.27   read        4096
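One note on how xdd averages (a small arithmetic sketch using only the figures above): the Average row is time-weighted, not a mean of the per-pass IOPS, so a slow first pass drags it down proportionally.

```python
# xdd's "Average" row is total Ops / total Time. The slow first pass
# therefore dominates it; a naive mean of per-pass IOPS comes out different.
total_ops, total_seconds = 8388608, 413.915   # from the Average row above

print(round(total_ops / total_seconds, 1))    # ~20266.5, matching the table
print(round((12429.73 + 54828.35) / 2, 2))    # 33629.04, what a plain mean would give
```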
    There is a rather large performance difference between the first and second pass. When I did this run again (10 passes), I got a more consistent picture:

    Code:
                         T Q     Bytes         Ops        Time     Rate        IOPS        Latency    %CPU   OP_Type     ReqSize     
    TARGET   PASS0001    0 256   17179869184   4194304    76.641   224.159     54726.28    0.0000     3.95   read        4096 
    TARGET   PASS0002    0 256   17179869184   4194304    76.726   223.912     54665.97    0.0000     3.22   read        4096 
    TARGET   PASS0003    0 256   17179869184   4194304    76.594   224.297     54760.02    0.0000     3.20   read        4096 
    TARGET   PASS0004    0 256   17179869184   4194304    76.691   224.013     54690.69    0.0000     3.26   read        4096 
    TARGET   PASS0005    0 256   17179869184   4194304    76.680   224.045     54698.57    0.0000     3.24   read        4096 
    TARGET   PASS0006    0 256   17179869184   4194304    76.702   223.983     54683.23    0.0000     3.22   read        4096 
    TARGET   PASS0007    0 256   17179869184   4194304    76.708   223.966     54679.14    0.0000     3.26   read        4096 
    TARGET   PASS0008    0 256   17179869184   4194304    76.259   225.283     55000.76    0.0000     3.27   read        4096 
    TARGET   PASS0009    0 256   17179869184   4194304    76.835   223.596     54588.75    0.0000     3.22   read        4096 
    TARGET   PASS0010    0 256   17179869184   4194304    76.295   225.177     54974.74    0.0000     3.24   read        4096 
    TARGET   Average     0 256  171798691840   41943040  765.943   224.297     54760.01    0.0000     3.31   read        4096 
             Combined    1 256  171798691840   41943040  766.000   224.280     54755.93    0.0000     3.29   read        4096
    Finally, the write figures for 256 threads. This again took quite some time to complete.

    Code:
                           T  Q    Bytes         Ops       Time         Rate      IOPS       Latency    %CPU   OP_Type      ReqSize     
    TARGET   PASS0001    0 256   17179869184   4194304   2615.796     6.568     1603.45    0.0006     0.10   write        4096 
    TARGET   PASS0002    0 256   17179869184   4194304   2893.969     5.936     1449.33    0.0007     0.07   write        4096 
    TARGET   Average     0 256   34359738368   8388608   5508.566     6.238     1522.83    0.0007     0.09   write        4096 
             Combined    1 256   34359738368   8388608   5509.000     6.237     1522.71    0.0007     0.08   write        4096
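Putting the four runs side by side (a small comparison sketch over the Average rows already posted; no new measurements):

```python
# Compare the Average rows of the four runs above: doubling the host-side
# queue depth from 128 to 256 barely moves either reads or writes, which
# suggests the ceiling sits below the 256-command driver cap, not at queue depth.
read_iops  = {128: 54853.83, 256: 54760.01}   # read Average rows
write_iops = {128: 1513.36,  256: 1522.83}    # write Average rows

for qd in (128, 256):
    ratio = read_iops[qd] / write_iops[qd]
    print(qd, round(ratio, 1))                # reads come out ~36x faster at both depths
```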
    Last edited by henk53; 03-10-2009 at 02:01 PM.

