Your interrupts look good, much better than what I saw when doing some testing here w/ 10GbE cards, which is why I brought it up.

Read-ahead should be disabled on the drive (if you have the option) and on the controller when doing random testing. That lets the drive release the request back to the higher layers without hanging on, trying to send more data than was actually requested. (This is true for the OS as well, but let's focus on the atomic pieces first.)
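For reference, a rough sketch of where those knobs live on the Linux side; /dev/sdX below is just a placeholder for your array's block device, and the Areca card's own read-ahead/cache settings are normally changed through its BIOS or web GUI rather than from the shell:

Code:
# OS-level read-ahead on the block device (0 sectors = disabled)
blockdev --setra 0 /dev/sdX
blockdev --getra /dev/sdX   # verify

# drive-level read look-ahead, only reachable if the drives are exposed directly;
# behind the Areca firmware this usually has to be done in the card's own config
hdparm -A 0 /dev/sdX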

Yes, I was referring to your bm-flash header line. I just grabbed it here and haven't used it before. Did you pick it out of the blue to test, or does it actually mimic your workload? And have you baselined it on other arrays to verify its accuracy?

With your array (RAID-0, 8 drives, 8KiB stripe size) you get a data stripe width of 64KiB, and if you offset the array w/ metadatasize to align to 256KiB, which you said above you did, you're aligned. Your LVM stripe size should be set to a multiple of a SINGLE sub-array, so 64KiB. The LVM stripe size indicates how much data is placed on each physical volume AS LVM SEES IT; in this case that physical volume is actually your array, which is made up of multiple disks, so you want it set to your data stripe width. Having 256KiB here is OK since it's a multiple of 64KiB, but it means you will send 4 times your stripe width to each sub-array. That could be good or bad (the only way to know is to test), but with a 100% random workload it should ideally equal your data stripe width, since that gives you a higher chance of distributing the workload across each controller.
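To make that concrete, a minimal LVM sketch assuming four sub-arrays showing up as /dev/sda through /dev/sdd (the device names, LV size, and the four-PV count are placeholders, not taken from your setup):

Code:
# align each PV's data area to the 256KiB full-stripe boundary
# (same end result as the metadatasize offset you described)
pvcreate --dataalignment 256k /dev/sda /dev/sdb /dev/sdc /dev/sdd
vgcreate vg_test /dev/sda /dev/sdb /dev/sdc /dev/sdd

# -i 4  : stripe across all four PVs (one per sub-array/controller)
# -I 64 : put 64KiB (one data stripe width) on each PV before moving to the next
lvcreate -i 4 -I 64 -L 500G -n lv_test vg_test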

You can try increasing your nr_requests (try 512 or 1024) to see if it has an effect. I don't believe the Areca cards can handle more than 256, though I've not tested on an array capable of such high IOPS to verify that without statistical error coming into play.
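It's a runtime sysfs knob, so it costs nothing to experiment with (again, sdX is a placeholder for whatever device node the array shows up as):

Code:
cat /sys/block/sdX/queue/nr_requests        # current value
echo 512 > /sys/block/sdX/queue/nr_requests # try 512, then 1024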

Another tool to test with would be XDD just to verify the numbers that you're getting w/ bm-flash. (http://www.ioperformance.com/products.htm)

Quick script to help run it. It creates a 64GiB file (S0) and hits it with 100% random read and write I/O at an 8KiB request size across the whole range of the file; you may want to increase that, and likewise the 16384 MiB read/written per pass, since you have a lot of RAM, though the -dio flag should force direct I/O anyway. Just as a comparison point.

Code:
#!/bin/bash
################################
CONTROLLER="ARC1680ix"
RAID="R0"
DISKS="D8"
DRIVE="st31000340ns"
SS="SS008k"
FS="jfs"
USERNAME=ftpadmin


TMP="/var/ftp/tmp"
XDD="/usr/local/bin/xdd.linux"
XDDTARGET="${TMP}/S0"

# XDD Tests
echo "deleting old $XDDTARGET..."
rm $XDDTARGET
echo "creating new $XDDTARGET..."
dd if=/dev/zero of=$XDDTARGET bs=1M count=65536

sync ; sleep 5
# sweep queue depths; -reqsize 16 * 512B blocks = 8KiB per request
for QD in 1 2 4 8 16 32 64 128 256; do
  sync ; sleep 5
  $XDD -verbose -op read -target $XDDTARGET -blocksize 512 -reqsize 16 -mbytes 16384 -passes 5 -dio -seek random -seek range 128000000 -queuedepth $QD > $TMP/xdd-$CONTROLLER-$RAID-$DISKS-$DRIVE-$SS-$FS-READ-QD$QD.txt
  sync ; sleep 5
  $XDD -verbose -op write -target $XDDTARGET -blocksize 512 -reqsize 16 -mbytes 16384 -passes 5 -dio -seek random -seek range 128000000 -queuedepth $QD > $TMP/xdd-$CONTROLLER-$RAID-$DISKS-$DRIVE-$SS-$FS-WRITE-QD$QD.txt
done

echo "deleting old $XDDTARGET..."
rm $XDDTARGET

Also, have you opened a support query with Areca on this yet, to see whether there is a driver limit at all?