Ok, saw your comment on # procs and verified that here on a quad-core system as well, so there's a bug with that. I sent an e-mail to the author; it never really cropped up before (much lower IOPS, so it wasn't really saturating anything anyway). Though in your case, even with 256 commands it's only taking ~3.5-4% CPU, which is not bad at all. The write tests you can ignore/stop; it's interesting that they come out very low, but then again I would have assumed that given the nature of SSDs. More of a data point, but with databases you'd probably do an initial load and most traffic would be reads on these anyway, right?
As for your read IOPS, xdd has historically been pretty good about giving readings close to theoretical performance in the 'worst case' (i.e., 100% random workloads without caching), which is generally lower than a lot of the numbers presented on the web (most don't go for worst-case scenarios). Anyway, without going through the bm_flash code to see what it's doing, the delta of ~20K IOPS could be due to caching or workload differences.
Can you try xdd against your 2x5GB ram disks w/ LVM? With bm_flash you got ~190K; I want to see if we are running into another issue with the testing software itself. (Rough setup sketch below.)
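If it helps, here's roughly what I'd run for that (as root). This is just a sketch from memory: the brd/LVM device and volume names are placeholders, and the xdd flags can vary a bit between versions, so sanity-check them against xdd -help on your build first.

    #!/usr/bin/env python3
    # Sketch: build two 5GB ramdisks, stripe an LV across them, and hit the LV
    # with a worst-case (100% random, 4K) read pass from xdd.
    # Device/volume names and xdd options are assumptions -- verify locally.
    import subprocess

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Two 5GB ram-backed block devices (rd_size is in KiB); skip this if brd
    # is already loaded with the sizes you used before.
    run(["modprobe", "brd", "rd_nr=2", "rd_size=5242880"])

    # Stripe a logical volume across both ramdisks (2 stripes, 64K stripe size).
    run(["pvcreate", "/dev/ram0", "/dev/ram1"])
    run(["vgcreate", "ramvg", "/dev/ram0", "/dev/ram1"])
    run(["lvcreate", "-i", "2", "-I", "64", "-L", "9G", "-n", "ramlv", "ramvg"])

    # 100% random 4K reads (512-byte blocks x 8 per request), queue depth 64,
    # 30-second run.
    run(["xdd", "-op", "read", "-targets", "1", "/dev/ramvg/ramlv",
         "-blocksize", "512", "-reqsize", "8", "-seek", "random",
         "-queuedepth", "64", "-runtime", "30", "-verbose"])

That way bm_flash and xdd are pointed at the exact same device and we can see how far apart they land.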
Also, you may want to try a run with deadline as the scheduler (a quick run with either bm_flash or xdd) to see how that compares to noop, unless you've already tried that. It can be flipped on the fly; see below.
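The scheduler is just a sysfs knob, so something like this works without a reboot (run as root; "sdb" is a placeholder for whatever device node the card shows up as):

    #!/usr/bin/env python3
    # Show the current/available elevators for a device and switch it,
    # e.g.  ./set_sched.py sdb deadline   (then re-run the benchmark).
    import sys

    def set_scheduler(dev, sched):
        path = f"/sys/block/{dev}/queue/scheduler"
        with open(path) as f:
            print(f"{dev} available/current: {f.read().strip()}")
        with open(path, "w") as f:
            f.write(sched)
        with open(path) as f:
            print(f"{dev} now: {f.read().strip()}")

    if __name__ == "__main__":
        dev = sys.argv[1] if len(sys.argv) > 1 else "sdb"
        sched = sys.argv[2] if len(sys.argv) > 2 else "deadline"
        set_scheduler(dev, sched)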
What slots do you have your cards in on your system, and do you have anything else in the other slots? (Ideally you'd be running just the cards, with the box headless.) From the Supermicro site it looks like J5 & J6 are the ones you want to use, as they are tied directly to the MCH at full speed.
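One quick way to double-check where the cards actually landed is the negotiated link status from lspci; a card in a slot that isn't wired straight to the MCH will usually train at a narrower width or lower speed. Rough sketch (needs root for the full capability dump; LnkSta is the standard field in lspci -vv output):

    #!/usr/bin/env python3
    # Print the negotiated PCIe link (speed/width) for every device, so you
    # can confirm the cards trained at full speed in J5/J6.
    import re, subprocess

    out = subprocess.run(["lspci", "-vv"], capture_output=True, text=True).stdout
    device = None
    for line in out.splitlines():
        if line and not line[0].isspace():
            device = line.strip()          # new device header line
        m = re.search(r"LnkSta:\s*(.*)", line)
        if m:
            print(device)
            print("    " + m.group(1))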
Also just found this doing some searching: http://www.usenix.org/events/lsf08/t...cardi_SATA.pdf

