
Thread: Measuring QD in Win7

  1. #126
    Xtreme Guru
    Join Date
    Aug 2009
    Location
    Wichita, Ks
    Posts
    3,887
    Oh wow, those ACards do help with game loading times. When is the last time you have seen a single SSD load that fast? I think what you mean to say is that they don't scale well. It also depends on whether the games, programs, etc. take advantage of all of the processor cores (multithreading, multicore usage).
    Throw in my usual comment about game loading times not being the only pertinent information, performance being key as well... etc etc yada yada yada
    "Lurking" Since 1977


    Jesus Saves, God Backs-Up
    *I come to the news section to ban people, not read complaints.*-[XC]Gomeler
    Don't believe Squish, his hardware does control him!

  2. #127
    Xtreme Addict
    Join Date
    Nov 2003
    Location
    NYC
    Posts
    1,592
    In ME2 the entire "level loading" time isn't spent waiting on the disks, and the same goes for ME1. Of the ~30 seconds a transition takes, only about 5 seconds is actual disk use (so far, at least). So no matter how fast the disks are, you aren't going to shave 75% off the loading time in that particular game, no matter what you have or are doing.
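    Just to put numbers on that (the figures are from this post; the snippet below is only illustrative arithmetic, not a measurement):

    # Best-case load time if storage were infinitely fast, using the
    # ~30 s transition / ~5 s of actual disk use figures quoted above.
    total_s, disk_s = 30.0, 5.0
    best_case_s = total_s - disk_s     # everything except the disk wait remains
    max_saving = disk_s / total_s      # the most a faster disk can ever shave off
    print(f"best case {best_case_s:.0f} s, max saving ~{max_saving:.0%}")  # ~17%, nowhere near 75%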

  3. #128
    Xtreme Guru
    Join Date
    Aug 2009
    Location
    Wichita, Ks
    Posts
    3,887
    The Queue Length counter reports the number of incoming SMB requests that are queued for processing, waiting for a worker thread to become available. There are separate per-processor Server Work queues to minimize inter-processor communication delays.
    I think we are off target with the queue depth computations. It seems that each processor has its own QD counter.
    http://technet.microsoft.com/en-us/l.../cc300400.aspx
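    For anyone who wants to watch these counters, here is a minimal sketch (my own, not from the article) that shells out to Windows' typeperf with the standard Perfmon paths for the per-disk queue and the per-processor Server Work Queues; the helper name is made up:

    import subprocess

    # Standard Perfmon counter paths: the physical-disk queue most of this thread
    # is about, and the per-CPU SMB Server Work Queues the TechNet page describes.
    COUNTERS = [
        r"\PhysicalDisk(_Total)\Avg. Disk Queue Length",
        r"\Server Work Queues(*)\Queue Length",
    ]

    def sample_queue_counters(samples=5, interval_s=1):
        """Hypothetical helper: return typeperf's CSV output for the counters above."""
        cmd = ["typeperf", *COUNTERS, "-sc", str(samples), "-si", str(interval_s)]
        return subprocess.run(cmd, capture_output=True, text=True).stdout

    if __name__ == "__main__":
        print(sample_queue_counters())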
    "Lurking" Since 1977


    Jesus Saves, God Backs-Up
    *I come to the news section to ban people, not read complaints.*-[XC]Gomeler
    Don't believe Squish, his hardware does control him!

  4. #129
    Xtreme Member
    Join Date
    Jul 2008
    Location
    Michigan
    Posts
    300
    Quote Originally Posted by Levish View Post
    In ME2 the entire "level loading" time isn't spent waiting on the disks, and the same goes for ME1. Of the ~30 seconds a transition takes, only about 5 seconds is actual disk use (so far, at least). So no matter how fast the disks are, you aren't going to shave 75% off the loading time in that particular game, no matter what you have or are doing.
    +1

    I ran a bunch of tests with different disk configurations loading Company of Heroes, trying to see what would lower the start-up time of an 8-player skirmish match. Among the disk configurations, I used a single Vertex SSD, a single SAS drive, SSDs in RAID 0, and a RAM disk, placing the entire game folder on each setup. Game load times on the SSDs weren't off by more than a second or two from each other. Even the RAM disk didn't shave any more time off!

    The biggest difference in load times in this game came from OC'ing the processor. I saw decreases of a few seconds for every 300MHz of overclock.


    For example...

    Time to load the 8 player skirmish map in COH:

    Intel E8400 @ 3.0GHz (2x30GB RAID 0) ......... 14 seconds
    Intel E8400 @ 3.0GHz (9GB RAMdisk) ........... 13 seconds
    Intel E8400 @ 3.8GHz (2x30GB RAID 0) ......... 9 seconds

    That was a pretty big difference after OC'ing the processor.
    MainGamer PC----Intel Core i7 - 6GB Corsair 1600 DDR3 - Foxconn Bloodrage - ATI 6950 Modded - Areca 1880ix-12 - 2 x 120GB G.Skill Phoenix SSD - 2 x 80GB Intel G2 - Lian LI PCA05 - Seasonic M12D 850W PSU
    MovieBox----Intel E8400 - 2x 4GB OCZ 800 DDR2 - Asus P5Q Deluxe - Nvidia GTS 250 - 2x30GB OCZ Vertex - 40GB Intel X25-V - 60GB OCZ Agility- Lian LI PCA05 - Corsair 620W PSU

  5. #130
    PCMark V Meister
    Join Date
    Dec 2009
    Location
    Athens GR
    Posts
    771
    After a long, long time (my hair has gone white, lol) I came up with this.
    The test screens are 4K file size on ICH10R with 8 SSDs in RAID 0, 128K stripe size, on a Tyan S7025 motherboard.
    4K file size @ random speeds, at queue depths of 1, 2, 4, 8, 16, 32 and 64.
    Attached thumbnails: 1q.PNG, 2q.JPG, 4q.JPG, 8q.JPG, 16q.JPG (4K random results at QD 1, 2, 4, 8 and 16).


  6. #131
    PCMark V Meister
    Join Date
    Dec 2009
    Location
    Athens GR
    Posts
    771
    now 32 and 64 queue depth
    Attached thumbnail: 64q.JPG.
    Last edited by Tiltevros; 03-02-2010 at 04:09 AM.

  7. #132
    Xtreme Mentor
    Join Date
    Feb 2009
    Posts
    2,597
    GullLars was kind enough to give me quite a detailed explanation of QD/IOPS/access times some time ago, which provides some great insights. It came via a PM that I did not want to share without permission, which has only just arrived.

    QD, IOPS, access time, and block size of random access.

    The smallest block size the file system operates with is the cluster size. For NTFS this defaults to 4KB.

    Flash SSDs also (normally) operate with 4KB as their base block size.

    QD = Queue Depth. This is the number of outstanding IOs towards storage, unanswered requests if you like. The OS may send as many requests as it wants at the same time, but for the SSD to be able to utilize this it needs to support NCQ (Native Command Queuing), and the drivers and HBA (the SATA controller between CPU and SSD) must also support it. NCQ allows an SSD to receive up to 32 requests at the same time and process them internally in the fastest possible way, regardless of the order they were received.
    SSDs have multiple flash channels: Indilinx Barefoot and Samsung based SSDs have 4, the Intel X25-M has 10. The number of channels is the same as the number of requests the SSD can answer in parallel. If the _momentary_ queue depth (the number of requests sent in a burst plus unanswered requests waiting in the SSD) is, for example, 6, an X25-M can answer ALL of the requests in parallel and have NONE outstanding, while a Vertex can answer 4 of them in parallel and have 2 outstanding, then answer the remaining 2 when 2 of the previous requests have finished. As you can see, an X25-M is likely to have far fewer unanswered requests waiting than a Vertex, and as such the QD shown in the OS will likely be lower for the X25-M.
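    A tiny sketch of that bookkeeping (my own illustration using the channel counts given above, assuming the best case where no two requests land on the same channel):

    import math

    def rounds_needed(burst, channels):
        """Requests served per round = number of flash channels."""
        return math.ceil(burst / channels)

    # Burst of 6 outstanding 4KB reads, as in the example above:
    print(rounds_needed(6, 10))  # X25-M, 10 channels -> 1 round, nothing left waiting
    print(rounds_needed(6, 4))   # Vertex, 4 channels -> 2 rounds, 2 requests queue up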

    The OS will show the AVERAGE QD between each update, not the momentary QD I mentioned in the example above. An average QD above 1 shown in the OS means the storage is not able to answer requests as fast as they arrive and has a backlog. However, the maximum momentary QD may be several times the average QD when bursts of small requests are sent. In these cases an SSD with more channels can utilize more of them and finish the job faster, given that each of its channels is as fast or faster than the competition's. This is where we begin to talk IOPS.

    IOPS = IO operations per second, or requests per second if you like. The maximum number of IOPS an SSD can theoretically do is equal to the number of channels times the IOPS per channel, where IOPS per channel is 1/[average access time in seconds]. Most MLC SSDs can do about 5000 IOPS per channel, but the scaling when working in parallel across multiple channels is limited by the controller, which has to administrate and match the outstanding requests with the channels that can answer them.
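    Plugging rough numbers into that relation (my own sketch of the theoretical bound; the 0.2 ms figure is just 1/5000, and real drives fall short of it because of the controller overhead described next):

    def theoretical_iops(channels, access_time_s):
        """Upper bound: channels * (1 / average access time per channel)."""
        return channels / access_time_s

    # ~5000 IOPS per channel corresponds to a 0.2 ms per-request access time.
    print(theoretical_iops(4, 0.0002))    # Vertex-class, 4 channels  -> 20000
    print(theoretical_iops(10, 0.0002))   # X25-M-class, 10 channels  -> 50000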

    A Vertex with 4 channels can do 5000 IOPS at QD=1, and about 16000 IOPS at QD=5 (QD > channels ensures that even when some outstanding requests land on the same channel, no channel is left unused). Increasing the QD further has no effect on the Vertex since all 4 channels are already saturated and delivering max performance. As you notice, 16000 is less than 4x5000; the missing 4000 IOPS is the time the controller needs to administrate and match requests to channels.

    An x25-M with 10 channels can also do 5000 IOPS with QD=1, and about 18000 with QD=5 (at which point MAXIMUM 5 channels are in use), but scales further to about 30-32000 IOPS at QD=12 (at which point most channels are in use), and can reach up to 40000 IOPS at QD=32 (it varies with setups from 32-40K).

    By adding more drives in RAID, you increase the number of channels available to answer requests in parallel. By RAIDing 2 Vertex drives, you have 8 channels that can answer up to 8 requests in parallel. This means 2 Vertex in RAID 0 can do roughly 30-32000 IOPS at QD=10, when all channels are saturated.
    RAIDing 2 x25-Ms will give you 20 channels, so you would need over QD=20 to saturate them, but when that happens, you can answer up to 70-80000 4KB random read requests per second.
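    The same bookkeeping extends to RAID if you add up the channels and apply a rough efficiency factor for controller overhead; the ~0.8 factor below is my own fudge, chosen so the estimates land near the numbers quoted above:

    def raid_random_read_iops(drives, channels_per_drive,
                              iops_per_channel=5000, efficiency=0.8):
        """Rough estimate once QD is high enough to keep every channel busy."""
        return int(drives * channels_per_drive * iops_per_channel * efficiency)

    print(raid_random_read_iops(1, 4))    # 1x Vertex -> ~16000 (saturates near QD 5)
    print(raid_random_read_iops(2, 4))    # 2x Vertex -> ~32000 (near QD 10)
    print(raid_random_read_iops(2, 10))   # 2x X25-M  -> ~80000 (needs QD > 20)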

    With larger random read blocks, the x25-M uses all its 10 channels to split the larger blocks up into 4KB chunks, like RAID, or at least something like this. Because of this, it can answer 8KB blocks faster but fewer of them in parallel, and you end up with 5 8KB blocks in parallel at almost 4KB access times, 2.5 16KB blocks in parallel, and 1.25 32KB blocks (meaning it can do 1 32KB block + 1 4KB block, even if that 4KB is the first part of a new 32KB block). You get the point. By adding more x25-Ms in RAID, you can answer more larger-block random requests in parallel, and at the same time you decrease the chance that requests hit the same channel, which causes waiting for the previous one to finish. Adding more SSDs in RAID like this also reduces the organisational workload on each SSD controller when QD gets high, and results in lower access times at higher QD.
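    That block-size trade-off reduces to a simple ratio (again my own illustration, assuming the drive stripes every request into 4KB chunks across its channels):

    def large_blocks_in_parallel(channels, block_kb, chunk_kb=4):
        """How many random reads of a given size fit across the channels at once."""
        return channels * chunk_kb / block_kb

    for block_kb in (4, 8, 16, 32):
        print(block_kb, "KB ->", large_blocks_in_parallel(10, block_kb))
    # 10-channel X25-M: 10, 5, 2.5 and 1.25 blocks in parallel, as described above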

    The CPU, as you know, operates at gigahertz, finishing operations in nanoseconds. Level 1 cache has a few ns access time, L2 cache has very low double-digit ns access time, L3 cache has a slightly higher double-digit ns access time, and then you get to RAM, which has very roughly 50ns (+-20ns or more). As you can see, the CPU has to wait very few cycles for its internal cache, and somewhat longer when it needs data from RAM.

    When the CPU needs to access storage, things change dramatically. Hard drives have access times around 10-15ms; this is a million times slower than L2 and L3 cache, meaning the CPU has to wait millions of cycles for data to arrive.
    With SSDs you normally get access times in the area of 0.1ms (for single 4KB requests). 0.1ms = 100µs, roughly 2000 times longer than RAM and 10000 times longer than L2-L3 cache. Still, a LOT lower than a million.
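    Putting those latencies side by side in clock cycles (assuming a 3 GHz CPU and the round figures from the two paragraphs above; all approximate):

    CLOCK_HZ = 3e9  # assume a ~3 GHz CPU

    latencies_s = {
        "L2/L3 cache": 10e-9,      # low double-digit ns
        "RAM": 50e-9,              # very roughly 50 ns
        "SSD (4KB read)": 100e-6,  # ~0.1 ms
        "Hard drive": 10e-3,       # 10-15 ms, using the low end
    }

    for name, t in latencies_s.items():
        print(f"{name}: ~{t * CLOCK_HZ:,.0f} cycles waited")
    # cache ~30, RAM ~150, SSD ~300,000, HDD ~30,000,000 cycles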
