Sunday Jan 10, 2010

Pitfalls of Benchmarking Flash with dd(1M)

We frequently see questions on flash benchmarking, specifically as it relates to traditional disk and storage microbenchmarks. This article will walk through some examples of the pitfalls of using dd for flash benchmarking, and suggest an alternative strategy for measuring raw IO throughput. Many of these issues are not unique to dd, but they are easy to understand and explain in the context of the simple dd workload.

Background

The dd utility is a basic copy utility; it was never really intended as a benchmarking tool, but it has nonetheless become a rather popular sequential workload generator for quick tests. It's been my (very unofficial, unscientific) observation that it's not uncommon for folks to plug in a drive, then fire up a dd to see what kind of throughput they can achieve.


So What's the Problem?


Well, let's fire up a dd on one of our flash modules, and see what happens. We'll do a simple sequential read workload, and use a basic iostat to monitor the throughput. For this particular example, I'm using one of the four flash modules in the Sun F20 PCIe Accelerator Card in my workstation.

# dd of=/dev/null if=/dev/rdsk/c1t0d0s2 count=200000
200000+0 records in
200000+0 records out

# iostat -xCn 1
...
                   extended device statistics                
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t1d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t2d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t3d0
16038.1    0.0 8019.1    0.0  0.0  0.9    0.0    0.1   0  86 c1
16038.1    0.0 8019.1    0.0  0.0  0.9    0.0    0.1   2  86 c1t0d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t1d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t2d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t3d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 unknown:vold(pid546)
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 ubur-home2.east:/export/home2/12/lisan
...


And the results are... 8MB/s. Not very impressive: iostat shows roughly 16,000 reads per second but only about 8,000KB/s, which works out to 512 bytes per read. So what's going wrong?

If we dig into the dd man page, we see the following:

    ibs=n

        Specifies the input block size in n  bytes  (default  is
        512).

    obs=n

        Specifies the output block size in n bytes  (default  is
        512).

    bs=n

        Sets both input and  output  block  sizes  to  n  bytes,
        superseding  ibs=  and obs=. If no conversion other than
        sync, noerror, and  notrunc  is  specified,  each  input
        block  is copied to the output as a single block without
        aggregating short blocks.


Wow. The default block size is 512 bytes. Today's flash devices are pretty uniformly 4k sector devices: this means you need to use block sizes that are multiples of 4k in order to achieve optimal performance. The dd default block size of 512 bytes is definitely likely to hurt performance.

With this in mind, let's go back and run the same test, but this time with the block size cranked up to 16k:

# dd of=/dev/null if=/dev/rdsk/c1t0d0s2 bs=16k count=200000
200000+0 records in
200000+0 records out

# iostat -xCn 1
...
                   extended device statistics                
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t1d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t2d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t3d0
8406.1    0.0 134497.3    0.0  0.0  0.9    0.0    0.1   0  93 c1
8406.1    0.0 134497.3    0.0  0.0  0.9    0.0    0.1   1  93 c1t0d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t1d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t2d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t3d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 unknown:vold(pid546)
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 ubur-home2.east:/export/home2/12/lisan


Bumping the block size up to 16k took our throughput from 8MB/s to 134MB/s. That's quite a bump! But we're actually still leaving a lot of performance on the table in this 16k sequential read case.

So now what's the problem?

In this particular case, it's the lack of parallelism: dd is a single-threaded benchmark. This can be observed in the iostat output by looking at the "actv" and "wait" queues. At any given point in time, there is only about one IO request outstanding (actv is 0.9; best to round that to 1).

Modern enterprise flash and SSDs may appear as a single disk drive to the operating system by virtue of their SAS and SATA interfaces. In this case, however, appearances are misleading. In actuality, each of these flash modules contains multiple NAND dies, and multiple channels supplying IO to those dies. Consequently, using a single-threaded benchmark can be performance limiting: it does not allow us to utilize the inherent parallelism built into these devices. So how does one get around this?

Quite simply, by using a benchmark tool that's a bit more up to the task. Our recommendation is vdbench: a multiplatform tool designed for benchmarking storage, freely available at vdbench.org.

With that in mind, let's rerun the same 16k blocksize sequential read workload as above, but this time using 32 threads of parallelism, with vdbench as a driver.

Here's a (4-line) vdbench config file to do 16k sequential reads using 32 threads:

sd=sd1,lun=/dev/rdsk/c1t0d0s2
wd=sr,sd=sd1,readpct=100,rhpct=0,seekpct=0,xfersize=16k
rd=default,elapsed=1m,interval=1,forthreads=(32),io=max
rd=seq_rd,wd=sr
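
Assuming that config is saved to a parameter file (the file name below is just illustrative), the run is kicked off with something along these lines; check the vdbench documentation for the exact invocation on your platform:

# ./vdbench -f seq_rd_16k.cfg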

And here is the output from iostat during the 32-thread vdbench run:
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   0.0    0.8    0.0    5.6  0.0  0.0    0.0    0.2   0   0 c0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0
   0.0    0.8    0.0    5.6  0.0  0.0    0.0    0.2   0   0 c0t1d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t2d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t3d0
15940.7    0.0 255051.1    0.0  0.0 31.8    0.0    2.0   0 100 c1
15940.7    0.0 255051.1    0.0  0.0 31.8    0.0    2.0   2 100 c1t0d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t1d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t2d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t3d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 unknown:vold(pid546)
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 ubur-home2.east:/export/home2/12/lisan


A couple of things to note from the iostat output above. First, we have bumped the performance up to 255MB/sec, a far cry from the 8MB/s we got with the default dd settings, and also a large improvement over the single-threaded 16k dd result of 134MB/sec. Second, the actv column now shows 31.8 outstanding IOs, which illustrates that we are processing large numbers of IOs in parallel.

There is another option here as well, and that is to ratchet up the block size even further. If the block size is large enough, a single-threaded workload such as dd(1M) can typically push enough data to fully utilize a flash device. Of course, you may not know if you are limited by lack of a parallel workload unless you are using a tool that allows you to vary the amount of parallelism!
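
For example, one quick way to try the large-block approach is to rerun the earlier dd with the block size cranked up further; the block size and count below are purely illustrative:

# dd of=/dev/null if=/dev/rdsk/c1t0d0s2 bs=1024k count=4000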

The Bottom Line

In order to see optimal performance from flash devices, both block size and parallelism are key. If you are going to benchmark IO on flash, be sure to take a look at these issues, whatever your benchmark of choice may be. Also, we really encourage you to check out vdbench. It runs on multiple platforms and OSes, is freely available, and is well suited for testing both modern flash devices and traditional storage.