Sunday Jan 10, 2010

IO Sizes and Alignments with DTrace

In some previous postings I've talked a bit about the impact of IO size and alignments on the performance of flash devices. These parameters are easily controlled in a microbenchmarking environment, but can be quite difficult to determine in more complex application environments.

A lot of folks have asked me about using iostat in this context. The iostat utility is great for telling you the total number of reads and writes, but it doesn't convey information about the individual sizes and alignments (start addresses) of those IOs. You can easily figure out the *average* IO size from iostat, but that may or may not correlate well to the actual size of individual reads and writes, depending on the workload.
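For example, dividing kr/s by r/s in an iostat sample gives the average read size per device. A quick, purely illustrative one-liner (it assumes the iostat -xn column layout shown later in this post, and with no interval argument it reports since-boot averages) might look like:

# iostat -xn | awk '$1 ~ /^[0-9.]+$/ && $1 > 0 { printf "%-10s %6.1f KB avg read\n", $NF, $3/$1 }'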

One way of getting more visibility into IO sizes and alignments is to use DTrace to gather the information of interest. In this post, I'll share with you a script that has been used internally for this purpose. This should be viewed as an "unsupported, evolving" script, and I will post updates as we tweak it. Nonetheless, it seems to be a fairly useful bit of code and worth sharing. A link to the full script is provided at the bottom of this post.

A few bits of info about this script...

The script instruments a function in the sd driver, and basically just histograms the sizes of IO packets. It turns out that one of the difficult parts about doing this is not getting the packet size, but gathering the information for only the devices you are interested in.

One option is to restrict by using major/minor device numbers, but that is not the most intuitive thing from a user standpoint. These also tend to vary quite a bit, so one would always be hacking up the script. We needed a better way...

Paul Riethmuller came up with the swell idea of using the Product ID string that is stored when an inquiry command is done. For example:

# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0t1d0 
          /pci@0,0/pci108e,5354@1f,2/disk@1,0
       1. c0t2d0 
          /pci@0,0/pci108e,5354@1f,2/disk@2,0
       2. c0t3d0 
          /pci@0,0/pci108e,5354@1f,2/disk@3,0
       3. c1t0d0 
          /pci@0,0/pci8086,3408@1/pci1000,1000@0/sd@0,0
       4. c1t1d0 
          /pci@0,0/pci8086,3408@1/pci1000,1000@0/sd@1,0
       5. c1t2d0 
          /pci@0,0/pci8086,3408@1/pci1000,1000@0/sd@2,0
       6. c1t3d0 
          /pci@0,0/pci8086,3408@1/pci1000,1000@0/sd@3,0
Specify disk (enter its number): 6
selecting c1t3d0
[disk formatted]


FORMAT MENU:
        disk       - select a disk
        type       - select (define) a disk type
        partition  - select (define) a partition table
        current    - describe the current disk
        format     - format and analyze the disk
        fdisk      - run the fdisk program
        repair     - repair a defective sector
        label      - write label to the disk
        analyze    - surface analysis
        defect     - defect list management
        backup     - search for backup labels
        verify     - read and display labels
        save       - save new disk/partition definitions
        inquiry    - show vendor, product and revision
        volname    - set 8-character volume name
        !<cmd>     - execute <cmd>, then return
        quit

format> inquiry
Vendor:   ATA     
Product:  MARVELL SD88SA02
Revision: D20R

The script as currently presented below tracks the total number of IOs issued for all devices, but only keeps detailed statistics for those devices whose Product ID starts with "MARVEL". (This happens to be the Product ID returned if you run the inquiry command on a Sun F5100 or F20 flash device. ;-))

Also note that currently the script produces limited output every 5 seconds, but the full histogram of IO sizes is only printed when one hits ^C. Obviously this can be changed to suit. Also note I'm not much of a DTrace hacker; you can find much better in other places at Sun. ;-)
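For readers who just want the flavor of the approach before grabbing the full script, here is a minimal sketch of my own, not the align.d internals: it uses the stable io provider and matches on the device statname ("sd47" below is just a placeholder; substitute the device you care about), rather than instrumenting the sd driver and checking the inquiry Product ID the way align.d does.

#!/usr/sbin/dtrace -s
/*
 * Minimal sketch only -- not the full align.d script.
 * Histograms IO sizes and counts 4KB-misaligned IOs for one device,
 * selected by its statname ("sd47" is a placeholder).
 */

#pragma D option quiet

io:::start
/args[1]->dev_statname == "sd47"/
{
        /* histogram IO sizes, split by read vs. write */
        @sizes[args[0]->b_flags & B_READ ? "READ" : "WRITE"] =
            quantize(args[0]->b_bcount);
}

io:::start
/args[1]->dev_statname == "sd47" && (args[0]->b_blkno % 8) != 0/
{
        /* b_blkno is in 512-byte sectors, so % 8 tests 4KB alignment */
        @misaligned = count();
}

io:::start
/args[1]->dev_statname == "sd47" && (args[0]->b_bcount % 4096) != 0/
{
        @non4k = count();
}

dtrace:::END
{
        printa("IOs with a non-4KB-aligned start: %@d\n", @misaligned);
        printa("IOs whose size is not a multiple of 4KB: %@d\n", @non4k);
        printa(@sizes);
}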

Acknowledgements: Thank you, Paul Riethmuller, for all the good bits and ideas. Any broken bits are due solely to me.

Some sample output:

# ./align.d

4.966478729 seconds elapsed
       2 IOs issued

5.000003686 seconds elapsed
    2465 IOs issued
    2437 Marvell IOs issued
               5 RDs        2432 WRs
       0 Marvell IOs misaligned start
       5 Marvell IOs non-multiple of 4KB size
       0 Percent non-4k Marvell IOs

^C
11.461851773 seconds elapsed

    2774  Tot IOs issued
    2740 Marvell IOs issued
       0 Marvell IOs misaligned start addr
       5 Marvell IOs non-multiple of 4KB
       0 Percent non-4k Marvell IOs

 MARVELL WRITE IO 
 IO_size Count

               36                0
              512                0
                0                2
           131072             2733

 MARVELL READ IO 
 IO_size Count

                0                0
           131072                0
               36                2
              512                3

 MARVELL ALL IO 
 IO_size Count

                0                2
               36                2
              512                3
           131072             2733

align.d script

Updated 2/14/2010, minor bugfixes (errors in comments):
align-v2.d script

Pitfalls of Benchmarking Flash with dd(1M)

We frequently see questions on flash benchmarking, specifically as it relates to traditional disk and storage microbenchmarks. This article will walk through some examples of the pitfalls of using dd for flash benchmarking, and suggest an alternative strategy for measuring raw IO throughput. Many of these issues are not unique to dd, but they are easy to understand and explain in the context of the simple dd workload.

Background

The dd utility is a basic copy utility; it was never really intended as a benchmarking tool, but nonetheless has become a rather popular sequential workload generator for quick tests. It's been my (very unofficial, unscientific) observation that it's not uncommon for folks to plug in a drive, then fire up a dd to see what kind of throughput they can achieve.


So What's the Problem?


Well, let's fire up a dd on one of our flash modules, and see what happens. We'll do a simple sequential read workload, and use a basic iostat to monitor the throughput. For this particular example, I'm using one of the four flash modules in the Sun F20 PCIe Accelerator Card in my workstation.

# dd of=/dev/null if=/dev/rdsk/c1t0d0s2 count=200000
200000+0 records in
200000+0 records out

# iostat -xCn 1
...
                   extended device statistics                
r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t1d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t2d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t3d0
16038.1    0.0 8019.1    0.0  0.0  0.9    0.0    0.1   0  86 c1
16038.1    0.0 8019.1    0.0  0.0  0.9    0.0    0.1   2  86 c1t0d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t1d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t2d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t3d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 unknown:vold(pid546)
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 ubur-home2.east:/export/home2/12/lisan
...


And the results are... 8MB/s. Not very impressive. So what's going wrong?

If we dig into the dd man page, we see the following:

    ibs=n

        Specifies the input block size in n  bytes  (default  is
        512).

    obs=n

        Specifies the output block size in n bytes  (default  is
        512).

    bs=n

        Sets both input and  output  block  sizes  to  n  bytes,
        superseding  ibs=  and obs=. If no conversion other than
        sync, noerror, and  notrunc  is  specified,  each  input
        block  is copied to the output as a single block without
        aggregating short blocks.


Wow. The default block size is 512 bytes. Today's flash devices are pretty uniformly 4k-sector devices: this means you need to use block sizes that are a multiple of 4k in order to achieve optimal performance. The dd default block size of 512 bytes is definitely something likely to impact performance.

With this in mind, let's go back and run the same test, but this time with the block size cranked up to 16k:

# dd of=/dev/null if=/dev/rdsk/c1t0d0s2 bs=16k count=200000
200000+0 records in
200000+0 records out

# iostat -xCn 1
...
                   extended device statistics                
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t1d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t2d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t3d0
8406.1    0.0 134497.3    0.0  0.0  0.9    0.0    0.1   0  93 c1
8406.1    0.0 134497.3    0.0  0.0  0.9    0.0    0.1   1  93 c1t0d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t1d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t2d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t3d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 unknown:vold(pid546)
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 ubur-home2.east:/export/home2/12/lisan


Bumping the block size up to 16k took our throughput from 8MB/s to 134MB/s. That's quite a jump! But we're actually still leaving a lot of performance on the table in this 16k block size sequential read case.

So now what's the problem?

In this particular case, it's the lack of parallelism: dd is a single-threaded benchmark. This can be observed in the iostat output by looking at the "actv" and "wait" queues. At any given point in time there is, on average, only about 0.9 (best to round that to 1) IO request outstanding.

Modern enterprise flash and SSDs may appear as a single disk drive to the operating system by virtue of their SAS and SATA interfaces. In this case, however, appearances are misleading. In actuality, each of these flash modules contains multiple NAND dies, and multiple channels supplying IO to those dies. Consequently, using a single-threaded benchmark can be performance limiting: it does not allow us to utilize the inherent parallelism built into these devices. So how does one get around this?

Quite simply, by using a benchmark tool that is a bit more up to the task. Our recommendation is vdbench: a multiplatform tool designed for benchmarking storage, freely available at vdbench.org.

With that in mind, let's rerun the same 16k blocksize sequential read workload as above, but this time using 32 threads of parallelism, with vdbench as a driver.

Here's a (4-line) vdbench config file to do 16k sequential reads using 32 threads:
sd=sd1,lun=/dev/rdsk/c1t0d0s2
wd=sr,sd=sd1,readpct=100,rhpct=0,seekpct=0,xfersize=16k
rd=default,elapsed=1m,interval=1,forthreads=(32),io=max
rd=seq_rd,wd=sr
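Assuming the config above is saved in a parameter file (the name seq_rd_16k.parm here is just an example), the run is started with vdbench's -f flag. Vdbench prints its own per-interval statistics, but for an apples-to-apples comparison with the dd runs we'll again look at iostat.

# ./vdbench -f seq_rd_16k.parm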

And here is the output from iostat during the 32-thread vdbench run:
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   0.0    0.8    0.0    5.6  0.0  0.0    0.0    0.2   0   0 c0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0
   0.0    0.8    0.0    5.6  0.0  0.0    0.0    0.2   0   0 c0t1d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t2d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t3d0
15940.7    0.0 255051.1    0.0  0.0 31.8    0.0    2.0   0 100 c1
15940.7    0.0 255051.1    0.0  0.0 31.8    0.0    2.0   2 100 c1t0d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t1d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t2d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t3d0
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 unknown:vold(pid546)
   0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 ubur-home2.east:/export/home2/12/lisan


A couple of things to note from the iostat output above. First, we have bumped the performance up to 255MB/sec, a far cry from the 8MB/s we got with the default dd settings, and also a large improvement over the single-threaded 16k dd result of 134MB/sec. Second, the average number of outstanding IOs (the actv column) is now 31.8, which illustrates that we are processing large numbers of IOs in parallel.

There is another option here as well, and that is to ratchet up the block size even further. If the block size is large enough, a single-threaded workload such as dd(1M) can typically push enough data to fully utilize a flash device. Of course, you may not know whether you are limited by a lack of parallelism unless you are using a tool that allows you to vary it!
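For example, a run along these lines (the 1MB block size is an arbitrary illustrative choice, not a tuned value) gives dd far fewer, much larger IOs to issue against the same device we used above:

# dd of=/dev/null if=/dev/rdsk/c1t0d0s2 bs=1024k count=4000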

The Bottom Line

In order to see optimal performance from flash devices, both block size and parallelism are key. If you are going to benchmark IO on flash, be sure to take a look at these issues, whatever your benchmark of choice may be. Also, we really encourage you to check out vdbench. It runs on multiple platforms and OSes, is freely available, and is well suited for testing modern flash devices as well as traditional storage.