Wednesday Nov 05, 2008

Throttling disks

The disk drivers in Solaris support SCSI tagged queuing and have done for a long time. This enables them to send more than one command to a LUN (logical unit) at a time. The number of commands that can be sent in parallel is limited, or throttled, by the disk drivers so that they never send more commands than the LUN can cope with. While it is possible for the LUN to respond with a “queue full” SCSI status to tell the driver that it cannot cope with any more commands, there are significant problems with relying on this approach:

  • Devices connected via fibre channel have to negotiate onto the loop to return the queue full status. This can mean that by the time the device manages to return queue full the host has already sent many more commands, leaving the LUN in a situation it cannot cope with, which typically results in the LUN resetting.

  • If the LUN is being accessed from more than one host it is possible for it to return queue full on the very first command. This makes it hard for the host to know when it will be safe to send a command, since it has none outstanding.

  • If the LUN is one of many LUNs on a single target it may share the total pool of commands that can be accepted across all the LUNs, and so again could respond with “queue full” on the first command sent to it.

  • In the two cases above, the total number of commands a single host can send to a single LUN will vary depending on conditions that the host simply cannot know, making adaptive algorithms unreliable.

All the above issues result in people taking the safest option and setting the throttle for a device as low as is required to guarantee the LUN never needs to send queue full, in some cases as low as 1. This is bad when limited to an individual LUN; it is terrible when done globally for the entire system.
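
If you really must lower a throttle, do it per device rather than globally. As a purely illustrative sketch, using the name:value form of the sd-config-list property on releases that support it (the vendor and product strings here are made up; substitute the inquiry strings of your device, with the vendor field padded to eight characters):

# Illustrative /kernel/drv/sd.conf fragment -- "ACME    WONDERLUN" is a
# made-up vendor/product pair, and 8 is not a recommendation.
sd-config-list = "ACME    WONDERLUN", "throttle-max:8";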

As soon as you get to the point where you hit the throttle two things happen:

  1. You are no longer transferring data over the interconnect (fibre channel, parallel SCSI or iSCSI) for writes. This data has to wait until another command completes before it can be transferred, which reduces the throughput of the device. Your writes can end up being throttled by reads and hence tend towards the speed of the spinning disk, if the read has to go to the disk, even though you may have a write cache.

  2. The command is queued on the waitq, which will increase the latency still further if the queue becomes deep. See the “Latency bubbles in your disk IO” entry below for information about disksort's effect on latency.

Given that the system will regularly dump large numbers of commands on devices for short periods of time, you want those commands to be handled as quickly as possible to minimize applications hanging while their IO completes. If you want to observe the maximum number of commands sent to a device then there is a D script here to do that; a sketch of such a script follows.
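
For those without access to that script, here is a minimal sketch of the idea. It assumes the sd driver's struct sd_lun field un_ncmds_in_transport holds the number of commands currently outstanding to the device, and it borrows the buf-to-sd_lun chain used by the scripts later on this page:

#!/usr/sbin/dtrace -s
#pragma D option quiet

/* On each command completion, record the peak number of commands that
 * were in flight to the LUN. Assumes sd_lun.un_ncmds_in_transport. */
fbt:sd:sdintr:entry
/ arg0 != 0 && ((struct scsi_pkt *)arg0)->pkt_private != NULL /
{
        this->buf = (struct buf *)((struct scsi_pkt *)arg0)->pkt_private;
        this->un = ((struct sd_xbuf *)this->buf->b_private)->xb_un;
        @maxcmds[(uintptr_t)this->un] = max(this->un->un_ncmds_in_transport);
}

END
{
        printa("sd_lun %x max commands in flight %@d\n", @maxcmds);
}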

So the advice for configuring storage would be:

Wednesday Feb 27, 2008

Latency Bubbles follow up

Following on from the latency bubbles in your IO posting. I have been asked two questions about this post privately:

  1. How can you map those long numbers in the output into readable entries, eg sd0.

  2. How can I confirm that disksort has been turned off?

The first one just requires another glob of D:

#pragma D option quiet

#define SD_TO_DEVINFO(un) ((struct dev_info *)((un)->un_sd->sd_dev))

#define DEV_NAME(un) \
        stringof(`devnamesp[SD_TO_DEVINFO(un)->devi_major].dn_name) /* ` */

#define DEV_INST(un) (SD_TO_DEVINFO(un)->devi_instance)


fbt:ssd:ssdstrategy:entry,
fbt:sd:sdstrategy:entry
{
        bstart[(struct buf *)arg0] = timestamp;
}

fbt:ssd:ssdintr:entry,
fbt:sd:sdintr:entry
/ arg0 != 0 /
{
        this->buf = (struct buf *)((struct scsi_pkt *)arg0)->pkt_private;
}

fbt:ssd:ssdintr:entry,
fbt:sd:sdintr:entry
/ this->buf /
{
        this->priv = (struct sd_xbuf *) this->buf->b_private;
}

fbt:ssd:ssdintr:entry,
fbt:sd:sdintr:entry
/ this->priv /
{
        this->un = this->priv->xb_un;
}

fbt:ssd:ssdintr:entry,
fbt:sd:sdintr:entry
/ this->buf && bstart[this->buf] && this->un /
{
        @l[DEV_NAME(this->un), DEV_INST(this->un)] =
                lquantize((timestamp - bstart[this->buf])/1000000, 0,
                60000, 60000);
        @q[DEV_NAME(this->un), DEV_INST(this->un)] =
                quantize((timestamp - bstart[this->buf])/1000000);
        bstart[this->buf] = 0;
}
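
Note that because the script uses #define it must be run through the C preprocessor, so invoke it with dtrace's -C flag; iotime.d is just a name I have picked for the file:

# dtrace -C -s iotime.d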


The second required a little bit of mdb. Yes, you could also get the same from dtrace, but mdb gives the immediate answer: firstly for all the disks that use the sd driver and then for instance 1.

# echo '*sd_state::walk softstate | ::print -at "struct sd_lun" un_f_disksort_disabled' | mdb -k
300000ad46b unsigned un_f_disksort_disabled = 0
60000e23f2b unsigned un_f_disksort_disabled = 0
# echo '*sd_state::softstate 1 | ::print -at "struct sd_lun" un_f_disksort_disabled' | mdb -k
300000ad46b unsigned un_f_disksort_disabled = 0
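
For completeness, a rough dtrace equivalent might look like the sketch below. It reads the same un_f_disksort_disabled flag that mdb printed above, reusing the buf-to-sd_lun chain from the script earlier in this entry, so it only reports once the next IO to an sd device completes:

# dtrace -qn '
fbt:sd:sdintr:entry
/ arg0 != 0 && ((struct scsi_pkt *)arg0)->pkt_private != NULL /
{
        this->buf = (struct buf *)((struct scsi_pkt *)arg0)->pkt_private;
        this->un = ((struct sd_xbuf *)this->buf->b_private)->xb_un;
        printf("sd_lun %p disksort disabled: %d\n", this->un,
            this->un->un_f_disksort_disabled);
        exit(0);
}'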

Friday Jan 11, 2008

Latency bubbles in your disk IO

The following was written in response to an email from a customer about monitoring IO in response to my scsi.d postings. Tim covers where disk IO requests can be queued in his posting titled “Where can I/O queue up in sd/ssd”, which I would recommend as a starting point.

The disk IO sub-systems are built to provide maximum throughput, which is most often the right thing. However, the weakness of tuning for throughput is that occasionally you can get some bizarre behaviour when it comes to latency. The way that optimum IO bandwidth is achieved is by sorting each IO by logical block address (LBA) and then issuing them in order to minimize head seek. This is documented in the disksort(9F) manual page.

So if you have a sequence of writes to blocks N, N+1, N+2, N-200, N+3, N+4, N+5, N+6, N+7 in that order, and your LUN has a queue depth, and therefore throttle, of 2[1], the IOs will actually be delivered to the LUN in this order: N, N+1, N+2, N+3, N+4, N+5, N+6, N+7, N-200. Hence there will be a significant latency applied to the IO going to LBA N-200, and in practice it is possible to have IO requests delayed on the waitq for many seconds (I have a pathological test case that can hold them there for the time it takes to perform an IO on nearly every block on the LUN, literally hours). You had better hope that that IO was not your important one!

This issue only comes into play once the disk driver has reached the throttle for the device, as up until that point each IO can be passed straight to the LUN for processing[2]. Once the driver has reached the throttle for the LUN it begins queuing IO requests internally and by default will sort them to get maximum throughput. Clearly the lower the throttle, the sooner you get into this potential scenario.
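
If you want to see how deep that internal queue is getting, here is a sketch; it assumes the sd_lun fields un_ncmds_in_driver and un_ncmds_in_transport, whose difference approximates the number of commands sitting on the waitq:

fbt:sd:sdintr:entry
/ arg0 != 0 && ((struct scsi_pkt *)arg0)->pkt_private != NULL /
{
        this->buf = (struct buf *)((struct scsi_pkt *)arg0)->pkt_private;
        this->un = ((struct sd_xbuf *)this->buf->b_private)->xb_un;
        /* commands held in the driver but not yet sent to the LUN */
        @waitq[(uintptr_t)this->un] = quantize(
            this->un->un_ncmds_in_driver - this->un->un_ncmds_in_transport);
}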

Now for the good news. For most disk arrays sorting by LBA does not make much sense, since the LUN will be made up of a number of drives and there will be a read cache and a write cache. So for these devices it makes sense to disable disksort and deliver the IO requests to the LUN in the order in which they are delivered to the disk driver. If you look in the source for sd.c you will see that we do this by default for most common arrays. To achieve this there is a flag, “disable disksort”, that can be set in sd.conf or ssd.conf depending on which driver is in use. See Michael's blog entry about editing sd.conf. While you are reading that entry, note that you can use it to set the throttle for individual LUNs, so you do not have to set [s]sd_max_throttle, which penalizes all devices rather than just the one you were aiming for, if you have just one that only has a small queue depth. You will see below why a small queue depth can be a really bad thing.
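
As an illustration only, on releases that accept the name:value form of the property, an entry disabling disksort (and setting a per-LUN throttle at the same time) might look like this; the vendor and product strings are invented:

# Illustrative sd.conf / ssd.conf fragment -- adjust to your device.
sd-config-list = "ACME    FASTARRAY", "disksort:false, throttle-max:32";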

So how could you spot these latency bubbles?

It will come as no surprise that the answer is dtrace. Using my pathological test case, but with it set to run for only 10 minutes against a single spindle, the following D produces a clear indication that all is not well:

fbt:ssd:ssdstrategy:entry,
fbt:sd:sdstrategy:entry
{
        start[(struct buf *)arg0] = timestamp;
}

fbt:ssd:ssdintr:entry,
fbt:sd:sdintr:entry
/ start[(this->buf = (struct buf *)((struct scsi_pkt *)arg0)->pkt_private)] != 0 /
{
        this->un = ((struct sd_xbuf \*) this->buf->b_private)->xb_un;
        @[this->un] = lquantize((timestamp - start[this->buf])/1000000, 
                 60000, 600000, 60000);
        @q[this->un] = quantize((timestamp - start[this->buf])/1000000);
        
        start[this->buf] = 0;
}

This produces the following output[3]; the times are in milliseconds:


dtrace: script 'ssdrwtime.d' matched 4 probes
^C

 

    6597960853440
           value  ------------- Distribution ------------- count    
         < 60000 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 377204   
           60000 |                                         0        
          120000 |                                         0        
          180000 |                                         0        
          240000 |                                         0        
          300000 |                                         0        
          360000 |                                         0        
          420000 |                                         0        
          480000 |                                         2        
          540000 |                                         300      
       >= 600000 |                                         0        

 
    6597960853440
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |                                         40       
               1 |                                         9        
               2 |                                         6        
               4 |                                         17       
               8 |                                         23       
              16 |                                         6        
              32 |                                         36       
              64 |@@                                       15407    
             128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   361660   
             256 |                                         0        
             512 |                                         0        
            1024 |                                         0        
            2048 |                                         0        
            4096 |                                         0        
            8192 |                                         0        
           16384 |                                         0        
           32768 |                                         0        
           65536 |                                         0        
          131072 |                                         0        
          262144 |                                         0        
          524288 |                                         302      
         1048576 |                                         0        

Now recall that my test case is particularly unpleasant, but it demonstrates the point: 300 IO requests took over 9 minutes, and they only actually got to complete as the test case was shutting down, while the vast majority of the IO requests completed in less than 256ms.


Now lets run the same pathological test with disksort disabled:

dtrace: script 'ssdrwtime.d' matched 4 probes
^C

    6597960853440
           value  ------------- Distribution ------------- count    
         < 60000 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 543956   
           60000 |                                         0        


    6597960853440
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |                                         30       
               1 |                                         21       
               2 |                                         30       
               4 |                                         0        
               8 |                                         0        
              16 |                                         50       
              32 |                                         3        
              64 |                                         384      
             128 |                                         505      
             256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  531169   
             512 |@                                        11764    
            1024 |                                         0        

Notice that the majority of the IO requests took longer now, falling in the 256ms bucket rather than the 128ms bucket, but none of the IO requests took many minutes.


Now my test case is pathological, but if you have drives with small queue depths and disksort is still enabled you are open to some quite spectacular latency bubbles. To mitigate this my advice is:

  1. Don't ever set the global [s]sd_max_throttle in /etc/system. Use the [s]sd.conf file to set the appropriate throttle for each device.

  2. Consider what is more important to you: throughput or latency. If it is latency, or if your LUN is on a storage array, then turn off disksort using the [s]sd.conf file.

  3. If you have pathological applications then understand that the IO subsystem can give you throughput or bounded latency, not both. So separate out the IO devices that need throughput from those for which latency is more important.

  4. Be aware that even “dumb” disk drives often implement disksort internally, so in some cases they can give similar issues when they have a queue depth of greater than 2[4]. In those cases you may find it better to throttle them down to a queue depth of 2 and disable disksort in [s]sd to get the most predictable latency, albeit at the expense of throughput. If this is your issue then you can spot it either by using scsi.d directly or by modifying it to produce aggregations like those above. I'll leave that as an exercise for the reader.


[1] The queue depth of a LUN is the number of commands that it can handle at the same time. The throttle is usually set to the same number and is used by the disk driver to prevent it sending more commands than the device can cope with.

[2] Now the LUN itself may then reorder the IO if it has more than two IOs in its internal queue.

[3] Edited to remove output for other drives.

[4] With a queue depth of 2 the drive cannot sort the IO requests: it has one command active and the next one waiting. When the active one completes, the waiting one will be actioned before a new command can come from the initiator.
