Tuesday Nov 27, 2012

ZFS for Database Log Files


I've been troubled by drop outs in CPU usage on my application server: the CPUs suddenly go from close to 90% busy to almost completely idle for a few seconds. Here is an example of a drop out, shown in a snippet of vmstat data taken while the application server was under a heavy workload.

# vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s3 s4 s5 s6   in   sy   cs us sy id
 1 0 0 130160176 116381952 0 16 0 0 0 0  0  0  0  0  0 207377 117715 203884 70 21 9
 12 0 0 130160160 116381936 0 25 0 0 0 0 0  0  0  0  0 200413 117162 197250 70 20 9
 11 0 0 130160176 116381920 0 16 0 0 0 0 0  0  1  0  0 203150 119365 200249 72 21 7
 8 0 0 130160176 116377808 0 19 0 0 0 0  0  0  0  0  0 169826 96144 165194 56 17 27
 0 0 0 130160176 116377800 0 16 0 0 0 0  0  0  0  0  1 10245 9376 9164 2  1 97
 0 0 0 130160176 116377792 0 16 0 0 0 0  0  0  0  0  2 15742 12401 14784 4 1 95
 0 0 0 130160176 116377776 2 16 0 0 0 0  0  0  1  0  0 19972 17703 19612 6 2 92
 14 0 0 130160176 116377696 0 16 0 0 0 0 0  0  0  0  0 202794 116793 199807 71 21 8
 9 0 0 130160160 116373584 0 30 0 0 0 0  0  0 18  0  0 203123 117857 198825 69 20 11


This behavior occurred consistently while the application server was processing synthetic transactions: HTTP requests from JMeter running on an external machine.
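
A load like this can be driven from JMeter's non-GUI mode; an invocation along these lines (the test plan and result file names here are hypothetical) generates the HTTP workload from the external machine:

$ jmeter -n -t webapp-load.jmx -l results.jtl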

I explored many theories trying to explain the drop outs, including:
  • Unexpected JMeter behavior
  • Network contention
  • Java Garbage Collection
  • Application Server thread pool problems
  • Connection pool problems
  • Database transaction processing
  • Database I/O contention

Graphing the CPU %idle led to a breakthrough:

[Figure: application server CPU %idle over time, showing the drop outs (AppServerIdle.jpg)]

Several of the drop outs were 30 seconds apart. With that insight, I went digging through the data again, looking for other outliers that were 30 seconds apart. In the database server statistics, I found spikes in the iostat "asvc_t" column (average service time of active disk transactions, in milliseconds) for the disk drive that was being used for the database log files.
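
The statistics below are in the Solaris "iostat -xn" extended device statistics format; a stream like this can be captured over time with something along these lines (the interval and output file are assumptions):

# iostat -xn 5 > /var/tmp/iostat.out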

Here is an example:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 2053.6    0.0 8234.3  0.0  0.2    0.0    0.1   0  24 c3t60080E5...F4F6d0s0
    0.0 2162.2    0.0 8652.8  0.0  0.3    0.0    0.1   0  28 c3t60080E5...F4F6d0s0
    0.0 1102.5    0.0 10012.8  0.0  4.5    0.0    4.1   0  69 c3t60080E5...F4F6d0s0
    0.0   74.0    0.0 7920.6  0.0 10.0    0.0  135.1   0 100 c3t60080E5...F4F6d0s0
    0.0  568.7    0.0 6674.0  0.0  6.4    0.0   11.2   0  90 c3t60080E5...F4F6d0s0
    0.0 1358.0    0.0 5456.0  0.0  0.6    0.0    0.4   0  55 c3t60080E5...F4F6d0s0
    0.0 1314.3    0.0 5285.2  0.0  0.7    0.0    0.5   0  70 c3t60080E5...F4F6d0s0
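
With the samples saved to a file, outliers like these are easy to flag. For example, this one-liner (the 50 ms threshold and the device pattern are illustrative) prints any sample where asvc_t, the 8th column, exceeds 50 ms on the log disk:

# awk '/F4F6/ && $8 > 50' /var/tmp/iostat.out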

Here is a little more information about my database configuration:
  • The database and application server were running on two different SPARC servers
  • Storage for the database was on a storage array connected via 8 gigabit Fibre Channel
  • Data and log files were on different physical disk drives
  • Reliable low-latency I/O was provided by battery-backed NVRAM
  • Highly available:
    • Two Fibre Channel links accessed via MPxIO
    • Two mirrored cache controllers
    • The log file physical disks were mirrored in the storage device
  • Database log files were on a ZFS filesystem with cutting-edge technologies, such as copy-on-write and end-to-end checksumming

Why would I be getting service time spikes in my high-end storage? First, I wanted to verify that the database log disk service time spikes aligned with the application server CPU drop outs, and they did:

[Figure: application server CPU %idle overlaid with database log disk service time, showing the spikes aligned with the drop outs (AppServerIdleLogService.jpg)]

At first, I guessed that the disk service time spikes might be related to the storage array flushing its write cache, but I was unable to validate that theory.

After searching the WWW for a while, I decided to try using a separate log device:

# zpool add ZFS-db-41 log c3t60080E500017D55C000015C150A9F8A7d0

The ZFS log device is configured in the same manner as the storage described above: two physical disks mirrored in the storage array.
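
Since the mirroring is done inside the array, the log is presented to ZFS as a single LUN. If the disks were instead exported individually, ZFS could provide the same protection by mirroring the log itself; a sketch, with hypothetical device names:

# zpool add ZFS-db-41 log mirror c3tAAAAd0 c3tBBBBd0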

This change to the database storage configuration eliminated the application server CPU drop outs:

[Figure: application server CPU %idle after adding the separate log device, with the drop outs gone (AppServerIdleAfter.jpg)]

Here is the zpool configuration:

# zpool status ZFS-db-41
  pool: ZFS-db-41
 state: ONLINE
 scan: none requested
config:

        NAME                                     STATE
        ZFS-db-41                                ONLINE
          c3t60080E5...F4F6d0  ONLINE
        logs
          c3t60080E5...F8A7d0  ONLINE
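
The per-vdev traffic, including the separate log device, can also be watched directly with zpool iostat:

# zpool iostat -v ZFS-db-41 5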


Now, the I/O spikes look like this:

                    extended device statistics             
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1053.5    0.0 4234.1  0.0  0.8    0.0    0.7   0  75 c3t60080E5...F8A7d0s0
                    extended device statistics             
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1131.8    0.0 4555.3  0.0  0.8    0.0    0.7   0  76 c3t60080E5...F8A7d0s0
                    extended device statistics             
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1167.6    0.0 4682.2  0.0  0.7    0.0    0.6   0  74 c3t60080E5...F8A7d0s0
    0.0  162.2    0.0 19153.9  0.0  0.7    0.0    4.2   0  12 c3t60080E5...F4F6d0s0
                    extended device statistics             
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1247.2    0.0 4992.6  0.0  0.7    0.0    0.6   0  71 c3t60080E5...F8A7d0s0
    0.0   41.0    0.0   70.0  0.0  0.1    0.0    1.6   0   2 c3t60080E5...F4F6d0s0
                    extended device statistics             
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1241.3    0.0 4989.3  0.0  0.8    0.0    0.6   0  75 c3t60080E5...F8A7d0s0
                    extended device statistics             
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1193.2    0.0 4772.9  0.0  0.7    0.0    0.6   0  71 c3t60080E5...F8A7d0s0

We can see the steady flow of roughly 4k writes to the ZIL device, generated by the O_SYNC database log file writes. The occasional bursts of larger writes to the data device are the transaction group flushes.
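
The 30-second spacing of the original drop outs is consistent with the transaction group sync interval, which has defaulted to 30 seconds on some Solaris releases. Assuming the stock zfs_txg_timeout kernel variable, the live value can be inspected with mdb:

# echo zfs_txg_timeout/D | mdb -k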

As with almost all problems I run into, once I thoroughly understood this one, I found that other people had documented similar experiences. Thanks to all of you who have documented alternative approaches.

Saved for another day: now that the problem is obvious, I should try "zfs:zfs_immediate_write_sz" as recommended in the ZFS Evil Tuning Guide.
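
A tunable like that would be set in /etc/system. Here is a sketch of the syntax; the value is a placeholder that would need testing, not a recommendation:

* Threshold (bytes) above which ZFS writes data blocks directly to the
* pool rather than copying them into the log; placeholder value only.
set zfs:zfs_immediate_write_sz = 8192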
