Thursday Dec 10, 2009

ZFS likes to have ECC RAM

I have been using a custom-built {ZFS,OpenSolaris}-based NAS at home for more than a year. The machine was built partly from second-hand components (e.g. the motherboard), partly from unused in-house iron, and only a minority of the parts was brand new (more on that in a separate entry). It has been running constantly, serving data occasionally under a very light load.

One day I needed to perform some administrative task and realized it was not possible to SSH into the machine. Console login revealed that the uptime was just a couple of days and that both pools (the root pool and the data pool) contained a staggering number of checksum errors. In the /var/crash/ directory there were a couple of crash dumps. Some of them were corrupted and mdb(1) refused to load them or reported garbage. The times of the crashes corresponded to the Sunday night scrubbing of each of the pools. At least two of the dumps contained an interesting and fairly obvious stack trace. I no longer have the file, so here's just the entry from the log:

Nov  1 02:27:20 chiba \^Mpanic[cpu0]/thread=ffffff0190914040: 
Nov  1 02:27:20 chiba genunix: [ID 683410 kern.notice] BAD TRAP: type=d (#gp General protection) rp=ffffff0006822380 addr=488b
Nov  1 02:27:20 chiba unix: [ID 100000 kern.notice] 
Nov  1 02:27:20 chiba unix: [ID 839527 kern.notice] sh: 
Nov  1 02:27:20 chiba unix: [ID 753105 kern.notice] #gp General protection
Nov  1 02:27:20 chiba unix: [ID 358286 kern.notice] addr=0x488b
Nov  1 02:27:20 chiba unix: [ID 243837 kern.notice] pid=740, pc=0xfffffffffba0373a, sp=0xffffff0006822470, eflags=0x10206
Nov  1 02:27:20 chiba unix: [ID 211416 kern.notice] cr0: 8005003b cr4: 6f8
Nov  1 02:27:20 chiba unix: [ID 624947 kern.notice] cr2: fee86fa8
Nov  1 02:27:20 chiba unix: [ID 625075 kern.notice] cr3: b96a0000
Nov  1 02:27:20 chiba unix: [ID 625715 kern.notice] cr8: c
Nov  1 02:27:20 chiba unix: [ID 100000 kern.notice] 
Nov  1 02:27:20 chiba unix: [ID 592667 kern.notice]     rdi: ffffff018b1e1c98 rsi: ffffff01a032dfb8 rdx: ffffff0190914040
Nov  1 02:27:20 chiba unix: [ID 592667 kern.notice]     rcx: ffffff018ef054b0  r8:                c  r9:                b
Nov  1 02:27:20 chiba unix: [ID 592667 kern.notice]     rax: ffffff01a032dfb8 rbx:                0 rbp: ffffff00068224a0
Nov  1 02:27:20 chiba unix: [ID 592667 kern.notice]     r10:                0 r11:                0 r12: ffbbff01a032d740
Nov  1 02:27:20 chiba unix: [ID 592667 kern.notice]     r13: ffffff01a032dfb8 r14: ffffff018b1e1c98 r15:             488b
Nov  1 02:27:20 chiba unix: [ID 592667 kern.notice]     fsb:                0 gsb: fffffffffbc30400  ds:               4b
Nov  1 02:27:20 chiba unix: [ID 592667 kern.notice]      es:               4b  fs:                0  gs:              1c3
Nov  1 02:27:20 chiba unix: [ID 592667 kern.notice]     trp:                d err:                0 rip: fffffffffba0373a
Nov  1 02:27:20 chiba unix: [ID 592667 kern.notice]      cs:               30 rfl:            10206 rsp: ffffff0006822470
Nov  1 02:27:20 chiba unix: [ID 266532 kern.notice]      ss:               38
Nov  1 02:27:20 chiba unix: [ID 100000 kern.notice] 
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822260 unix:die+10f ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822370 unix:trap+43e ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822380 unix:_cmntrap+e6 ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff00068224a0 genunix:kmem_slab_alloc_impl+3a ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff00068224f0 genunix:kmem_slab_alloc+a1 ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822550 genunix:kmem_cache_alloc+130 ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff00068225c0 zfs:dbuf_create+4e ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff00068225e0 zfs:dbuf_create_bonus+2a ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822630 zfs:dmu_bonus_hold+7e ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff00068226c0 zfs:zfs_zget+5a ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822780 zfs:zfs_dirent_lock+3fc ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822820 zfs:zfs_dirlook+d9 ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff00068228a0 zfs:zfs_lookup+25f ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822940 genunix:fop_lookup+ed ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822b80 genunix:lookuppnvp+3a3 ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822c20 genunix:lookuppnatcred+11b ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822c90 genunix:lookuppn+5c ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822e90 genunix:exec_common+1ac ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822ec0 genunix:exece+1f ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822f10 unix:brand_sys_syscall32+19d ()
Nov  1 02:27:20 chiba unix: [ID 100000 kern.notice] 
Nov  1 02:27:20 chiba genunix: [ID 672855 kern.notice] syncing file systems...

Also, besides the messages on the console, I found entries like this one in /var/adm/messages:

Nov  2 12:15:01 chiba genunix: [ID 647144 kern.notice] ksh93: Cannot read /lib/amd64/ld.so.1

Later on, the condition of the machine worsened: some commands could no longer be executed due to I/O errors, up to the point where the machine had to be halted.

The panic occurring in the kmem routines, the load of checksum errors on both mirrored pools (the same number of errors on each disk in the mirror), and the fact that the system had been running the same build for a couple of months without a problem led me to try memtest.

The errors started appearing on the screen within the first couple of seconds of the run. It turned out that one of the three 1GB DDR2 modules had gone bad. In case you're wondering, the DIMMs were bought new a year ago, were branded (all of them the same type, from a brand known for gaming/overclocking equipment) and had aluminium heat sinks on them, so no low-quality stuff.

I was able to recover the data from past snapshots and replaced the RAM with ECC DIMMs (which required a new motherboard+CPU combo). This is a nice case of semi-silent data corruption detection. Without checksums the machine would have happily kept panicking and corrupting data without giving any clear indication of what was going on (e.g. which files were corrupted). So, even for a home NAS solution, ECC RAM is good (if not essential) to have.
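
For the record, finding out which files were hit does not require any special tooling: a scrub followed by a verbose pool status lists them, and the error counters can be reset once the bad DIMM is out (the pool name below is only illustrative):

  # zpool scrub data
  # zpool status -v data
  # zpool clear data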

FMA should do the right thing if one of the ECC modules goes bad, which means it will not allow the bad pages to be used (the pages will be retired). The list of retired pages is persistent across reboots. More on FMA and ECC RAM can be found e.g. in this discussion on fm-discuss, in the FMA and DIMM serial numbers entry on Rob Johnston's blog, or in the Eversholt rules for AMD in usr/src/cmd/fm/eversholt/files/i386/i86pc/amd64.esc.
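
The state of the diagnosis can be inspected from the command line as well; fmadm(1M) lists resources currently diagnosed as faulty and fmdump(1M) shows the fault and error logs kept by fmd(1M) (the exact output differs between builds):

  # fmadm faulty
  # fmdump
  # fmdump -eV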

Thursday Apr 03, 2008

ZFS is going to save my laptop data next time

The flashback is still vivid even weeks later: the day before my presentation at the FIRST Technical Colloquium in Prague I brought my two-year-old laptop with the work-in-progress slides to the office. Since I wanted to finish the slides in the evening, a Live Upgrade process was fired off on the laptop to get a fresh Nevada version (of course, to show off during the presentation ;)).
LU is a very I/O intensive process and the red Ferrari notebooks tend to get _very_ hot. In the afternoon I noticed that the process had failed. To my astonishment, I/O operations started to fail. After a couple of reboots (and zpool status / fmadm faulty commands) it was obvious that the disk could not be trusted anymore. I was able to rescue some data from the ZFS pool which spanned the biggest slice of the internal disk, but not all of it. (ZFS is not willing to hand out corrupted data.) My slides were lost, as was other data.

After some time I stumbled upon James Gosling's blog entry about ZFS mirroring on a laptop. This got me started (or, more precisely, I was astonished and wondered how this idea could have escaped me, since ZFS had been in Nevada for a long time by then) and I discovered several similar and more in-depth blog entries about the topic.
After some experiments with a borrowed USB disk it was time to make it a reality on a new laptop.

The process was a multi-step one:

  1. First I had to extend the free slice #7 on the internal disk so that it spanned the remaining space on the disk, because it had been trimmed after the experiments. In the end the slices looked like this in format(1) output:
    Part      Tag    Flag     Cylinders         Size            Blocks
      0       root    wm       3 -  1277        9.77GB    (1275/0/0)   20482875
      1 unassigned    wm    1278 -  2552        9.77GB    (1275/0/0)   20482875
      2     backup    wm       0 - 19442      148.94GB    (19443/0/0) 312351795
      3       swap    wu    2553 -  3124        4.38GB    (572/0/0)     9189180
      4 unassigned    wu       0                0         (0/0/0)             0
      5 unassigned    wu       0                0         (0/0/0)             0
      6 unassigned    wu       0                0         (0/0/0)             0
      7       home    wm    3125 - 19442      125.00GB    (16318/0/0) 262148670
      8       boot    wu       0 -     0        7.84MB    (1/0/0)         16065
      9 alternates    wu       1 -     2       15.69MB    (2/0/0)         32130
    
  2. Then the USB drive was connected to the system and recognized via format(1):
    AVAILABLE DISK SELECTIONS:
           0. c0d0 
              /pci@0,0/pci-ide@12/ide@0/cmdk@0,0
           1. c5t0d0 
              /pci@0,0/pci1025,10a@13,2/storage@4/disk@0,0
    
  3. The Live Upgrade boot environment which was not active was deleted via ludelete(1M) and the slice was commented out in /etc/vfstab. This was needed to make zpool(1M) happy.
  4. A ZFS pool was created out of the slice on the internal disk (c0d0s7) and the external USB disk (c5t0d0). I had to force it because zpool(1M) complained about the overlap of c0d0s2 (the slice spanning the whole disk) and c0d0s7:
    # zpool create -f data mirror c0d0s7 c5t0d0
    
    For a while I struggled with finding a name for the pool (everybody seems to either stick to the 'tank' name or come up with some double-cool-stylish name, which I wanted to avoid because the excitement of such a name is likely to wear off quickly) but then chose the ordinary data (it's what it is, after all).
  5. I verified that it is possible to disconnect the USB disk and safely reconnect it while an I/O operation is in progress:
    root:moose:/data# mkfile 10g /data/test &
    [1] 10933
    root:moose:/data# zpool status
      pool: data
     state: ONLINE
     scrub: none requested
    config:
    
    	NAME        STATE     READ WRITE CKSUM
    	data        ONLINE       0     0     0
    	  mirror    ONLINE       0     0     0
    	    c0d0s7  ONLINE       0     0     0
    	    c5t0d0  ONLINE       0     0     0
    
    errors: No known data errors
    
    It survived without a hitch (okay, I had to wait a little longer for the zpool command to complete due to the still-ongoing I/O, but that was it) and the contents were resynced automatically after the USB disk was reconnected:
    root:moose:/data# zpool status
      pool: data
     state: DEGRADED
    status: One or more devices are faulted in response to persistent errors.
    	Sufficient replicas exist for the pool to continue functioning in a
    	degraded state.
    action: Replace the faulted device, or use 'zpool clear' to mark the device
    	repaired.
     scrub: resilver in progress for 0h0m, 3.22% done, 0h5m to go
    config:
    
    	NAME        STATE     READ WRITE CKSUM
    	data        DEGRADED     0     0     0
    	  mirror    DEGRADED     0     0     0
    	    c0d0s7  ONLINE       0     0     0
    	    c5t0d0  FAULTED      0     0     0  too many errors
    
    errors: No known data errors
    
    Also, under heavy I/O the pool needs to be cleared after the resilver completes (via zpool clear data) because the USB drive gets marked as faulted. Normally this will not happen (unless the drive really fails) because I will be connecting and disconnecting the drive only when powering on or shutting down the laptop, respectively.
  6. After that I used Mark Shellenbaum's blog entry about ZFS delegated administration (it was Mark who did the integration) and the ZFS Delegated Administration chapter from the OpenSolaris ZFS Administration Guide, created permission sets for my local user, and assigned those permissions to the ZFS pool 'data' and the user:
      # chmod A+user:vk:add_subdirectory:fd:allow /data
      # zfs allow -s @major_perms clone,create,destroy,mount,snapshot data
      # zfs allow -s @major_perms send,receive,share,rename,rollback,promote data
      # zfs allow -s @major_props copies,compression,quota,reservation data
      # zfs allow -s @major_props snapdir,sharenfs data
      # zfs allow vk @major_perms,@major_props data
    
    All of the commands had to be run as root.
  7. Now the user is able to create a home directory for himself:
      $ zfs create data/vk
    
  8. Time to set up the dataset hierarchy and prepare it for data. I separated the datasets according to a 'service level'. Some data are very important to me (e.g. presentations ;)), so I want them multiplied via the ditto blocks mechanism: with the copies dataset property set to 2 on top of a two-way mirror, each block is actually present four times. Also, documents are not usually accompanied by executable code, so the exec property was set to off, which prevents running scripts or programs from that dataset.
    Some data are volatile and bulky, so they do not need any additional protection, and it is a good idea to compress them with a stronger compression algorithm to save some space. The following table summarizes the setup:
       dataset             properties                       comment
     +-------------------+--------------------------------+------------------------+
     | data/vk/Documents | copies=2                       | presentations          |
     | data/vk/Saved     | compression=on exec=off        | stuff from web         |
     | data/vk/DVDs      | compression=gzip-8 exec=off    | Nevada ISOs for LU     |
     | data/vk/CRs       | compression=on copies=2        | precious source code ! |
     +-------------------+--------------------------------+------------------------+
    
    So the commands will be:
      $ zfs create -o copies=2 data/vk/Documents
      $ zfs create -o compression=gzip-3 -o exec=off data/vk/Saved
      ...
    
  9. Now it is possible to migrate all the data, change the home directory of the user to /data/vk (e.g. via /usr/ucb/vipw) and log in again.
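    One possible way to do the migration, run while the user is logged out (the old home directory location /export/home/vk is only an assumption, and usermod(1M) is an alternative to editing the passwd file by hand):
      # cd /export/home/vk && find . -print | cpio -pdm /data/vk
      # usermod -d /data/vk vk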

However, this is not the end but just the beginning. There are many things that could make the setup even better, to name a few:

  • set up some sort of automatic snapshots for selected datasets
    The set of scripts and the SMF service for ZFS snapshots and backup made by Tim Foster (see ZFS Automatic For The People and related blog entries) could be used for this task.
  • make zpool scrub run periodically (see the sketch after this list)
  • detect failures of the disks
    Ideally this would be visible in the Gnome panel or the Gnome system monitor.
  • set up off-site backup via SunSSH + zfs send (see the sketch after this list)
    This could be done using the hooks provided by Tim's scripts (see above).
  • set quotas and reservations for some of the datasets
  • install ZFS scripts for Gnome Nautilus so I will be able to browse, take and destroy snapshots in Nautilus. Now which set of scripts to use? Chris Gerhard's or Tim Foster's? Or should I just wait for official ZFS support in Nautilus to be integrated?
  • find out how exactly the recovery scenario (in case of laptop drive failure) will look
    Importing the ZFS pool from the USB disk should suffice, but during my experiments I was not able to complete it successfully.
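
For the periodic scrub and the off-site backup items, here is a minimal sketch (the schedule, the remote host backuphost, the receiving pool backup and the snapshot name are all made up for illustration). A root crontab entry scrubbing the pool every Sunday at 03:00:

  0 3 * * 0 /usr/sbin/zpool scrub data

And a one-shot off-site copy of a dataset over SunSSH:

  # zfs snapshot data/vk/Documents@offsite-20080403
  # zfs send data/vk/Documents@offsite-20080403 | ssh backuphost zfs receive -d backup

Subsequent runs could transfer just the difference between two snapshots with zfs send -i; zfs receive -d derives the dataset name on the receiving side from the name carried in the stream.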

With all of the above, the data should be safe from disk failure (after all, disks are often called "spinning rust", so they are going to fail sooner or later) and even from the loss of both the laptop and the USB disk.

Lastly, a philosophical thought: one of my colleagues considers hardware to be a necessary (and very faulty) layer which is only needed to make it possible to express ideas in software. This might seem extreme, but come to think of it, ZFS is special in this sense: being software that provides that bridge, its core idea is to isolate hardware faults.
