Thursday Apr 03, 2008

ZFS is going to save my laptop data next time

The flashback is still alive even weeks after: the day before my presentation at the FIRST Technical Colloquium in Prague I brought my two-year-old laptop with the work-in-progress slides to the office. Since I wanted to finish the slides in the evening, I fired off a Live Upgrade on the laptop to get a fresh Nevada build. (of course, to show off during the presentation ;))
LU is a very I/O-intensive process, and the red Ferrari notebooks tend to get _very_ hot. In the afternoon I noticed that the process had failed. To my astonishment, I/O operations started to fail. After a couple of reboots (and zpool status / fmadm faulty commands) it was obvious that the disk could not be trusted anymore. I was able to rescue some data from the ZFS pool, which spanned the biggest slice of the internal disk, but not all of it. (ZFS refuses to hand out corrupted data.) My slides were lost, along with other data.

After some time I stumbled upon James Gosling's blog entry about ZFS mirroring on a laptop. This got me started (or, more precisely, I was astonished and wondered how this idea had escaped me, given that ZFS had been in Nevada for a long time by then), and I discovered several similar and more in-depth blog entries on the topic.
After some experiments with a borrowed USB disk, it was time to make it a reality on a new laptop.

The process was a multi-step one:

  1. First I had to extend the free slice #7 on the internal disk so that it spanned the remaining space on the disk (it had been trimmed during the experiments). In the end the slices looked like this in format(1) output:
    Part      Tag    Flag     Cylinders         Size            Blocks
      0       root    wm       3 -  1277        9.77GB    (1275/0/0)   20482875
      1 unassigned    wm    1278 -  2552        9.77GB    (1275/0/0)   20482875
      2     backup    wm       0 - 19442      148.94GB    (19443/0/0) 312351795
      3       swap    wu    2553 -  3124        4.38GB    (572/0/0)     9189180
      4 unassigned    wu       0                0         (0/0/0)             0
      5 unassigned    wu       0                0         (0/0/0)             0
      6 unassigned    wu       0                0         (0/0/0)             0
      7       home    wm    3125 - 19442      125.00GB    (16318/0/0) 262148670
      8       boot    wu       0 -     0        7.84MB    (1/0/0)         16065
      9 alternates    wu       1 -     2       15.69MB    (2/0/0)         32130
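The resulting label can also be sanity-checked non-interactively; a minimal sketch, assuming the device path implied by the listing above:

```shell
# Print the VTOC of the whole disk; slice 2 ("backup") spans all cylinders,
# so its raw device can be used to read the label.
prtvtoc /dev/rdsk/c0d0s2
```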
  2. Then the USB drive was connected to the system and recognized via format(1):
           0. c0d0 
           1. c5t0d0 
  3. The inactive Live Upgrade boot environment was deleted via ludelete(1M) and its slice was commented out in /etc/vfstab. This was needed to make zpool(1M) happy.
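The cleanup could look roughly like this; a sketch, where the boot environment name `second_root` is hypothetical:

```shell
# List the boot environments and delete the inactive one.
lustatus
ludelete second_root
# Afterwards, comment out the freed slice's entry in /etc/vfstab.
```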
  4. A ZFS pool was created out of the slice on the internal disk (c0d0s7) and the external USB disk (c5t0d0). I had to force it because zpool(1M) complained about the overlap of c0d0s2 (the slice spanning the whole disk) and c0d0s7:
    # zpool create -f data mirror c0d0s7 c5t0d0
    For a while I struggled to find a name for the pool (everybody seems either to stick with 'tank' or to come up with some double-cool-stylish name, which I wanted to avoid because the excitement over such a name is likely to wear off), but then I settled on the ordinary data (it's what it is, after all).
  5. I verified that it is possible to disconnect the USB disk and safely reconnect it while an I/O operation is in progress:
    root:moose:/data# mkfile 10g /data/test &
    [1] 10933
    root:moose:/data# zpool status
      pool: data
     state: ONLINE
     scrub: none requested
    config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0d0s7  ONLINE       0     0     0
            c5t0d0  ONLINE       0     0     0

    errors: No known data errors
    It survived without a hitch (okay, I had to wait a little longer for the zpool command to complete due to the still-ongoing I/O, but that was it) and the contents were resilvered automatically after the USB disk was reconnected:
    root:moose:/data# zpool status
      pool: data
     state: DEGRADED
    status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
    action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
     scrub: resilver in progress for 0h0m, 3.22% done, 0h5m to go
    config:

        NAME        STATE     READ WRITE CKSUM
        data        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c0d0s7  ONLINE       0     0     0
            c5t0d0  FAULTED      0     0     0  too many errors

    errors: No known data errors
    Also, under heavy I/O the USB drive ends up marked as faulted, so the pool has to be cleared via zpool clear data after the resilver completes. Normally this will not happen (unless the drive has really failed) because I will be connecting and disconnecting the drive only when powering on or shutting down the laptop, respectively.
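So after plugging the disk back in, the sequence is roughly the following sketch (the zpool online step is only needed if the device stays FAULTED instead of resilvering on its own):

```shell
zpool online data c5t0d0   # bring the reattached device back into the mirror
zpool clear data           # reset the error counters once the resilver is done
zpool status -x data       # should now report the pool as healthy
```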
  6. After that I followed Mark Shellenbaum's blog entry about ZFS delegated administration (it was Mark who did the integration) and the ZFS Delegated Administration chapter of the OpenSolaris ZFS Administration Guide, created permission sets for my local user, and assigned those permissions to the ZFS pool 'data' and the user:
      # chmod A+user:vk:add_subdirectory:fd:allow /data
      # zfs allow -s @major_perms clone,create,destroy,mount,snapshot data
      # zfs allow -s @major_perms send,receive,share,rename,rollback,promote data
      # zfs allow -s @major_props copies,compression,quota,reservation data
      # zfs allow -s @major_props snapdir,sharenfs data
      # zfs allow vk @major_perms,@major_props data
    All of the commands had to be run as root.
  7. Now the user is able to create a home directory for himself:
      $ zfs create data/vk
  8. Time to set up the hierarchy of datasets and prepare it for the data. I separated the datasets according to a 'service level'. Some data are very important to me (e.g. presentations ;)), so I want them multiplied via the ditto-blocks mechanism: with the copies property set to 2 on a mirrored pool, each block is actually present four times. Also, documents are not usually accompanied by executable code, so the exec property was set to off, which prevents running scripts or programs from those datasets.
    Some data are volatile and come in high quantity, so they do not need any additional protection, and it is a good idea to compress them with a better compression algorithm to save some space. The following table summarizes the setup:
     | dataset           | properties                     | comment                |
     | data/vk/Documents | copies=2                       | presentations          |
     | data/vk/Saved     | compression=on exec=off        | stuff from web         |
     | data/vk/DVDs      | compression=gzip-8 exec=off    | Nevada ISOs for LU     |
     | data/vk/CRs       | compression=on copies=2        | precious source code ! |
    So the commands will be:
      $ zfs create -o copies=2 data/vk/Documents
      $ zfs create -o compression=gzip-3 -o exec=off data/vk/Saved
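For completeness, the remaining two datasets from the table would be created the same way (properties taken from the table above):

```shell
zfs create -o compression=gzip-8 -o exec=off data/vk/DVDs
zfs create -o compression=on -o copies=2 data/vk/CRs
```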
  9. Now it is possible to migrate all the data, change the user's home directory to /data/vk (e.g. via /usr/ucb/vipw), and log in again.

However, this is not the end of it, just the beginning. There are many things that could make the setup even better; to name a few:

  • setup some sort of automatic snapshots for selected datasets
    The set of scripts and SMF service for doing ZFS snapshots and backup (see ZFS Automatic For The People and related blog entries) made by Tim Foster could be used for this task.
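Until such a service is in place, even a tiny cron-driven one-liner would do; a minimal sketch (the dataset name is taken from the setup above, the date-stamped naming scheme is my assumption):

```shell
# Create a date-stamped snapshot of the important dataset.
snapname="data/vk/Documents@$(date +%Y-%m-%d)"
zfs snapshot "$snapname"
```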
  • make zpool scrub run periodically
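A root crontab entry could take care of that; a sketch, where the schedule and the path are assumptions:

```
# Scrub the 'data' pool every Sunday at 03:00.
0 3 * * 0 /usr/sbin/zpool scrub data
```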
  • detect failures of the disks
    This would be ideal to see in Gnome panel or Gnome system monitor.
  • setup off-site backup via SunSSH + zfs send
    This could be done using the hooks provided by Tim's scripts (see above).
  • Set quotas and reservations for some of the datasets.
  • Install the ZFS scripts for GNOME Nautilus so I will be able to browse, perform, and destroy snapshots in Nautilus. Now, which set of scripts to use? Chris Gerhard's or Tim Foster's? Or should I just wait for official ZFS support in Nautilus to be integrated?
  • Find out how exactly the recovery scenario (in case of laptop drive failure) will look.
    Importing the ZFS pool from the USB disk should suffice, but during my experiments I was not able to complete it successfully.
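My understanding is that the recovery should boil down to something like the following sketch (the -f flag forces the import when the pool was last used on a different system, e.g. the dead laptop):

```shell
# List the pools visible on the attached disks, then import ours.
zpool import
zpool import -f data
```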

With all of the above, the data should be safe from disk failure (after all, disks are often called "spinning rust", so they are going to fail sooner or later) and also from the loss of either the laptop or the USB disk.

Lastly, a philosophical thought: one of my colleagues considers hardware a necessary (and very faulty) layer that exists only to make it possible to express ideas in software. This might seem extreme, but come to think of it, ZFS is special in exactly this sense: it is software that provides that bridge, and its core idea is to isolate hardware faults.

