Friday Oct 27, 2006

Where to put ZFS filesystems in a pool

Before we had ZFS, I was always telling people not to put things in the root of a file system. Specifically, don't share the root of a file system. That way if you wanted to add another share to the file system later with different permissions you could, and all was good. It was (is) good advice.

With ZFS you end up with lots of file systems and the advice does not hold anymore. Where previously you were trying to share the file system resources now you would just create a new file system in a pool and have done with it.

Today I realized that for ZFS there is some similar advice worth following: don't put all your file systems in the root of the pool. Today's example is that I have a number of file systems and one zvol in a pool. It would be nice to be able to use a single recursive snapshot to back up all the file systems but not the zvol, since that zvol is my swap partition. While a snapshot of swap is kind of cool in that you can do it, it serves no purpose other than to use storage at an alarming rate.

So now I have moved all the file systems under one uber filesystem called “fs”:

# zfs list -t filesystem,volume
NAME                              USED  AVAIL  REFER  MOUNTPOINT
tank                             84.9G   187G  26.5K  /tank
tank/fs                          82.8G   187G  32.5K  /tank/fs
tank/fs/downloads                31.9G   187G  2.20G  /tank/fs/downloads
tank/fs/downloads/nv             26.9G   187G  37.5K  /tank/fs/downloads/nv
tank/fs/downloads/nv/46          2.49G   187G  2.48G  /tank/fs/downloads/nv/46
tank/fs/downloads/nv/47          2.45G   187G  2.45G  /tank/fs/downloads/nv/47
tank/fs/downloads/nv/48          2.55G   187G  2.45G  /tank/fs/downloads/nv/48
tank/fs/downloads/nv/49          2.46G   187G  2.45G  /tank/fs/downloads/nv/49
tank/fs/downloads/nv/50          2.52G   187G  2.45G  /tank/fs/downloads/nv/50
tank/fs/downloads/nv/51          2.50G   187G  2.46G  /tank/fs/downloads/nv/51
tank/fs/downloads/nv/tmp         12.0G   187G  4.78G  /tank/fs/downloads/nv/tmp
tank/fs/local                    66.8M   187G  57.0M  /tank/fs/local
tank/fs/opt                      1.67G  28.3G  25.5K  /tank/fs/opt
tank/fs/opt/SUNWspro              459M  28.3G   453M  /opt/SUNWspro
tank/fs/opt/csw                   340M  28.3G   121M  /opt/csw
tank/fs/opt/sfw                   907M  28.3G   880M  /opt/sfw
tank/fs/opt/spamd                 110K  28.3G  24.5K  /tank/fs/opt/spamd
tank/fs/shared                   12.1G  37.9G  28.5K  /tank/fs/shared
tank/fs/shared/music             5.71G  37.9G  5.70G  /tank/fs/shared/music
tank/fs/shared/pics              6.36G  37.9G  6.32G  /tank/fs/shared/pics
tank/fs/shared/projects           424K  37.9G  25.5K  /tank/fs/shared/projects
tank/fs/shared/projects/kitchen   376K  37.9G  46.5K  /tank/fs/shared/projects/kitchen
tank/fs/users                    25.4G  74.6G  32.5K  /tank/fs/users
tank/fs/users/user1               300K  74.6G  29.5K  /tank/fs/users/user1
tank/fs/users/user2              23.1G  74.6G  22.2G  /tank/fs/users/user2
tank/fs/users/user3              13.5M  74.6G  9.72M  /tank/fs/users/user3
tank/fs/users/user4               652M  74.6G   614M  /tank/fs/users/user4
tank/fs/users/user5              1.12G  74.6G   987M  /tank/fs/users/user5
tank/fs/users/user6               500M  74.6G   341M  /tank/fs/users/user6
tank/fs/var                      10.8G  19.2G  35.5K  /tank/fs/var
tank/fs/var/crash                5.10G  19.2G  5.09G  /var/crash
tank/fs/var/dhcp                  128K  19.2G    30K  /tank/fs/var/dhcp
tank/fs/var/log                  49.5K  19.2G    27K  /tank/fs/var/log
tank/fs/var/mail                  871M  19.2G   179M  /var/mail
tank/fs/var/mqueue                662K  19.2G  24.5K  /var/spool/mqueue
tank/fs/var/named                 442K  19.2G   369K  /tank/fs/var/named
tank/fs/var/openldap-data         130K  19.2G  82.5K  /tank/fs/var/openldap-data
tank/fs/var/opt                    46K  19.2G  24.5K  /tank/fs/var/opt
tank/fs/var/samba                  46K  19.2G  24.5K  /tank/fs/var/samba
tank/fs/var/tmp                  4.90G  19.2G  2.45G  /tank/fs/var/tmp
tank/fs/web                       920M   104M  2.56M  /tank/fs/web
tank/swap                        45.1M   189G  45.1M  -

I could have set the mount point of tank to “none” and that of tank/fs to “/tank”, but did not, to avoid potential confusion in the future. I should really also ask for “zfs snapshot -r” to have a -t option so that it could snapshot based on a type.
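Until something like that exists, the effect can be approximated by listing datasets by type and snapshotting each one in turn. This is only a rough sketch: the function name snapshot_commands and the snapshot name “backup” are made up for illustration, and unlike “zfs snapshot -r” the snapshots would be taken one after another rather than as a single atomic set. It prints the commands instead of running them.

```shell
# snapshot_commands NAME: read dataset names on stdin and print one
# "zfs snapshot" command per dataset (a dry run; nothing is executed).
# The function name and the snapshot name are hypothetical.
snapshot_commands() {
    while read -r fs; do
        if [ -n "$fs" ]; then
            printf 'zfs snapshot %s@%s\n' "$fs" "$1"
        fi
    done
}

# Demonstration with two dataset names typed in by hand.
printf 'tank/fs\ntank/fs/users\n' | snapshot_commands backup

# On a live system the list would come from zfs itself, for example:
#   zfs list -H -o name -t filesystem -r tank | snapshot_commands backup | sh
```

Because the list is restricted to file systems, the swap zvol is never touched, wherever it sits in the pool.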


Wednesday Oct 25, 2006

53394 snapshots

After almost 2 months of running ZFS at home things are stabilizing with the configuration. I'm still surprised by the number of file systems and even more surprised by the number of snapshots:

: pearson TS 1 $; zfs list -t filesystem | wc -l
: pearson TS 2 $; zfs list -t snapshot | wc -l
: pearson TS 3 $; df -h /tank
Filesystem             size   used  avail capacity  Mounted on
tank                   272G    33K   194G     1%    /tank
: pearson TS 4 $;

Approximately 1334 snapshots per file system. I've used the snapshots 3 times to recover various things I have cocked up (I'm refraining from using the F word after the storm in a teacup Tim's posting caused, even if no one would notice). However I sleep better knowing that my family's data is safe from their user error. Only I can mess them up!


Saturday Oct 21, 2006

Shared samba directories

The samba set up on the new server for users has been flawless, but the shared directories slightly less so. I had a problem where if one of the family created a directory then the rest of the family could not add to that directory. Looking on the Solaris side the problem was clear: the directory was created mode 755. Typing this I realize just how bad that is. 755 could not possibly mean anything to anyone who was not up to their armpits in UNIX computing, and the explanation would fill pages, and indeed it does.

The permissions I want to force for directories are "read, write and execute" for the group as well as the owner, i.e. mode 775. It would also be nice if I could stop one user deleting another user's work, so setting the sticky bit would also be good, giving mode 1775.

Trundling through the smb.conf manual page tells me that there is an option, "force directory mode" that does exactly what it implies and what I want. I'm sure I could achieve the same with an ACL and will do that later so that SMB and NFS give the same results. However for now smb.conf serves this purpose.

So the new entry in the smb.conf for the shared area where we keep pictures looks like this:

   comment = Pictures
   path = /tank/shared/pics
   public = yes
   writable = yes
   printable = no
   write list = @staff
   force directory mode = 1775
   force create mode = 0444
   root preexec = ksh -c '/usr/sbin/zfs snapshot tank/shared/pics@smb$(/tank/local/smbdate)'

Now everyone can add to the file system but can't delete others' photos, plus I get a snapshot every time someone starts to access the file system.
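For anyone not up to their armpits in octal, a small sketch shows what those mode numbers turn into. The helper name mode_string is invented for this example; it applies a mode to a throwaway directory and prints the permission string that ls reports.

```shell
# mode_string MODE: apply MODE to a scratch directory and print the first
# ten characters of its "ls -ld" line. The helper name is hypothetical.
mode_string() {
    d=$(mktemp -d) || return 1
    chmod "$1" "$d"
    ls -ld "$d" | cut -c1-10
    rm -rf "$d"
}

mode_string 755     # drwxr-xr-x : only the owner can add entries
mode_string 1775    # drwxrwxr-t : group can add, sticky bit guards deletes
```

The trailing "t" is the sticky bit: anyone in the group can create entries in the directory, but only the owner of an entry (or of the directory) can remove it.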


Friday Oct 13, 2006

Build 50@home & NAT

Build 50 of nevada hit my home server today with little fuss thanks to live upgrade. So far no unpleasant surprises, although I had to lose the zone for the web server as live upgrade in nevada, unlike live upgrade in Solaris 10, can't handle zones yet. I will however still use zones as testing grounds.

The system has been live now for a few weeks, doing NAT, firewall, email (IMAPS and SMTP) via exim with SpamAssassin and clamd for antivirus, Samba providing Windows server support, NTP, DNS and DHCP. I have fallen some way behind in the documentation of it though.

Getting NAT (Network Address Translation, for any non-geeks still here) working was a breeze. I simply followed the instructions on Ford's blog, substituting my network device (rtls0) in the right places and stopping before any of the zones stuff as I did not need it.

My /etc/ipf/ipnat.conf has ended up looking like this:

: pearson TS 15 $; cat /etc/ipf/ipnat.conf
map rtls0 -> 0/32 proxy port ftp ftp/tcp
map rtls0 -> 0/32 portmap tcp/udp auto
map rtls0 -> 0/32
: pearson TS 16 $;

and smf starts it without fault.


Thursday Sep 28, 2006

I'm an Xpert (sic)

BigAdmin are running an Ask the Xpert(sic) session on ZFS and that Xpert is me.

Very un-British to claim to be an Xpert but someone had to do it. Sorry about the photo, taken with a self timer and never really got it right.


Monday Sep 25, 2006

Has ZFS just saved my data?

My new home server has had its first ZFS checksum error. The problem here is that zfs has not told me what that error was so it is impossible for me to say how bad it is or, heaven forbid, whether it could be a false positive.

It leaves lots of questions in my mind about what ZFS does, if anything, to verify the kind of problem to attempt to narrow down where the fault is. Need to do some reading of the zfs source.

# zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
 scrub: scrub in progress, 0.01% done, 20h7m to go

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0s7  ONLINE       0     0     1
            c5d0s7  ONLINE       0     0     0

errors: No known data errors
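With more than a couple of devices it is easy to miss a lone non-zero counter in that table. A small filter can pick them out. This is a sketch that assumes the five-column layout shown above (NAME STATE READ WRITE CKSUM); the function name faulty_devices is made up.

```shell
# faulty_devices: read "zpool status" output on stdin and print every
# device line whose READ, WRITE or CKSUM counter is non-zero.
# Assumes the five-column layout shown above; the name is hypothetical.
faulty_devices() {
    awk 'NF == 5 && $2 ~ /^[A-Z]+$/ && ($3 + $4 + $5) > 0 {
        print $1, "read=" $3, "write=" $4, "cksum=" $5
    }'
}

# Demonstration on a fragment of the output above.
faulty_devices <<'EOF'
        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0s7  ONLINE       0     0     1
            c5d0s7  ONLINE       0     0     0
EOF
# prints: c1d0s7 read=0 write=0 cksum=1

# On a live system: zpool status tank | faulty_devices
```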

One thing I did straight away was to scrub the pool. However the scrub never completed; it just exercised the disks all weekend. Checking the OpenSolaris ZFS discussion forum showed that I was hitting this bug:

6343667 need itinerary so interrupted scrub/resilver doesn't have to start over

The scrub gets restarted whenever a snapshot is taken. Not so good if you snapshot every 10 minutes.


Thursday Sep 21, 2006

ZFS @ The Cambridge Solaris User Group

Last night I went and demonstrated ZFS at the Cambridge Solaris User Group. This was fun for 3 reasons:

  1. I got to see a presentation on Xen by Steven Hand.

  2. I got to see a presentation from Sun on Sun Ray and the global secure desktop.

  3. I got asked some interesting questions.

Most of the interesting questions I could give good answers to but the two that sort of stumped me were:

  1. ZFS quotas and snapshots. The question boiled down to a requirement to have snapshots not included in the users quota. Otherwise you get into the situation where the user can't delete anything as it is all backed by snapshots so there is no way to recover the space.

    Searching the ZFS mailing list shows that this has come up before in this thread. There is even a change request already filed:

    6431277 want filesystem-only quotas

  2. Permissions on the .zfs/snapshot mountpoints.

    The problem was this. Suppose a user has a file in their home directory and they make it mode 644. Then a snapshot is taken. Then the user realises that perhaps the permissions are inappropriate and changes them to 600. However the old version is still in the .zfs/snapshot directory with mode 644, hence readable.

    It is true that this really exposes a process issue, in that the data was public and since we don't have mandatory access control we really have to trust the users to do the right thing. If someone came across the file in the window between it being created and the permissions being fixed, the data is out. However, in the real world, the snapshot increases the risk.

    I'm left wondering if you should be able to set an ACL on the .zfs and/or .zfs/snapshot directory so that only the “owner” or owners of the file system could access the directory.

    6338043 need a method to access snapshots in alternate locations

    Seems to be a starting point, in that you could mount the snapshots under a directory of your choice with an ACL, but that would be a hack. Need to start this discussion over on the ZFS discussion forum.

All in all a pleasant evening even if I did not get home until after midnight. As I was leaving the event one of the locals was carrying his pannier to his bike to ride home and I actually thought it would have been cool to have brought the bike up by train and then ride home through the night. Only 100 miles. Luckily I did not think of this earlier!


Thursday Sep 07, 2006

How many snapshots?

With a non-laptop system that is on all the time, running zfs with automatic snapshots, you start to build up snapshots at an alarming rate.

# for i in $(zfs list -Ho name -t snapshot)
do
        zfs get -pH used $i
done | nawk '{ count++
        if ($3 > 0) {
                nonzero++
                total += $3
        }
}
END { print count, nonzero, total/(1024^2) }'
7071 188 83.939

So after one week I have 7071 snapshots of which only 188 currently contain data taking just 85 megabytes with the total pool taking 42.8G.

No downsides have been seen so far so while the numbers appear alarming I see no reason not to continue.


Tuesday Sep 05, 2006

Cleaning up zfs snapshots

Thank you to the anonymous comments about samba and ZFS and the clean up script.

A day's worth of samba snapshots looks like this:

tank/users/cjg@smb1157437900  37.5K      -  21.1G  -
tank/users/cjg@smb1157441840      0      -  21.1G  -
tank/users/cjg@smb1157441861      0      -  21.1G  -
tank/users/cjg@smb1157000000  40.5K      -  21.1G  -
tank/users/cjg@smb1157445557  40.5K      -  21.1G  -
tank/users/cjg@smb2006-09-05-12:03  40.5K      -  21.1G  -
tank/users/cjg@smb2006-09-05-18:27      0      -  21.1G  -
tank/users/keg@smb2006-09-05-18:29      0      -   465M  -
tank/users/rlg@smb1157441373      0      -   673M  -
tank/users/rlg@smb1157446766      0      -   675M  -
tank/users/rlg@smb1157449795    21K      -   675M  -
tank/users/rlg@smb2006-09-05-17:14      0      -   675M  -
tank/users/rlg@smb2006-09-05-17:54      0      -   675M  -
tank/users/rlg@smb2006-09-05-18:07      0      -   675M  -
tank/users/stc@smb1157437923      0      -   294M  -
tank/users/stc@smb1157446971      0      -   294M  -
tank/users/stc@smb2006-09-05-15:34      0      -   294M  -
tank/users/stc@smb2006-09-05-17:47      0      -   294M  -
tank/users/stc@smb2006-09-05-20:27      0      -   294M  -

from which you can see I experimented with naming them with the seconds from the epoch to make the clean up script simpler. However after a few minutes I realized there was a better way.

I now have a clean up script that uses the zfs file system creation time to do all the sorting. Getting this to work quickly requires a one-line Tcl script to convert the time stamp into seconds from the epoch:

puts [clock scan $argv ]

Call the script “convert2secs” and then the rest of the script is simple:

#!/bin/ksh -p
#       Quick script to clean up the snapshots created by each samba login.
#       See:
#       It is clear that this could be much more generic. Especially if you
#       could add a property to the snapshot to say when it should be deleted.
ALL_TYPES="smb minute hour day month boot"

#       Each type in ALL_TYPES needs its own NUMBER_OF_SNAPSHOTS_<type> and
#       DAYS_TO_KEEP_<type> pair; only the pair for "hour" is shown here.
NUMBER_OF_SNAPSHOTS_hour=$((7 * 24 * 2))
DAYS_TO_KEEP_hour=$((7 * 24))

today=$(convert2secs $(date))

function do_fs
{
        typeset fs
        typeset -i count=0
        typeset -i seconds2keep
        typeset -i time2go
        typeset -i number_of_snapshots
        typeset type=$2
        # days2keep and number_of_snapshots should come from
        # file system properties. Until then they are fed from the
        # global entities.
        days2keep=$(eval echo \${DAYS_TO_KEEP_${type}})
        number_of_snapshots=$(eval echo \${NUMBER_OF_SNAPSHOTS_${type}})

        seconds2keep=$(( days2keep * 24 * 60 * 60 ))
        time2go=$((today - seconds2keep))

        # Newest snapshots first, so count tracks how many newer ones exist.
        for fs in $(zfs list -r -t snapshot -o name $1 | grep $type | sort -r -t @ -k 1)
        do
                snap_time=$(convert2secs $(/usr/sbin/zfs list -H -o creation ${fs}))

                # Destroy only when both the minimum count and the age are exceeded.
                if (( count > number_of_snapshots )) && \
                        (( snap_time < time2go ))
                then
                        zfs destroy $fs
                else
                        : echo $fs is kept for $((snap_time - time2go)) seconds
                fi
                let count=count+1
        done
}

for type in ${ALL_TYPES}
do
        for i in $(zfs list -H -t snapshot -r $@ | sort | nawk -F '@' '/'$type'/ { print $1 }' | uniq)
        do
                do_fs $i $type
        done
done

When zfs has user defined options all the configuration can be kept in the file system but until then the configuration variables will do.

The script allows me to have different classes of snapshot: smb, minute, hour, day, month and boot. This allows the same script to clean up both the snapshots taken by samba and the ones taken via cron and boot.

The script errs on the side of not destroying snapshots, so for each class I'm keeping all snapshots less than a certain number of days old and also keeping a minimum number of snapshots.
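The rule boils down to a two-part test: a snapshot is only a candidate for destruction when more than the minimum number of newer snapshots exist and it is older than the cut-off. A minimal sketch of that decision in isolation, with a made-up function name and made-up numbers:

```shell
# should_destroy COUNT MIN_KEEP SNAP_TIME CUTOFF
# Succeeds only when COUNT exceeds MIN_KEEP *and* the snapshot was taken
# before CUTOFF (both times in seconds since the epoch).
# The function name and the figures below are purely illustrative.
should_destroy() {
    [ "$1" -gt "$2" ] && [ "$3" -lt "$4" ]
}

# Old snapshot, but not enough newer ones yet: kept.
should_destroy 2 3 100 200 && echo destroy || echo keep    # keep
# Plenty of newer snapshots, but still inside the age window: kept.
should_destroy 5 3 300 200 && echo destroy || echo keep    # keep
# Plenty of newer snapshots and older than the cut-off: destroyed.
should_destroy 5 3 100 200 && echo destroy || echo keep    # destroy
```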


(Table of the minimum number of snapshots and the number of days to keep snapshots for each class; the hour class uses 7 * 24 * 2 and 7 * 24, and 28 * 2 and 60 * 24 appear among the other entries.)


The advantage is that I can now both keep the snapshots longer and also give them more user friendly names. The new snapshot cron job script is here. I'm sure the number of snapshots generated is overkill but while I have the disk space why not?

Now if I can stop smb mangling the names all would be perfect.


Monday Sep 04, 2006

Samba meets ZFS

There has been much progress on the new server at home which I will write up later. Today I'll dig into what I have done to make samba and ZFS play well together. As I mentioned, getting Samba running was easy. However there is a bit more that you can do to make ZFS and Samba work even better together.

Why not have zfs take a snapshot whenever you log in to the PC? So in addition to the regular snapshots I also get one of the home directory of each user when they log in.

Just add this line to the [homes] section in the smb.conf:

root preexec = ksh -c '/usr/sbin/zfs snapshot tank/users/%u@smb$(/tank/local/smbdate)'

Then you need the smbdate script to print a good date. You can't just use the date command directly as Samba expands the % entries before they are passed to date. Hence I wrap it in a script:

#!/bin/ksh -p
exec date +%F-%R

This results in snapshots like this each time a user logs in on the PC:

# zfs list | grep smb
tank/users/cjg@smb2006-09-04-22:53      0      -  21.1G  -
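To see how that name is put together without involving Samba or ZFS, here is a small sketch. Samba replaces %u with the user name before running the command, and the date format is the one from the smbdate wrapper; the function name smb_snapshot_name is invented for the illustration.

```shell
# smb_snapshot_name USER: print the snapshot name the preexec line would
# create for USER right now. The function name is hypothetical; the
# date format is the one used by the smbdate wrapper.
smb_snapshot_name() {
    printf 'tank/users/%s@smb%s\n' "$1" "$(date +%F-%R)"
}

smb_snapshot_name cjg
```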

At some point a script to clean up the old snapshots will be needed.


Friday Sep 01, 2006

Home server progress

Progress on the new home server:

All the user accounts have been created each with their own ZFS file system for their home directory.

I've installed my ZFS snapshot script and have crontab entries like this:

10,20,30,40,50 * * * * /tank/local/snapshot minute
0 * * * * /tank/local/snapshot hour
1 1 * * * /tank/local/snapshot day
2 1 1 * * /tank/local/snapshot month

I think there is some scope for improvement here which would mean keeping the snapshots for longer. When the proposal to have user defined properties becomes real I will define the snapshot rules in the file systems and have the script use those rather than a one size fits all.

I have samba running from SMF thanks to Trevor's manifest and script. I did the changes to the script suggested in the comments. This all worked perfectly and now the PC flies and that is before I install gigabit Ethernet in it. Already you can see the snapshot directories under .zfs in each of the samba file systems on the PC which is just about as cool as it can get.

Finally I have solaris serving dhcp and have turned off the server on the Qube. Most uncharacteristically I used the GUI tool to configure dhcp and, apart from having to create another ZFS file system to put the data in, the GUI did it all. Very slick. Plus by some magic it managed to hand out to each computer the same IP address that the Qube used to. I suspect I should have done DNS before DHCP since the DHCP server can update the DNS records, so this may have to be done again.


Thursday Aug 31, 2006

New home server arrived

The new server arrived and like a good geek I stayed up late last night putting it together and loading the Solaris Operating System on it. So far I've not got that far. A base install with mirrored root file system, plus a second boot environment for live upgrade and the rest of the disk(s) are there for real data on ZFS.

Laying out the disks was harder than it should have been due to me wanting to put all the non ZFS bits at the end of the disk not the beginning so that when we have a complete ZFS on root solution I can delete the two root mirrors and allow ZFS to grow over the whole drive. So on the disks the vtoc looks like this:

Part      Tag    Flag     Cylinders         Size            Blocks
  0 unassigned    wm   37368 - 38642        9.77GB    (1275/0/0)   20482875
  1 unassigned    wu       0                0         (0/0/0)             0
  2     backup    wm       0 - 38909      298.07GB    (38910/0/0) 625089150
  3 unassigned    wu       0                0         (0/0/0)             0
  4       root    wu   36093 - 37367        9.77GB    (1275/0/0)   20482875
  5 unassigned    wu   38643 - 38645       23.53MB    (3/0/0)         48195
  6 unassigned    wu   38646 - 38909        2.02GB    (264/0/0)     4241160
  7       home    wm       3 - 36092      276.46GB    (36090/0/0) 579785850
  8       boot    wu       0 -     0        7.84MB    (1/0/0)         16065
  9 alternates    wu       1 -     2       15.69MB    (2/0/0)         32130

Which looks even worse than it really is. On the disk starting at the lowest LBA I have:

  1. The boot blocks on slice 8

  2. Then alternates on slice 9 (format just gives you these for “free”)

  3. The Zpool on slice 7

  4. The second boot environment on slice 4

  5. The first boot environment on slice 0

  6. The metadbs on slice 5

  7. A dump device on slice 6

All the partitions, except the dump device, are mirrored onto the other disk so both drives have the same vtoc. As you can see I can grow the zpool over both boot environments and the metadbs when ZFS on root is completely here.
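The on-disk order is easier to check mechanically than by eye. The sketch below sorts the slices in the table above by starting cylinder, skipping empty slices and the whole-disk backup slice. The function name slice_order is made up, and it assumes the column layout shown above.

```shell
# slice_order: read a partition table in the layout shown above on stdin
# and print the slice numbers in on-disk order (lowest starting cylinder
# first). Empty slices and the "backup" slice are skipped.
# The function name is hypothetical.
slice_order() {
    awk '$1 ~ /^[0-9]+$/ && $2 != "backup" && $NF > 0 { print $4, $1 }' |
        sort -n |
        awk '{ printf "%s%s", sep, $2; sep = " " } END { print "" }'
}

slice_order <<'EOF'
Part      Tag    Flag     Cylinders         Size            Blocks
  0 unassigned    wm   37368 - 38642        9.77GB    (1275/0/0)   20482875
  1 unassigned    wu       0                0         (0/0/0)             0
  2     backup    wm       0 - 38909      298.07GB    (38910/0/0) 625089150
  3 unassigned    wu       0                0         (0/0/0)             0
  4       root    wu   36093 - 37367        9.77GB    (1275/0/0)   20482875
  5 unassigned    wu   38643 - 38645       23.53MB    (3/0/0)         48195
  6 unassigned    wu   38646 - 38909        2.02GB    (264/0/0)     4241160
  7       home    wm       3 - 36092      276.46GB    (36090/0/0) 579785850
  8       boot    wu       0 -     0        7.84MB    (1/0/0)         16065
  9 alternates    wu       1 -     2       15.69MB    (2/0/0)         32130
EOF
# prints: 8 9 7 4 0 5 6
```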

The next thing to do will be SAMBA.


Saturday Aug 26, 2006

New home server ordered

Finally a replacement for my Qube has been ordered. Based around this bare bones system with 2G of RAM and a pair of 300G disks.

Since one of the Qubes successfully lost all of its data due to ext2 file system mayhem this week (thankfully I have backups), I can't wait to be using better file systems and mirroring software. That Qube is now not even booting, so rather than mess with the recovery CD it is time to move into the 21st century.

The goals for the system are to:

  1. Provide Email Service

  2. Act as a Windows File server

  3. DHCP server

  4. DNS server

  5. Small Web service

  6. Single user Sun Ray server

Moving email and data over will be quick as I am naturally nervous about the data on the Qube. The other services will be staged more slowly.


Friday Jul 21, 2006

Why not use a raid controller to do mirroring or other kind of RAID underneath a zpool?

I got asked this today.

The ZFS manual says it is not recommended:

ZFS works best when given whole physical disks. Although constructing logical devices using a volume manager, such as Solaris Volume Manager (SVM), Veritas Volume Manager (VxVM), or a hardware volume manager (LUNs or hardware RAID) is possible, these configurations are not recommended. While ZFS functions properly on such devices, less-than-optimal performance might be the result.

Disks are identified both by their path and by their device ID, if available. This method allows devices to be reconfigured on a system without having to update any ZFS state. If a disk is switched between controller 1 and controller 2, ZFS uses the device ID to detect that the disk has moved and should now be accessed using controller 2. The device ID is unique to the drive's firmware. While unlikely, some firmware updates have been known to change device IDs. If this situation happens, ZFS can still access the device by path and update the stored device ID automatically. If you inadvertently change both the path and the ID of the device, then export and re-import the pool in order to use it.

So why not use that raid controller?

One reason is that you are preventing ZFS from recovering from a disk that returns the wrong or bad data. Since ZFS would be presented with a single device if it detects a checksum error it has no way to ask the raid controller to read the other side of the mirror or to recalculate Xor from a RAID 5 array. If on the other hand you let ZFS have direct control of the disks if one supplies bad data for any reason ZFS can read from the redundant copy.

I'm left wondering about controllers with write cache, where performance would be a factor. That, however, will have to wait.


Thursday May 25, 2006

Living with ZFS

A few things are coming together as I live with ZFS on my laptops and the external USB drive. The first and most surprising is that I don't like interacting with the disks on the Qube 3 anymore. No snapshot to roll back to if I make a mistake? I honestly was not expecting that I would get so used to the safety of snapping everything all the time.

The external drive, after much investigation of some issues, is working flawlessly. It claims that its write cache is off and behaves as such. As I have mentioned before it is partitioned with an EFI label into 3 partitions. Partitions 0 and 1 act as mirrors for my two laptops; I online them when I can and let them resilver. The third partition is a zpool containing file systems that hold backups of everything else, including the data on the Qube 3.

Things I have learnt about using ZFS:

  1. Don't put your file systems in the “root” of a pool. Administratively it makes life easier if you group them. So all my users' home directories are in “home/users” and not directly in “home”. It allows you to set properties on “home/users” that are inherited and so affect all the users' home directories. This is not dissimilar to the old advice of never sharing the root of a file system via NFS. If you did, it severely limited your options to change things in the future. The NFS advice is of course no longer valid with ZFS since there is no longer the one to one connection between file systems and volumes.

  2. If you are taking lots of snapshots make sure you either have lots of disk space or have a way to clear the old snapshots. My snapping every minute does this whereas the snapping on boot did not.

  3. If you even suspect that your pool may be imported on another system use a unique name. I'm not even sure what happens if you have two pools with the same name but just in case it is bad make your pools unique.

Things I would like:

  1. Ability to delegate creation of file systems and snapshots to users. RBAC lets me do this but leaves some nasties that require scripts. One is that the user cannot create any files in the file system without running chmod or chown on it. Essentially the same as this request on the OpenSolaris ZFS forum.

  2. Booting from ZFS and live upgrade to be fully integrated. I know it is coming but once you have your root file system on ZFS it seems a great shame to have to keep two UFS boot environments around for upgrade purposes. Especially as they are not even compressed (another thing I have just got used to).


Friday Apr 21, 2006

External usb disk drive

If I had not got caught out again by the M2 hanging during boot this would have been a breeze. Instead I was to struggle off and on for hours before finding the answer on my own blog. Doh.

Anyway I have bought a 160G USB 2.0 disk drive so that I can back up my laptops and have some extra space for things that it would be nice to keep but not on the cramped internal drives. Looks like a nice bit of kit in its fanless enclosure.

Plugged the drive in and pointed zpool at the device as seen by volume manager and I now have a pool that lives on this disk with lots of file systems on it. I can see a need for a script to run zfs backup on each local file system redirected to a file system on the external box.

1846 # zfs list -r removable
removable             20.7G   131G  12.5K  /removable
removable/bike        4.88G   131G  4.88G  /removable/bike
removable/nv           137M   131G   137M  /removable/nv
removable/principia   9.50K   131G  9.50K  /removable/principia
removable/scratch        9K   131G     9K  /removable/scratch
removable/sigma       5.14G   131G  9.50K  /removable/sigma
removable/sigma/home  5.14G   131G  9.50K  /removable/sigma/home
removable/sigma/home/cjg  5.14G   131G   597M  /removable/sigma/home/cjg
removable/sigma/home/cjg/pics  4.55G   131G  4.55G  /removable/sigma/home/cjg/pics
removable/sigma_backup   556M   141G   556M  -
removable/users        586M   131G  9.50K  /removable/users
removable/users/cjg    586M   131G   586M  /removable/users/cjg
1847 #

Exporting the pool and then reimporting it on the other laptop all works as expected which is good and I hope is going to allow me to do the live upgrade from the OS image on that drive so it does not have to get slurped over the internet twice.

I did over achieve and manage to crash one system, as plan A was to have a zvol for each laptop on the disk and use that as a backup mirror for the internal drive which could be offlined when not in use. Alas this just hung and appears to be a known issue with putting zpools inside zpools. However the talk that USB storage and zfs don't work together does not appear to be close to the truth.


Tuesday Apr 18, 2006

ZFS root file system

As Tabriz has pointed out you can now do “boot and switch” to get yourself a ZFS root file system, so to give this a bit of a work out I flipped our build system to use it. It has a compressed root file system now.

: FSS 1 $; df -h /
Filesystem             size   used  avail capacity  Mounted on
tank/rootfs             19G   2.9G    16G    16%    /
: FSS 2 $;

The instructions Tabriz gives are slightly different if you are using live upgrade to keep those old UFS boot environments in sync. For a start if you use a build 37 BE that is not currently the one you are booted off for the source of your new zfs root file system then you don't have to do all the steps creating mount points and /devices.

So steps 6, 7 and 9 distil to:

  • lumount a build 37 archive on /a:

    lumount -n b37 /a

  • Copy that archive into /zfsroot;

    # cd /a
    # find . -xdev -depth -print | cpio -pvdm /zfsroot

You do have to take greater care when updating the boot archive as that may not live on the currently booted boot environment, but apart from that it was a breeze. The system has been up for almost a week and I have a clone of a snapshot that is also bootable just in case I mess up the original. Doing that was as simple as taking the clone and editing /etc/vfstab and /etc/system in it to reflect its new name. Then building its boot archive.

# zfs list
tank                  2.90G  15.8G     9K  /tank
tank/rootfs           2.90G  15.8G  2.89G  legacy
tank/rootfs@works     2.38M      -  2.77G  -
tank/rootfs@daytwo    1.71M      -  2.88G  -
tank/rootfs@daythree  1.89M      -  2.88G  -
tank/rootfs@dayfour    576K      -  2.89G  -
tank/rootfs2            51K  15.8G  2.88G  legacy
tank/scratch          98.5K  15.8G  98.5K  /tank/scratch
tank/scratch@x            0      -  98.5K  -

Whilst many of the features that get released by a ZFS root file system are easy to predict, the beauty of it in action is something else.


Tuesday Mar 07, 2006

When you have lost your data.....

Today was not the first time, and won't be the last time, that someone contacted me after losing a load of data. Previous incidents have included when my partner was writing a letter on my Ultra1 at home and my daughter decided that the system would run much better without power.

On rebooting the file was zero length. This letter had taken days to craft so simply starting again was really low on the list of priorities. The file system on which it had lived was UFS and I could access the system over the network (I was on a business trip at the time), there were no backups...

The first thing about any situation like this is not to panic. You need to get the partition that used to contain the data quiesced so that no further damage is done. Unmount it if you can. Then take a backup of it using dd:

dd if=/dev/rdsk/c0t0d0s4 of=copy_of_slice bs=128k

Now you use copy_of_slice to go looking for the blocks. If the file is a text file then you can use strings and grep to search for blocks that may contain your data. Specifically:

strings -a -t d < copy_of_slice | grep "text in document"

This outputs the strings that contain “text in document”, each preceded by its byte offset. You use these offsets to read the blocks.

73152472 moz-abmdbdirectory://abook.mab
136142157 fc9roaming.default.files.abook.mab
136151743 7moz-abmdbdirectory://abook.mab
136151779 !6fb-moz-abmdbdirectory://abook.mab

I use a shell function like this for a file system with an 8K block size:

function readblock
{
        # $1 is a byte offset; integer division turns it into an 8K block number
        dd bs=8k count=1 iseek=$(($1 / 8192)) < slice7
}

The copy of the slice in this case was a file called slice7. To get a block:

$ readblock 73152472

Then you have to use your imagination and skill to put the blocks back together. In my case the letter was recovered, sent and had the desired outcome.
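
Several hits often land in the same block, so it can help to turn the offsets into a unique list of block numbers before reading anything. A minimal sketch, assuming the 8K block size used above; offsets_to_blocks is a hypothetical helper name and not part of the original recovery:

```sh
# Read "offset text" lines (the output of strings -t d) on stdin and
# print each distinct block number once. $1 is the block size in bytes
# (default 8192).
offsets_to_blocks() {
    awk -v bs="${1:-8192}" '{ print int($1 / bs) }' | sort -n | uniq
}

# The four hits shown earlier fall into three blocks.
offsets_to_blocks 8192 <<'EOF'
73152472 moz-abmdbdirectory://abook.mab
136142157 fc9roaming.default.files.abook.mab
136151743 7moz-abmdbdirectory://abook.mab
136151779 !6fb-moz-abmdbdirectory://abook.mab
EOF
# prints 8929, 16618 and 16620, one per line
```

Each of those block numbers is what the dd in readblock ends up seeking to.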

Today's example is not looking so good. Firstly the victim had actually run suninstall over the drive and had no backup (stop giggling at the back), which had relabelled the drive and then run newfs on the partition. Then when the dd was run the output file was written onto the same disk, so if the label did not match more damage was done. I might suggest that he run over the drive and then throw it into the pond just to make life interesting. It's a pity, since only the super blocks would have been written, so the chances of recovery were not that bad.

So to recap. Don't get in this situation. Backup everything. Use ZFS, use snapshots, lots of them.

However if you have lost your data and want to stand any chance of getting it back:

  1. Don't Panic.

  2. Quiesce the file system. Powering off the system may well be your best option.

  3. Get a bit for bit copy of the disk that had the data. All slices. Do this while booted off release media.

  4. Hope you are lucky.


Thursday Dec 15, 2005

Letting users create ZFS file systems

Darren has just posted his fast bringover script that solves some of my desire to be able to have a file system per workspace. I'm not commenting on the script since it manages to trip one of my shell script peeves, that of calling a program and then calling exit $?. What is wrong with exec? I'll keep taking the tablets.
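
For anyone wondering what the peeve is about, here is a small sketch of the two styles. The wrapper_exit and wrapper_exec names are made up for the illustration, and the inner sh -c "exit 3" simply stands in for whatever program the wrapper runs last:

```sh
# Style 1: run the program, then pass its exit status on by hand.
# The wrapper shell stays around waiting for the program to finish.
wrapper_exit() {
    sh -c 'sh -c "exit 3"; exit $?'
}

# Style 2: exec replaces the wrapper shell with the program, so the
# program's exit status is the wrapper's exit status automatically.
wrapper_exec() {
    sh -c 'exec sh -c "exit 3"'
}

rc=0; wrapper_exit || rc=$?
echo "exit-style status: $rc"
rc=0; wrapper_exec || rc=$?
echo "exec-style status: $rc"
```

Both report the same status; the exec form just gets there with one process fewer and one line less.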

However it does not solve my wanting to let users create their own ZFS file systems below a file system that they own.

Like I said in the email this can mostly be done via an RBAC script, well here it is:

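#!/bin/ksh -p

# Re-run under pfexec so that the zfs and chown calls get the
# privileges granted by the exec_attr entry.
if [ "$_" != "/usr/bin/pfexec" -a -x /usr/bin/pfexec ]; then
        exec /usr/bin/pfexec $0 "$@"
fi

# Print the numeric uid of the owner of the path given as $1.
function get_owner
{
	ls -dln "$1" 2>/dev/null | nawk '{ print $3 }'
}

# Find the file system mounted on $PARENT, create the new one beneath
# it, hand it to the user and copy the parent's quota.
function create_file_system
{
	typeset mpt name quota

	zfs list -H -t filesystem -o mountpoint,name,quota | \
		 while read mpt name quota
	do
		if [[ $mpt == $PARENT ]]
		then
			zfs create ${DIR#/} && chown $uid $DIR && \
				zfs set quota=${quota} ${DIR#/}
			exit $?
		fi
	done
	echo no zfs file system $PARENT >&2
	exit 1
}

# Count the file systems whose mount point is owned by the caller.
function check_quota
{
	typeset -i count=0
	typeset mpt name

	zfs list -H -t filesystem -o mountpoint,name | while read mpt name
	do
		if [[ $(get_owner $mpt) == $uid ]]
		then
			let count=count+1
		fi
	done
	echo $count
}

# Assumed default: the original assignment is missing from this copy.
MAX_FILE_SYSTEMS_PER_USER=10
test -f /etc/default/zfs_user_create && . /etc/default/zfs_user_create

if [[ $# -ne 1 ]]
then
	echo "Usage: $0 filesystem" >&2
	exit 1
fi

# Assumed definitions: the original assignments are missing from this
# copy. DIR is the requested mount point, PARENT the directory above it.
DIR=$1
PARENT=${DIR%/*}

if ! [[ -d $PARENT ]]
then
	echo "$0: Failed to make directory \"$1\"; No such file or directory" >&2
	exit 1
fi

uid=$(id | sed -e s/uid=// -e 's/(.*//')
owner=$(get_owner $PARENT)

if [[ $uid != $owner ]]
then
	echo "$0: $1 not owner" >&2
	exit 1
fi

if [[ $(check_quota) -gt ${MAX_FILE_SYSTEMS_PER_USER} ]]
then
	echo "too many file systems" >&2
	exit 1
fi

# Assumed final step: the call that does the work is missing from this copy.
create_file_system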


It has a hack in it to limit the number of file systems that a user can create just to stop them being silly. Then you just need the line in /etc/security/exec_attr:


Now any user can create a file system under a file system they already own. The file systems don't share a single quota which would be nice but for my purposes this will do.
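
The script leans on the default ZFS layout, where a file system's mount point is its dataset name with a slash in front. A small sketch of the two parameter expansions involved; the parent_of and dataset_of helpers and the example path are hypothetical, and deriving the parent with ${DIR%/*} is my assumption about how PARENT is set:

```sh
# Strip the last path component: the directory the user must own.
parent_of() {
    echo "${1%/*}"
}

# Strip the leading slash: the name that gets handed to "zfs create".
# This only holds while mount points follow the default /pool/name layout.
dataset_of() {
    echo "${1#/}"
}

DIR=/tank/fs/users/ann/ws1      # hypothetical request
echo "parent directory: $(parent_of "$DIR")"
echo "dataset name:     $(dataset_of "$DIR")"
```

If a file system has had its mountpoint property changed, the dataset name can no longer be derived this way and the script would need to take the name from the zfs list output instead.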

The next trick is to let them destroy them and take snapshots of them. The snapshots are the real reason I want all of this.


Thursday Dec 01, 2005

ZFS snapshots meet automounter and have a happy time together

There is a thread going over on zfs-interest where the following question was posed by biker (I live in hope that this is really “cyclist”):

How can I make a snapshot of home in zfs containing the data including the stuff within the user homes (home/ann, home/bob) - like a recursive snapshot.

The only way so far I could think of was
 - copy the directory structure (home/ann, home/bob) to /snap
 - initiate a snapshot of every dataset (home/ann, home/bob)
 - mount each snapshot to the counterpart under /snap
 - run the backup
 - remove the mounts
 - release the snapshots
 - clear /snap

If there is something like a recursive snapshot or user and group quota in the classical sense, the effort needed could be minimized, ...

It got me thinking that in the absence of a real solution this should be doable with a script. For the recursive backup script we have:

#!/bin/ksh -p

# $1 is the snapshot name to give to every file system
for filesystem in $(zfs list -H -o name -t filesystem)
do
        zfs snapshot $filesystem@$1
done

No prizes there, but what biker wanted was a copy of the file system structure. The problem is that all those snapshots are each under the individual file system's .zfs/snapshot directory, so are spread about.

If only we could mount all of them under one tree? By adding this line to /etc/auto_master:

/backup /root/auto_snapshot

and then this script as /root/auto_snapshot:

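#!/bin/ksh -p

# $1 is the map key, here the snapshot name. The two fs= assignments are
# reconstructed assumptions; the original lines are missing from this copy.
for filesystem in $(zfs list -H -o name -t filesystem)
do
        if [[ -d /$filesystem/.zfs/snapshot/$1 ]]
        then
                fs=${filesystem#*/}                     # drop the pool name
                [[ ${fs} = ${filesystem} ]] && fs=""    # the pool itself goes at the top
                ANS="${ANS:-}${ANS:+ }/$fs localhost:/$filesystem/.zfs/snapshot/$1"
        fi
done
echo $ANS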
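
The automounter runs an executable map with the key as its only argument and takes whatever is printed as the map entry, so the script has to print one multi-mount entry covering every file system. A rough sketch of that string building, fed from a plain list of names so it can be tried without a pool; build_map_entry is a hypothetical name and the handling of the pool root is my assumption:

```sh
# Read file system names on stdin and print a multi-mount map entry
# for the snapshot named in $1.
build_map_entry() {
    snap=$1
    entry=""
    while read -r filesystem; do
        fs=${filesystem#*/}                 # path below the pool
        if [ "$fs" = "$filesystem" ]; then  # no slash: this is the pool itself
            fs=""
        fi
        entry="${entry}${entry:+ }/$fs localhost:/$filesystem/.zfs/snapshot/$snap"
    done
    echo "$entry"
}

printf 'mypool\nmypool/rlg\n' | build_map_entry fullbackup
# prints: / localhost:/mypool/.zfs/snapshot/fullbackup /rlg localhost:/mypool/rlg/.zfs/snapshot/fullbackup
```

Each pair in the entry is a mount point relative to /backup/<key> followed by the location to mount there.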

Suddenly I can do this:

1071 # ./backup fullbackup
1072 # (cd /backup/fullbackup ; tar cf /mypool/root/backup.tar . )
1073 # df -h
Filesystem             size   used  avail capacity  Mounted on
/dev/dsk/c0d0s0        7.3G   6.7G   574M    93%    /
/devices                 0K     0K     0K     0%    /devices
ctfs                     0K     0K     0K     0%    /system/contract
proc                     0K     0K     0K     0%    /proc
mnttab                   0K     0K     0K     0%    /etc/mnttab
swap                   829M   704K   829M     1%    /etc/svc/volatile
objfs                    0K     0K     0K     0%    /system/object
                       7.3G   6.7G   574M    93%    /lib/
fd                       0K     0K     0K     0%    /dev/fd
swap                   829M   136K   829M     1%    /tmp
swap                   829M    44K   829M     1%    /var/run
mypool                 9.5G   5.8G   2.9G    67%    /mypool
mypool/jfg             9.5G     8K   2.9G     1%    /mypool/jfg
mypool/keg             9.5G    16M   2.9G     1%    /mypool/keg
mypool/rlg             9.5G     8K   2.9G     1%    /mypool/rlg
mypool/stc             9.5G    14M   2.9G     1%    /mypool/stc
/mypool/cg13442        8.8G   5.8G   2.9G    67%    /home/cg13442
                       8.8G   3.9G   4.8G    45%    /backup/fullbackup
                       4.8G     8K   4.8G     1%    /backup/fullbackup/rlg
                       4.8G     8K   4.8G     1%    /backup/fullbackup/jfg
                       4.8G    14M   4.8G     1%    /backup/fullbackup/stc
1074 #

The tar backup file now contains the whole of the “fullbackup” snapshot and, apart from the snapshot not really being atomic since each file system is snapped in sequence, this pretty much does what is wanted.

If you were really brave/foolish you could have the automounter executable maps generate the snapshots for you, but that would be a recipe for filling the pool with snapshots. Deleting the snapshots is also a snip:

#!/bin/ksh -p

# $1 is the name of the snapshot to remove from every file system
for filesystem in $(zfs list -H -o name -t filesystem)
do
        zfs destroy $filesystem@$1
done



This is the old blog of Chris Gerhard. It has mostly moved to

