Tuesday May 20, 2008

Recovering OpenSolaris root and grub

Twenty years ago I fat-fingered my workstation. Back then the mistake was using "newfs -n" rather than "newfs -N"; shortly afterwards the system panicked, unable to find i-node 2, and I had to restore. Restoring a diskless 3/50 running SunOS 3.5 was only made less painful by the fact that I used to dump (when dump was dump, not ufsdump) the network disk partitions of all the clients every night. Sorry, I digress, but it is still burned into my memory.

Today I fat-fingered my laptop. I managed to type a script at the command prompt that ran with root privileges and ended up doing "cd ../.. ; rm -r *", which took it to /. I stopped it, but the damage was already done. However, since I am running OpenSolaris and had taken the precaution of a recursive snapshot of the pool after I last installed any software, all I had to do was roll back the file systems.

Specifically I had to roll back:

  • rpool - it contains the grub menu, or at least it does now that I have recovered it.

  • rpool/ROOT - I'm not sure I had to roll this one back, but it seemed wise.

  • rpool/ROOT/opensolaris - yes, the real root file system. The lack of a kernel was certain to cause problems.

  • rpool/ROOT/opensolaris/opt - again I'm not sure I had to roll this one back, but it seems best to keep the root file systems consistent.
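
Rolling them back was just one zfs rollback per file system. A sketch of what that looked like (the snapshot name "preinstall" is an assumption; substitute whatever your recursive snapshot was called):

# Sketch only: roll each file system back to the last recursive snapshot.
# "preinstall" is a made-up snapshot name.
pfexec zfs rollback rpool@preinstall
pfexec zfs rollback rpool/ROOT@preinstall
pfexec zfs rollback rpool/ROOT/opensolaris@preinstall
pfexec zfs rollback rpool/ROOT/opensolaris/opt@preinstall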

I need to bring those auto snapshot scripts over.

I'll get around to describing what I was doing later. It was very cool; well, quite cool.

Thursday May 08, 2008

ZFS at Aberystwyth

I just arrived back from Aberystwyth, where at the invitation of the Campus Ambassador I was giving a talk on ZFS to the students and staff.

The thing about giving ZFS presentations and demonstrations is that they are really good fun. You don't have to overplay the point with gimmicks, however fun they are; or at least you don't if your audience is technical. For a start, having disks fail gracefully (even if this is due to a hammer) is not new; the Solaris Volume Manager could cope with that. OK, it did not have double-parity RAID, but again that is not the really cool part of ZFS. That is just cool.

The coolest part of ZFS is the ability to recover from silent data corruption. The demo where you dd over one side of a mirror, or over one disk in a RAIDZ, or even two disks in a RAIDZ2, and ZFS survives and recovers: that is really cool.
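
If you want to try the corruption demo yourself without a hammer, a minimal sketch using file-backed vdevs (so no real disks get hurt) goes something like this:

# Sketch of the silent-corruption demo using files as vdevs.
mkfile 128m /var/tmp/d1 /var/tmp/d2
pfexec zpool create demo mirror /var/tmp/d1 /var/tmp/d2
pfexec cp /usr/dict/words /demo
# Scribble over most of one side of the mirror...
pfexec dd if=/dev/urandom of=/var/tmp/d1 bs=1024k count=64 conv=notrunc
# ...then scrub; ZFS spots the bad checksums and repairs from the good side.
pfexec zpool scrub demo
zpool status -v demo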

The other things that caused a noticeable drawing of breath from the audience were rollback and data sharing via snapshot and clone.

Faced with those things it is hard to overstate the power of having those snapshots always available under the .zfs directory and the beautiful simplicity of the administration.

For an added bonus I ran the whole presentation and demo while booted from the OpenSolaris live CD, demonstrating that if you want to play with this stuff you don't actually need to install Solaris. The only problem I had was that I could not get the nvidia driver to accept that the projector was capable of more than 640x520.

I really enjoyed the session and from the comments afterwards so did many others.

Thursday Apr 03, 2008

Work IT catching up with home IT

At long last my home directory in the office has caught up with my home directory at home and the one on my laptop, and now lives on ZFS. Even better, the admins have delegated snapshot privileges for my home directory to me. So now I have a script that snapshots my home directory every time I insert my smart card:

#!/bin/ksh -p

now=$(date +%F-%T)

exec mkdir $HOME/.zfs/snapshot/user_snap_$now

This is then called using utaction:

utaction -c ~/bin/sh/snap 

utaction is in turn started automatically via the session magic that gnome does (Preferences -> Sessions -> Start Up Programs).


You will notice that I use mkdir to create the snapshot. This is great as it allows me to run the script on an NFS client, but it does prevent me from doing a recursive snapshot, which I would like if I had other file systems.

Update. I just realised that my nautilus script is now useful at work. Cool.

Friday Dec 21, 2007

Recovering a zpool from it's iscsi mirror

While I have stopped using iSCSI to back up my laptops, as zfs send and receive have proved to be more reliable and convenient, my Dell laptop broke while I was still using it. So the only backup of the bits on its disk lives on my server in a zvol. Also, since the data is now so old, I don't really need to restore it at all. However, it would be rude not to try. So first, following the instructions from my original post, I made the iSCSI target available:

: sigma TS 2 $; pfexec zpool import
  pool: home
    id: 9959281504147327308
 state: DEGRADED
status: One or more devices contains corrupted data.
action: The pool can be imported despite missing or damaged devices.  The
        fault tolerance of the pool may be compromised if imported.
   see: http://www.sun.com/msg/ZFS-8000-4J
config:

        home                                       DEGRADED
          mirror                                   DEGRADED
            c0d0s7                                 FAULTED  corrupted data
            c4t0100001731F649B400002A0046264D39d0  ONLINE
: sigma TS 3 $; pfexec zpool import home  
cannot mount '/export/home': directory is not empty
: sigma TS 4 $;  bin/sh/zfs_send-r home/users export/users  
zfs send home/users@day_13 |zfs receive export/users
zfs send -i home/users@day_13 home/users@day_16 |zfs receive export/users
zfs send -i home/users@day_16 home/users@day_21 |zfs receive export/users
zfs send -i home/users@day_21 home/users@day_28 |zfs receive export/users
zfs send -i home/users@day_28 home/users@day_03 |zfs receive export/users
zfs send -i home/users@day_03 home/users@day_10 |zfs receive export/users
zfs send -i home/users@day_10 home/users@hour_07 |zfs receive export/users
zfs send -i home/users@hour_07 home/users@hour_01 |zfs receive export/users
zfs send -i home/users@hour_01 home/users@day_25 |zfs receive export/users
zfs send -i home/users@day_25 home/users@hour_02 |zfs receive export/users
zfs send -i home/users@hour_02 home/users@hour_03 |zfs receive export/users
zfs send -i home/users@hour_03 home/users@hour_04 |zfs receive export/users
zfs send -i home/users@hour_04 home/users@hour_05 |zfs receive export/users
zfs send -i home/users@hour_05 home/users@hour_06 |zfs receive export/users
zfs send -i home/users@hour_06 home/users@hour_10 |zfs receive export/users
zfs send -i home/users@hour_10 home/users@hour_23 |zfs receive export/users
zfs send -i home/users@hour_23 home/users@hour_08 |zfs receive export/users
zfs send -i home/users@hour_08 home/users@hour_16 |zfs receive export/users
zfs send -i home/users@hour_16 home/users@hour_09 |zfs receive export/users
zfs send -i home/users@hour_09 home/users@hour_22 |zfs receive export/users
zfs send -i home/users@hour_22 home/users@hour_17 |zfs receive export/users
zfs send -i home/users@hour_17 home/users@hour_15 |zfs receive export/users

Nice to know it worked, and I still have all the snapshots going back to the dawn of time; well, 2005 anyway.
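
For the curious, the zfs_send-r helper is little more than a loop over the snapshots in creation order, sending the oldest as a full stream and the rest as incrementals. A sketch of the idea (not the actual script):

#!/bin/ksh -p
# Sketch only: replay a file system and all of its snapshots onto a new
# file system, oldest first, the rest as incrementals.
src=$1          # e.g. home/users
dst=$2          # e.g. export/users

prev=
zfs get -Hrpo name,value creation $src | grep "^$src@" | sort -n -k 2 |
while read snap junk
do
        echo "zfs send ${prev:+-i $prev }$snap |zfs receive $dst"
        zfs send ${prev:+-i $prev} $snap | zfs receive $dst
        prev=$snap
done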

Wednesday Oct 10, 2007

They was robbed. (ZFS under the hood)

I managed to get along to hear Jason Banham and Jarod Nash doing their "ZFS under the hood" (or rather bonnet) presentation as the last breakout before the CEC party. This was, as expected, a "deep dive" not for the faint-hearted, but they covered the material so clearly and concisely that I can say this was the best breakout at a CEC I have ever attended. A must for any aspiring kernel engineer.

I understand why they did not get the prize for the best presentation, since some people even left before the end, but for those who already have an idea of how a file system works and are really interested in the internals of ZFS this is a brilliant talk.

One thing to be careful of though: they gave you a chocolate if you asked a question, so sit at the front. That way, if anyone around you asks a question, you are less likely to be hit by the chocolate, as Jarod tended to throw them across the room.

Thursday Sep 06, 2007

nautilus meets zfs snapshots

After ZFS saved the day earlier in the week I wanted to get to the stage where the email to me was not required, at least if the user is on a Solaris system.

So I've updated my zfs_versions script, which you will recall prints out all the versions of a file that exist on a ZFS file system. The new script has an additional flag so that it can better support a nautilus script: you highlight a file and it will list all the versions of that file; not all the snapshots, but all the distinct versions of the file.

Choose the show_versions script. In a fit of recursion you can see that the example is for the show_versions script itself.

Finally select the version you want:

It will then open a nautilus window in the directory that contains the version of the file. What is more it even works over NFS.

You need 2 scripts:

  1. zfs_versions, which must be in your path without the .tcl suffix.

  2. show_versions. This has to be stored in the .gnome2/nautilus-scripts directory in your home directory.

Make both scripts mode 755 and then run “nautilus -q”.
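
For reference, the guts of a nautilus script like show_versions can be very small. The following is a sketch only, not my actual script; it assumes zenity is available for the list dialogue and relies on the NAUTILUS_SCRIPT_SELECTED_FILE_PATHS variable that nautilus sets for scripts:

#!/bin/ksh -p
# Sketch of a show_versions style nautilus script (not the real one).
# nautilus passes the selected files, one per line, in this variable.
file=$(echo "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS" | head -1)
[ -z "$file" ] && exit 0

# zfs_versions prints one path for each snapshot holding a distinct version.
version=$(zenity --list --title "Versions of $file" --column Version \
        $(zfs_versions "$file"))

# Open a nautilus window on the snapshot directory containing that version.
[ -n "$version" ] && nautilus "$(dirname "$version")"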

Things to note.

It identifies a file by its path, so if you rename a file you will not see revisions with the old name. It would be really cool if there were a way to get all the versions of a file without resorting to an exhaustive search of the file system, but there is not.

I don't know how to make this available to all users without messing in each ~/.gnome2 directory. If you do, then add a comment.

Hat tip to Sandip for alerting me to nautilus scripts.

Monday Sep 03, 2007

ZFS saves the day

I just got this email [1] from a user of my home server:
I have need of your rinky dinky back up system.
There is a file in My_Documents\Correspondence called "XXXXXX" which I have overwritten somehow in error. Can you get the orginal back? I created it last week (Monday I think) and have not updated it since, except to overwrite it today!!

Well it turns out I can:


# /home/cjg/bin/sh/zfs_versions 'XXXXXX.odt' 
/home/user/.zfs/snapshot/day_2007-08-27-01:01/My_Documents/Correspondence/XXXXXX.odt
/home/user/.zfs/snapshot/hour_2007-09-03-12:00/My_Documents/Correspondence/XXXXXX.odt
/home/user/.zfs/snapshot/minute_2007-09-03-12:30/My_Documents/Correspondence/XXXXXX.odt
/home/user/.zfs/snapshot/minute_2007-09-03-12:50/My_Documents/Correspondence/XXXXXX.odt
/home/user/.zfs/snapshot/hour_2007-09-03-13:00/My_Documents/Correspondence/XXXXXX.odt
# 

Choose your version.


[1] I've anonymized the email.

Wednesday Jun 13, 2007

Home server back to build 65

My home server has been taking a bit of a battering of late. I keep tripping over bug 6566921, which I can work around by not running my zfs_backup script locally. I have an updated version which sends the snapshots over an ssh pipe to a remote system, in my case my laptop. Obviously this just moves the panic from my server to the laptop, but that is a very much better state of affairs. I'm currently building a fixed zfs module which I will test later.
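
The change amounts to piping the send stream through ssh rather than into a local receive, along these lines (host and dataset names are made up):

# Illustrative only: stream an incremental snapshot to a pool on another host.
zfs send -i tank/fs/safe@day_12 tank/fs/safe@day_13 | \
        ssh laptop pfexec /usr/sbin/zfs receive -F removable/safe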

However, the final straw that had me revert to build 65 is that smbd keeps dumping core. Having no reliable access to their data caused the family more distress than you would expect. This turns out to be bug 6563383, which should be fixed in build 67.

Thursday Jun 07, 2007

zfs_versions of a file

This post reminds me that I should have posted my zfs_versions script a while back. I did not, as it had a theoretical bug: if a new file were moved into place with the same age as the old file, it would not be seen as a new version. I've fixed that now, at the expense of some performance, and the script is called zfs_versions.

The script lists all the different versions of a file that live in the snapshots.
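
The idea is simple enough to sketch in a few lines. This is not the real script; it assumes the file lives under your home directory, that $HOME is the root of a ZFS file system, and it is happy with the shell's globbing order:

#!/bin/ksh -p
# Sketch only: print each snapshot copy of a file that differs from the
# previously printed one. Globbing order is alphabetical, not snapshot
# creation order.
rel=$1                          # path relative to $HOME, e.g. .profile
prev=
for copy in "$HOME"/.zfs/snapshot/*/"$rel"
do
        [ -f "$copy" ] || continue
        if [ -z "$prev" ] || ! cmp -s "$prev" "$copy"
        then
                echo "$copy"
                prev=$copy
        fi
done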

Here is the output:

: pearson FSS 124 $; /home/cjg/bin/sh/zfs_versions  ~/.profile 
/tank/fs/users/cjg/.zfs/snapshot/month_09/.profile
/tank/fs/users/cjg/.zfs/snapshot/smb2007-03-27-15:54/.profile
/tank/fs/users/cjg/.zfs/snapshot/minute_2007-06-06-17:30/.profile
: pearson FSS 125 $; 

Compare this to the number of snapshots:


: pearson FSS 128 $; ls -1 ~/.zfs/snapshot/*/.profile | wc -l
     705
: pearson FSS 129 $; 

So I have 705 snapshots that contain my .profile file but actually only three of them contain different versions.

Update

The addition of the check to fix the theoretical bug slows down the script enough that the programmer in me could not let it lie. Hence I now have the same thing in TCL.

Update 2

See http://blogs.sun.com/chrisg/entry/new_wish_and_tickle

Friday Jun 01, 2007

Rolling incremental backups

Why do you take back ups?

  • User error

  • Hardware failure

  • Disaster Recovery

  • Admin Error

With ZFS using redundant storage and plenty of snapshots my data should be safe from the first two. However that still leaves two ways all my data could be lost if I don't take some sort of back up.

Given the constraints I have, my solution is to use my external USB disk containing a standalone zpool, and then use zfs send and receive via this script to send all the file systems I really care about to the external drive.

To make this easier I have put all the file systems into another "container" file system which has the "nomount" option set, so it is hidden from the users. I can then recursively send that file system to the external drive. Also, to stop the users getting confused by extra file systems appearing and disappearing, I have set the mount point on the external disk to "none".

The script only uses the snapshots that are prefixed "day" (you can change that with the -p (prefix) option), which reduces the amount of work the script does. Backing up the snapshots that happen every 10 minutes on this system does not seem worthwhile for a job I will run once a day or so.

The really cool part of this is that once I had the initial snapshot on the external drive every back up from now on will be incremental. A rolling incremental backup. How cool is that.

# time ./zfs_backup tank/fs/safe removable/safe

real    12m10.49s
user    0m11.87s
sys     0m12.32s
# zfs list tank/fs/safe removable/safe
NAME             USED  AVAIL  REFER  MOUNTPOINT
removable/safe  78.6G  66.0G    18K  none
tank/fs/safe    81.8G  49.7G    18K  /tank/fs
# 

The performance is slightly disappointing, due to the number of transport errors reported by the usb2scsa layer, but the data is really on the disk so I am happy.

Currently I don't have the script clearing out the old snapshots but will get that going later. The idea of doing this over ssh to a remote site is compelling when I can find a suitable remote site.
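
For the curious, the core of the rolling incremental is tiny. The following is a sketch rather than the actual zfs_backup script; it assumes you pass in the newest "day" snapshot already present on the backup pool, and it skips all the option handling and sanity checks:

#!/bin/ksh -p
# Sketch of a rolling incremental backup (not the real zfs_backup).
src=$1          # e.g. tank/fs/safe
dst=$2          # e.g. removable/safe
last=$3         # newest snapshot already on $dst, e.g. day_2007-05-31

new=day_$(date +%F)
zfs snapshot -r $src@$new

# Send every file system under $src incrementally from $last to $new.
for fs in $(zfs list -rHo name $src)
do
        zfs send -i @$last $fs@$new | zfs receive -F $dst${fs#$src}
done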

Thursday May 03, 2007

How many disks should a home server have (I'm sure that was a song).

My previous post failed to answer all the questions.

Specifically how many disks should a home server contain?

Now I will gloss over the obvious answer of zero, that all your data should be on the net managed by a professional organisation, not least because I would not trust someone else with my photos however good they claim to be. Also, any self-respecting geek will have a server at home with storage, which at the moment means spinning rust.

Clearly you need more than one disk for redundancy, and you have already worked out that the only sensible choice for a file system is ZFS; you really don't want to lose your data to corruption. It is also reasonable to assume that this system will have six drives or fewer. At the time of writing you can get a Seagate 750GB SATA drive for £151.56 including VAT, or a 320GB drive for £43.99.

Here is the table showing the number of disks that can fail before you suffer data loss:

Number of disks   Mirror   Mirror + hot spare   RaidZ   RaidZ + hot spare   RaidZ2   RaidZ2 + hot spare
2                 1        N/A                  N/A     N/A                 N/A      N/A
3                 N/A      2*                   1       N/A                 N/A      N/A
4                 1**      N/A                  1       2*                  2        N/A
5                 N/A      2*                   1       2*                  2        3
6                 1**      N/A                  1       2*                  2        3

* To not suffer data loss, the second drive must not fail while the hot spare is being resilvered.

** This is the worst case, where both disks that form one mirror fail. It is possible to lose more than one drive and keep the data.

Richard has some more numbers about mean time before data loss and the performance of various configurations from a more commercial point of view, including a 3 way mirror.

Now let's look at how much storage you get:

Number of disks of size X GB   Mirror   Mirror + hot spare   RaidZ   RaidZ + hot spare   RaidZ2   RaidZ2 + hot spare
2                              X        N/A                  N/A     N/A                 N/A      N/A
3                              N/A      X                    2X      N/A                 N/A      N/A
4                              2X       N/A                  3X      2X                  2X       N/A
5                              N/A      2X                   4X      3X                  3X       2X
6                              3X       N/A                  5X      4X                  4X       3X

The power consumption will be pretty much proportional to the number of drives, as will the noise and the purchase cost. For the Seagate drives I looked at, the power consumption was identical for the 300GB and 750GB drives.

Since my data set would easily fit on a 320GB disk, and that was the most economic choice at the time of purchase, I chose the two-way mirror. Also, raidz2 was not available then.

If I needed the space offered by 2X or more disks I would choose RaidZ2, as that gives the best redundancy.
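
For instance, the six-disk RaidZ2 with one hot spare from the last row of the tables is a single zpool create (the device names here are made up):

# Five disks in a double-parity raidz2 vdev plus one hot spare.
zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 spare c1t5d0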

So the answer to the question is “it depends” but I hope the above will help you to understand your choices.


Tuesday May 01, 2007

Choosing a server for home

I get asked often enough how I chose my home system, and why I chose a single-CPU system with just two disks, that I'm going to put down the thought process here.

My priorities were in order:

  1. Data integrity.

    The data is important: my photographs, the kids' homework, and the correspondence the family members have, both letters and emails.

  2. Total cost of ownership.

    I have a realistic expectation that I will write the system off over 5 or more years, so the cost of the power it uses is important.

  3. Quantity of disk space.

    I needed at least 60G to move off the Qubes; adding a bit of a safety margin, 300G should keep us going for 5 years (unless some new technology like a helmet camera starts eating through the disk as if it cost nothing).

  4. Noise.

    It sits in a room where I work. I am used to using a Sun Ray which is silent so having it as quiet as possible is the ideal.

  5. Physical size.

    It was to replace a pair of Qube 3s, which were beautifully small and sat on a window ledge, so a tower was not practical.

Storage

Given these constraints there was just one choice of file system, ZFS, so the system had to support at least two disks so that they could be mirrored. The cost of running more than two disks, combined with the fact that 300G drives are affordable and with the form factor of the box, made having more disks less appealing, even though more would have been faster. More spindles == more performance, mostly.

I'm not regretting this decision, despite the fact that the system can be extremely unresponsive when doing a live upgrade. Since both drives contain mirrors of both boot environments, the disk performance is terrible when running lumake to copy one boot environment to the other, as the heads seek back and forth; the same is true during the install, as I have the DVD image on the same disks. Putting the image on the external USB drive does help, but the problem is not really bad enough that I bother. Having four disks would have mitigated this. I'm hopeful that when we get native command queuing support in the sata driver this will improve slightly. Having root on ZFS should eventually eliminate the lumake copy step, as that will be a clone.

Mother Board

The choice of motherboard was driven by the desire to support 4G of RAM, so that if the system turned into a Sun Ray server (which it has) I would have a reasonable amount of memory for two or three users, and by the need for SATA disks, since the price/performance of those drives fitted my storage requirements. Obviously it had to be able to boot from these drives.

There was no need for good graphics, but two network ports had to be available, so if it came with one on board that would be ideal. Gigabit networking would also help. I put all of those variables in and one of my blogless colleagues suggested the ASUS M2NPV-VM, which was built around chipsets that the then-current release of OpenSolaris should support. The only exception was the Nvidia graphics driver, which at that time was not available; however, since I did not need graphics, that was not an issue. It has an on-board gigabit network port, so even with the addition of a second network card there are still free slots if I need them.

CPU

The choice of CPU was based on cost and the knowledge that Casper's powernow driver does not support multiple CPUs or multi-core CPUs. To get under the budget I had, I chose the AMD Athlon 64 3500+ (Socket AM2), which will run at 17.8 Watts when running at 1GHz.

I know of people who are successfully using the following CPUs with powernow on this motherboard; however, this does not constitute any kind of guarantee:

CPU                   Earliest BIOS version known to work
AMD Athlon 64 3800+   0705 01/02/2007 (my colleague thinks it first started working on 0303 firmware, but is not certain)
AMD Athlon 64 3500+   0705 01/02/2007
AMD Athlon 64 3000+   0705 01/02/2007

If you know of other CPUs that work with the PowerNow driver on this motherboard let me know and I will update the table.

The components not appearing in this blog

The system has a DVD RW, but that was chosen on the whim of whatever the supplier of the CPU, motherboard and case shipped. The thing I don't have that might surprise some is a tape drive. Since I state that the number one goal was data integrity, you would think a good backup would be a requirement. However, I have found that my external USB disk drive, combined with being able to back up ZFS snapshots to DVD (hint: set the quota on your file systems to less than the size of a DVD; it makes backups easier), means that while I know rebuilding would be very hard, I'm sure I have all the photographs safe. My children's homework has such a short life span that anything other than the snapshots is unlikely to help.
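
The quota hint is a one-liner per file system; for example (the file system name is made up):

# Keep each file system small enough for its backup stream to fit on a DVD.
zfs set quota=4g tank/fs/photos/2007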


Friday Apr 20, 2007

ZFS pool in an iSCSI ZVOL

My last post, about backing up the ZFS pool on my laptop to iSCSI targets exported from a server and backed by ZFS zvols, prompted a comment and also prompted me to think about whether this would be a worthwhile thing in the real world.

Initially I would say no; however, it does offer the tantalizing possibility of allowing the administrator of the system with the iSCSI targets to take backups of the pools without interfering with the contents of the pools at all.

It allows you to split the snapshots for users, which would all live in the client pool, from the snapshots for administrators, essentially for disaster recovery, which would all be in the server pool.

If the server went pop, the recovery would be to create a new server and then restore the zvol, which would then contain the whole client pool with all the client pool's snapshots. Similarly, if the client pool were to become corrupted, you could roll it back to a good state by rolling back the zvol in the server pool. Now, clearly the selling point of ZFS is an always-consistent on-disk format, so this is less of a risk than with other file systems (unless there are bugs); however, the belt-and-braces approach appeals to the latent sysadmin in me, who knows that the performance of a storage system that has lost your data is zero.
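
On the server this is just ordinary snapshot administration of the zvol; something like this sketch (the names are made up, and the client pool should be exported before any rollback):

# Take a server-side disaster recovery snapshot of the zvol backing the
# client pool.
zfs snapshot tank/iscsi/client@dr_2007-04-20

# Recovery is a rollback of the zvol (with the client pool exported),
# after which the client simply imports its pool again.
zfs rollback tank/iscsi/client@dr_2007-04-20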

I'm going to see if I can build a server like this to see how well it performs but that won't be for at least a few weeks.


Thursday Apr 19, 2007

NFS futures at LOSUG

We were privileged to have the legendary Calum Mackay talk at the London OpenSolaris User Group last night on the topic of NFS futures, including everything that is in the upcoming NFS v4.1 specifications:

  • Parallel NFS aka pNFS

  • Directory delegations

  • Sessions.

He also covered the rest of what is going on with the NFS v4.0 work, in particular the namespace work that he has been doing, which will provide mirror mount support and referrals.

Mirror mounts will change the way a client behaves when it encounters a directory that is a mount point for another file system on the server. For example given 2 file systems:

/export/home/cjg and /export/home/cjg/Documents

Currently, if you mount /export/home/cjg from a server using NFS v2 or v3 on an NFS client and look in the Documents directory, you see an empty directory; writing into it can cause loads of confusion and potentially more serious consequences.

However, with NFSv4 and mirror mounts you now see something different: the client automatically mounts the sub-file-systems without recourse to the automounter. This is kind of cool, as the layout I describe above is exactly what I want for my home directory. That way, when gnome or firefox or thunderbird goes pop and corrupts its start-up file, I can roll back to the last snapshot without it messing up my data in Documents.

Referrals have the potential to be as useful as symbolic links, and I suspect also the potential to be as dangerous. They allow you to move a file system onto another server while client applications continue to access the original path: the client kernel gets sent a referral and mounts the file system from the new location.

All in all an excellent evening.


Wednesday Apr 18, 2007

Backing up laptop using ZFS over iscsi to more ZFS

After the debacle of the reinstall, with my laptop's zpool having to be rebuilt and "restored" using zfs send and zfs receive, I thought I would look for a better backup method; one that did not involve being clever with partitions on an external USB disk that are "ready" for when the whole disk is using ZFS.

The obvious solution is a play on one I had played with before: store one half of the pool on another system. So welcome to iSCSI.

ZFS volumes can now be shared using iSCSI. So on the server create a volume with the "shareiscsi" property set to "on" and enable the iscsi target service:

# zfs get  shareiscsi tank/iscsi/pearson   
NAME                PROPERTY    VALUE               SOURCE
tank/iscsi/pearson  shareiscsi  on                  inherited from tank/iscsi
  
# svcadm enable  svc:/system/iscsitgt       
# 
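
For completeness, creating such a volume from scratch looks something like this (the 11GB size merely matches the slice being mirrored and is illustrative):

# Create the backing volume on the server and share it over iscsi.
zfs create -V 11g tank/iscsi/pearson
zfs set shareiscsi=on tank/iscsi/pearson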

Now on the client tell the iscsi initiator where the server is:


5223 # iscsiadm add discovery-address 192.168.1.20
5224 # iscsiadm list discovery-address            
Discovery Address: 192.168.1.20:3260
5225 # iscsiadm modify discovery --sendtargets enable
5226 # format < /dev/null
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0d0 <DEFAULT cyl 3791 alt 2 hd 255 sec 63>
          /pci@0,0/pci-ide@1f,1/ide@0/cmdk@0,0
       1. c10t0100001731F649B400002A004625F5BEd0 <SUN-SOLARIS-1-11.00GB>
          /scsi_vhci/disk@g0100001731f649b400002a004625f5be
Specify disk (enter its number): 
5227 # 

Now attach the new device to the pool. I can see some security would be a good thing here to protect my iscsi pool. More on that later.


5229 # zpool status newpool                                       
  pool: newpool
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: scrub completed with 0 errors on Wed Apr 18 12:30:43 2007
config:

        NAME        STATE     READ WRITE CKSUM
        newpool     ONLINE       0     0     0
          c0d0s7    ONLINE       0     0     0

errors: No known data errors
5230 # zpool attach newpool c0d0s7 c10t0100001731F649B400002A004625F5BEd0
5231 # zpool status newpool                                              
  pool: newpool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 0.02% done, 8h13m to go
config:

        NAME                                        STATE     READ WRITE CKSUM
        newpool                                     ONLINE       0     0     0
          mirror                                    ONLINE       0     0     0
            c0d0s7                                  ONLINE       0     0     0
            c10t0100001731F649B400002A004625F5BEd0  ONLINE       0     0     0

errors: No known data errors
5232 # 

The 8 hours to complete the resilver turns out to be hopelessly pessimistic and is quickly reduced to a more realistic, but still overly pessimistic, 37 minutes. All of this over what is only a 100Mbit ethernet connection from this host. I'm going to try this on the Dell, which has a 1Gbit network, to see if that improves things even further. (Since the laptop has just been upgraded to build 62 the pool "needs" to be upgraded. However, since an upgraded pool cannot be imported on earlier builds, I won't upgrade the pool version until both boot environments are running build 62 or above.)

I am left wondering how useful this could be in the real world. As a "nasty hack" you could have your ZFS-based NAS box serving out volumes to your clients, which then have zpools in them. Then on the NAS box you can snapshot and back up the volumes, which would actually give you a backup of the whole of each client pool, something many people want for disaster recovery reasons. Which is in effect what I have here.




Saturday Apr 07, 2007

recovering my laptop using zfs send and receive

First, I owe Windows an apology. It was not making itself the active partition; the grub problem was due to me copying the wrong entry in there after Solaris deleted it.

However, before I went on holiday grub decided it would cease working: it could not find the menu.lst on this laptop at all. I spent a bit of time investigating and failing to recover it, due, I think, to the laptop having two Solaris fdisk partitions, one holding the zpool and the other the boot partitions; installgrub and grub itself did not like this, though what pushed it over the edge I'm not sure. Perhaps some change in build 60.

Anyway, I decided to reinstall Solaris without the extra fdisk partition, which was a hangover from when the system had Linux on it as well. It should have been simple: ufsdump the current boot environment (BE), install the system, let the zpool resilver from the external USB disk, and ufsrestore the BE. The install only needed to be an end-user install as I was going to restore from backup anyway; strictly, I did not need to do the install at all, but it would sort out grub for me without too much fiddling. All would have been fine had I not decided to detach the internal disk from the zpool (after scrubbing the pool) prior to the install.

Once I had reinstalled the system I could not attach the new partition to the pool, as it was too small. This was all thanks to my "thinking ahead" when I created the USB partition: since eventually the internal partition will grow to be 30Gb, that was how big I made the external disk partition. As soon as I detached the smaller partition, the pool "grew" to fill the partition it had. Doh.

So now I had to fit a 30Gb pool into a 10Gb partition. Next time I won't detach the mirror first! Being in a hurry to go on holiday, I just knocked together a short script that takes a source file system and, using zfs send and zfs receive, copies all of its snapshots to the target file system. So, after taking a recursive snapshot of the pool, I ran the script, which copied the file systems into the new pool I created on the laptop. I then had to fix up a few of the attributes of the file systems that were copied. I'm not immediately sure how to handle this in a script, since some attributes are obvious (compression, sharenfs etc.) but others are less so (mountpoint). Even with attributes like compression you have a problem, in that zfs receive creates the file system with inherited attributes, so there is no way to set them before the file system is populated unless the inherited attribute happens to be correct. When I say no way, I mean no simple way. There clearly are ways, using temporary container file systems created with the right options for the new file system to inherit and then zfs rename to move the file system to the correct location; however, that was not required in my case and would no longer be a simple short script.


Monday Feb 26, 2007

Reverting to build 55

My home server is back at build 55. After reading the heads-up message about ZFS on build 58 I wasted no time in sending the system back to build 55, which was the last release I had installed prior to build 58. I miss the improvements in gnome that have turned up in the later releases, but data integrity trumps everything.

Hopefully I will be able to get to build 59 (which despite what the heads up says I am told should contain the fix) later this week.


Thursday Jan 18, 2007

A linux live CD with ZFS support

Now this I must try. http://partedmagic.com now claims to have ZFS support via FUSE on Linux.

I have now tried this. I wanted to show a screen shot, but when booted off the CD there is no network at the moment, so this is typed in by hand:

# /usr/sbin/zfs-fuse 

Then in another window:

# zpool import -f -d /dev -A /tmp/mnt mypool
# zfs mount -a

A few nits: zpool aborts before running zfs mount, and you have to specify /dev so it finds the disks, but hey, this is alpha.


Thursday Nov 23, 2006

A faster ZFS snapshot massacre

I moved the zfs snapshot script into the office and started running it on our build system. Being a cautious type when it comes to other people's data I ran the clean up script in “do nothing” mode so I could be sure it was not cleaning snapshots that it should not. After a while running like this we had over 150,000 snapshots of 114 file systems which meant that zfs list was now taking a long time to run.

So long, in fact, that the clean-up script was not actually making forward progress against snapshots being created every 10 minutes. So I now have a new clean-up script, functionally identical to the old one but a lot faster. Unfortunately I have now cleaned out the snapshots, so the times are not what they were (zfs list was taking 14 minutes); however, the difference is still easy to see.

When run with the option to do nothing the old script:

# time /root/zfs_snap_clean > /tmp/zfsd2

real    2m23.32s
user    0m21.79s
sys     1m1.58s
#

And the new:

# time ./zfs_cleanup -n > /tmp/zfsd

real    0m7.88s
user    0m2.40s
sys     0m4.75s
#

which is a result.


As you can see the new script is mostly a nawk script and more importantly only calls the zfs command once to get all the information about the snapshots:


#!/bin/ksh -p
#
# Copyright 2006 Sun Microsystems, Inc.  All rights reserved.
# Use is subject to license terms.
#
# CDDL HEADER START
#
# The contents of this file are subject to the terms of the
# Common Development and Distribution License, Version 1.0 only
# (the "License").  You may not use this file except in compliance
# with the License.
#
# You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
# or http://www.opensolaris.org/os/licensing.
# See the License for the specific language governing permissions
# and limitations under the License.
#
# When distributing Covered Code, include this CDDL HEADER in each
# file and include the License file at usr/src/OPENSOLARIS.LICENSE.
# If applicable, add the following below this CDDL HEADER, with the
# fields enclosed by brackets "[]" replaced with your own identifying
# information: Portions Copyright [yyyy] [name of copyright owner]
#
# CDDL HEADER END
#
#	Script to clean up snapshots created by the script from this blog
#	entry:
#
#	http://blogs.sun.com/chrisg/entry/cleaning_up_zfs_snapshots
#
#	or using the command given in this entry to create snapshots when
#	users mount a file system using SAMBA:
#
#	http://blogs.sun.com/chrisg/entry/samba_meets_zfs
#
#	Chris.Gerhard@sun.com 23/11/2006
#

PATH=$PATH:$(dirname $0)

while getopts n c
do
	case $c in
	n) DO_NOTHING=1 ;;
	\?) echo "$0 [-n] [filesystems]"
		exit 1 ;;
	esac
done
shift $(($OPTIND - 1))
if (( $# == 0))
then
	set - $(zpool list -Ho name)
fi


export NUMBER_OF_SNAPSHOTS_boot=${NUMBER_OF_SNAPSHOTS:-10}
export DAYS_TO_KEEP_boot=${DAYS_TO_KEEP:-365}

export NUMBER_OF_SNAPSHOTS_smb=${NUMBER_OF_SNAPSHOTS:-100}
export DAYS_TO_KEEP_smb=${DAYS_TO_KEEP:-14}

export NUMBER_OF_SNAPSHOTS_month=${NUMBER_OF_SNAPSHOTS:-24}
export DAYS_TO_KEEP_month=365

export NUMBER_OF_SNAPSHOTS_day=${NUMBER_OF_SNAPSHOTS:-$((28 * 2))}
export DAYS_TO_KEEP_day=${DAYS_TO_KEEP:-28}

export NUMBER_OF_SNAPSHOTS_hour=$((7 * 24 * 2))
export DAYS_TO_KEEP_hour=$((7))

export NUMBER_OF_SNAPSHOTS_minute=$((100))
export DAYS_TO_KEEP_minute=$((1))


zfs get -Hrpo name,value creation $@ | sort -r -n -k 2 |\
	nawk -v now=$(convert2secs $(date)) -v do_nothing=${DO_NOTHING:-0} '
function ttg(time)
{
	return (now - (time * 24 * 60 * 60));
}
BEGIN {
	time_to_go["smb"]=ttg(ENVIRON["DAYS_TO_KEEP_smb"]);
	time_to_go["boot"]=ttg(ENVIRON["DAYS_TO_KEEP_boot"]);
	time_to_go["minute"]=ttg(ENVIRON["DAYS_TO_KEEP_minute"]);
	time_to_go["hour"]=ttg(ENVIRON["DAYS_TO_KEEP_hour"]);
	time_to_go["day"]=ttg(ENVIRON["DAYS_TO_KEEP_day"]);
	time_to_go["month"]=ttg(ENVIRON["DAYS_TO_KEEP_month"]);
	number_of_snapshots["smb"]=ENVIRON["NUMBER_OF_SNAPSHOTS_smb"];
	number_of_snapshots["boot"]=ENVIRON["NUMBER_OF_SNAPSHOTS_boot"];
	number_of_snapshots["minute"]=ENVIRON["NUMBER_OF_SNAPSHOTS_minute"];
	number_of_snapshots["hour"]=ENVIRON["NUMBER_OF_SNAPSHOTS_hour"];
	number_of_snapshots["day"]=ENVIRON["NUMBER_OF_SNAPSHOTS_day"];
	number_of_snapshots["month"]=ENVIRON["NUMBER_OF_SNAPSHOTS_month"];
} 
/.*@.*/ { 
	split($1, a, "@");
	split(a[2], b, "_");
	if (number_of_snapshots[b[1]] != 0 &&
		++snap_count[a[1], b[1]] > number_of_snapshots[b[1]] &&
		time_to_go[b[1]] > $2) {
		str = sprintf("zfs destroy %s\n", $1);
		printf(str);
		if (do_nothing == 0) {
			system(str);
		}
	}
}'
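
One note if you want to run this: it calls a small convert2secs helper to turn the output of date into seconds since the epoch. The real helper is not reproduced here, but since the script only ever passes it the current time, a stand-in along these lines (an assumption, not the original) is enough:

#!/bin/ksh -p
# Stand-in for convert2secs: ignore the date(1) output passed as arguments
# and just print the current time in seconds since the epoch.
exec perl -e 'print time(), "\n"'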


Saturday Nov 11, 2006

ZFS snapshot massacre.

As the number of snapshots grows I started wondering how much space they are really taking up on the home server. This also pretty much shows how much data gets modified after initially being created. I would guess not much, as the majority of the data on the server is:

  1. Solaris install images. Essentially read only.

  2. Photographs.

  3. Music mostly in the form of iTunes directories.

Running this command line gets the result:

zfs get -rHp used $(zpool list -H -o name ) |\
nawk '/@/ && $2 == "used" { tot++; total_space+=$3 ;\
        if ( $3 == 0 ) { empty++ }} \
END { printf("%d snapshots\n%d empty snapshots\n%2.2f G in %d snapshots\n", tot, \
        empty, total_space/(1024^3), tot - empty ) }'
68239 snapshots
63414 empty snapshots
2.13 G in 4825 snapshots
: pearson TS 15 $; zfs get used $(zpool list -H -o name )
NAME  PROPERTY  VALUE  SOURCE
tank  used      91.2G  -
: pearson TS 16 $;

So I only have 2.13G of data saved in snapshots out of 91.2G of data; not really a surprising result. The biggest user of snapshot space is one file system, the one that contains planetcycling.org. As the planet gets updated every 30 minutes and the data is only indirectly controlled by me, I'm not shocked by this. I would expect the amount to stabilize over time as the system settles down, and to that end I will note the current usage:


zfs get -rHp used tank/fs/web |\
nawk '/@/ && $2 == "used" { tot++; total_space+=$3 ;\
        if ( $3 == 0 ) { empty++ }} \
END { printf("%d snapshots\n%d empty snapshots\n%2.2f G in %d snapshots\n", tot,
        empty, total_space/(1024^3), tot - empty ) }'
1436 snapshots
789 empty snapshots
0.98 G in 647 snapshots

All this caused me to look a bit harder at the zfs_snapshot_clean script I have, as it appeared to be keeping some really old snapshots from some of the classes, which I did not expect. Now, while the 68,000 snapshots were having no negative impact on the running of the system, it was not right. There were two issues. First, it was sorting the list of snapshots using the snapshot creation time, which was correct, but it was sorting in reverse order, which was not. Secondly, I was keeping a lot more of the hourly snapshots than I intended.


After fixing this and running the script (you can download it from here) there was a bit of a snapshot massacre, leading to far fewer snapshots:


zfs get -rHp used $(zpool list -H -o name ) |\
nawk '/@/ && $2 == "used" { tot++; total_space+=$3 ;\
        if ( $3 == 0 ) { empty++ }} \
END { printf("%d snapshots\n%d empty snapshots\n%2.2f G in %d snapshots\n", tot, \
        empty, total_space/(1024^3), tot - empty ) }'
25512 snapshots
23445 empty snapshots
2.20 G in 2067 snapshots

Only 25,000 snapshots; much better, and most of them remain empty.


About

This is the old blog of Chris Gerhard. It has mostly moved to http://chrisgerhard.wordpress.com
