Tuesday May 27, 2008

Liveupgrade UFS -> ZFS

It took a bit of work, but I managed to persuade my old laptop to Live Upgrade to Nevada build 90 with a ZFS root. First I upgraded to build 90 on UFS and then created a boot environment (BE) on ZFS. The reason for the two-step approach was to reduce the risk a bit. Bear in mind this is all new in build 90 and I am not an expert on the inner workings of Live Upgrade, so there are no guarantees.
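
For reference, the ZFS step amounts to something like this (a sketch only; tank is the pool name that appears later, but the slice handed to zpool create is invented for illustration):

# zpool create tank c0d0s3      # create the root pool on a spare slice (slice name invented)
# lucreate -n zfs90 -p tank     # copy the running UFS BE into the ZFS pool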

The upgrade failed at the last minute with these errors:

ERROR: File </boot/grub/menu.lst> not found in top level dataset for BE <zfs90>
ERROR: Failed to copy file </boot/grub/menu.lst> from top level dataset to BE <zfs90>
ERROR: Unable to delete GRUB menu entry for boot environment <zfs90>.
ERROR: Cannot make file systems for boot environment <zfs90>.

This bug has already been filed (6707013 LU fail to migrate root file system from UFS to ZFS)

However, lustatus said all was well, so I tried to activate it:

# lustatus
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
ufs90                      yes      yes    yes       no     -         
zfs90                      yes      no     no        yes    -         
# luactivate zfs90
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE <ufs90>
ERROR: No such file or directory: cannot stat </etc/lu/ICF.2>
ERROR: cannot use </etc/lu/ICF.2> as an icf file
ERROR: Unable to mount the boot environment <zfs90>.
#

No joy. Can I mount it?

# lumount -n zfs90
ERROR: No such file or directory: cannot open </etc/lu/ICF.2> mode <r>
ERROR: individual boot environment configuration file does not exist - the specified boot environment is not configured properly
ERROR: cannot access local configuration file for boot environment <zfs90>
ERROR: cannot determine file system configuration for boot environment <zfs90>
ERROR: No such file or directory: error unmounting <tank/ROOT/zfs90>
ERROR: cannot mount boot environment by name <zfs90>
# 

With nothing to lose I copied the ICF file for the UFS BE and edited it to look like what I suspected an ICF file for a ZFS BE would look like (the fields appear to be BE name, mount point, device, file system type and size). I got lucky: I was right!

# ls /etc/lu/ICF.1
/etc/lu/ICF.1
# cat  /etc/lu/ICF.1
ufs90:/:/dev/dsk/c0d0s7:ufs:19567170
# cp  /etc/lu/ICF.1  /etc/lu/ICF.2
# vi  /etc/lu/ICF.2
# cat /etc/lu/ICF.2

zfs90:/:tank/ROOT/zfs90:zfs:0
# lumount -n zfs90                
/.alt.zfs90
# df
/                  (/dev/dsk/c0d0s7   ): 1019832 blocks   740833 files
/devices           (/devices          ):       0 blocks        0 files
/dev               (/dev              ):       0 blocks        0 files
/system/contract   (ctfs              ):       0 blocks 2147483616 files
/proc              (proc              ):       0 blocks     9776 files
/etc/mnttab        (mnttab            ):       0 blocks        0 files
/etc/svc/volatile  (swap              ): 1099144 blocks   150523 files
/system/object     (objfs             ):       0 blocks 2147483395 files
/etc/dfs/sharetab  (sharefs           ):       0 blocks 2147483646 files
/dev/fd            (fd                ):       0 blocks        0 files
/tmp               (swap              ): 1099144 blocks   150523 files
/var/run           (swap              ): 1099144 blocks   150523 files
/tank              (tank              ):24284511 blocks 24284511 files
/tank/ROOT         (tank/ROOT         ):24284511 blocks 24284511 files
/lib/libc.so.1     (/usr/lib/libc/libc_hwcap1.so.1): 1019832 blocks   740833 files
/.alt.zfs90        (tank/ROOT/zfs90   ):24284511 blocks 24284511 files
/.alt.zfs90/var/run(swap              ): 1099144 blocks   150523 files
/.alt.zfs90/tmp    (swap              ): 1099144 blocks   150523 files
# luumount zfs90
# luactivate zfs90
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE <ufs90>
diff: /.alt.tmp.b-svc.mnt/etc/lu/synclist: No such file or directory

Generating boot-sign for ABE <zfs90>
ERROR: File </etc/bootsign> not found in top level dataset for BE <zfs90>
Generating partition and slice information for ABE <zfs90>
Boot menu exists.
Generating direct boot menu entries for PBE.
Generating xVM menu entries for PBE.
Generating direct boot menu entries for ABE.
Generating xVM menu entries for ABE.
No more bootadm entries. Deletion of bootadm entries is complete.
GRUB menu default setting is unaffected
Done eliding bootadm entries.

**********************************************************************

The target boot environment has been activated. It will be used when you 
reboot. NOTE: You MUST NOT USE the reboot, halt, or uadmin commands. You 
MUST USE either the init or the shutdown command when you reboot. If you 
do not use either init or shutdown, the system will not boot using the 
target BE.

**********************************************************************

In case of a failure while booting to the target BE, the following process 
needs to be followed to fallback to the currently working boot environment:

1. Boot from Solaris failsafe or boot in single user mode from the Solaris 
Install CD or Network.

2. Mount the Parent boot environment root slice to some directory (like 
/mnt). You can use the following command to mount:

     mount -Fufs /dev/dsk/c0d0s7 /mnt

3. Run <luactivate> utility with out any arguments from the Parent boot 
environment root slice, as shown below:

     /mnt/sbin/luactivate

4. luactivate, activates the previous working boot environment and 
indicates the result.

5. Exit Single User mode and reboot the machine.

**********************************************************************

Modifying boot archive service
Activation of boot environment <zfs90> successful.
#

Fixing boot sign

# file /etc/bootsign
/etc/bootsign:  ascii text
# cat  /etc/bootsign
BE_ufs86
BE_ufs90
# vi  /etc/bootsign
# lumount -n zfs90 /a
/a
# cat /a/etc/bootsign
cat: cannot open /a/etc/bootsign: No such file or directory
# cat /a/etc/bootsign
cat: cannot open /a/etc/bootsign: No such file or directory
# cp /etc/bootsign /a/etc
# vi  /a/etc/bootsign 
# cat /a/etc/bootsign
BE_zfs90
# 
# luumount /a
# luactivate ufs90
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE <ufs90>
Activating the current boot environment <ufs90> for next reboot.
The current boot environment <ufs90> has been activated for the next reboot.
# luactivate zfs90
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE <ufs90>
diff: /.alt.tmp.b-hNc.mnt/etc/lu/synclist: No such file or directory

Generating boot-sign for ABE <zfs90>
Generating partition and slice information for ABE <zfs90>
Boot menu exists.
Generating direct boot menu entries for PBE.
Generating xVM menu entries for PBE.
Generating direct boot menu entries for ABE.
Generating xVM menu entries for ABE.
No more bootadm entries. Deletion of bootadm entries is complete.
GRUB menu default setting is unaffected
Done eliding bootadm entries.

**********************************************************************

The target boot environment has been activated. It will be used when you 
reboot. NOTE: You MUST NOT USE the reboot, halt, or uadmin commands. You 
MUST USE either the init or the shutdown command when you reboot. If you 
do not use either init or shutdown, the system will not boot using the 
target BE.

**********************************************************************

In case of a failure while booting to the target BE, the following process 
needs to be followed to fallback to the currently working boot environment:

1. Boot from Solaris failsafe or boot in single user mode from the Solaris 
Install CD or Network.

2. Mount the Parent boot environment root slice to some directory (like 
/mnt). You can use the following command to mount:

     mount -Fufs /dev/dsk/c0d0s7 /mnt

3. Run <luactivate> utility with out any arguments from the Parent boot 
environment root slice, as shown below:

     /mnt/sbin/luactivate

4. luactivate, activates the previous working boot environment and 
indicates the result.

5. Exit Single User mode and reboot the machine.

**********************************************************************

Modifying boot archive service
Activation of boot environment <zfs90> successful.
# lumount -n zfs90 /a
/a
# cat /a/etc/bootsign
BE_zfs90
# luumount /a
# init 6

The system now booted off the ZFS pool. Once it was up I just had to see if I could create a second ZFS BE as a clone of the first and, if so, how fast that was.

# df /
/                  (tank/ROOT/zfs90   ):23834562 blocks 23834562 files

# lustatus
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
ufs90                      yes      no     no        yes    -         
zfs90                      yes      yes    yes       no     -         
# time lucreate -p tank -n zfs90.2
Checking GRUB menu...
System has findroot enabled GRUB
Analyzing system configuration.
Comparing source boot environment <zfs90> file systems with the file 
system(s) you specified for the new boot environment. Determining which 
file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Updating system configuration files.
Creating configuration for boot environment <zfs90.2>.
Source boot environment is <zfs90>.
Creating boot environment <zfs90.2>.
Cloning file systems from boot environment <zfs90> to create boot environment <zfs90.2>.
Creating snapshot for <tank/ROOT/zfs90> on <tank/ROOT/zfs90@zfs90.2>.
Creating clone for <tank/ROOT/zfs90@zfs90.2> on <tank/ROOT/zfs90.2>.
Setting canmount=noauto for </> in zone <global> on <tank/ROOT/zfs90.2>.
No entry for BE <zfs90.2> in GRUB menu
Population of boot environment <zfs90.2> successful.
Creation of boot environment <zfs90.2> successful.

real    0m38.40s
user    0m6.89s
sys     0m11.59s
# 

38 seconds to create a BE, something that would take over an hour with UFS.

I'm not brave enough to do the home server yet, so that is still on nv90 with UFS. When the bug is fixed I'll give it a go.

Tuesday Mar 18, 2008

When is it a good idea to modify an underlying mirror?

Following on from “When to run fsck” and “When to run quotacheck” here is another:

When to modify the individual submirrors that make up a mirrored volume?

Answer: Never.

With the logical volume manager in Solaris you can build a mirror from two submirrors:

# metastat d0
d0: Mirror
    Submirror 0: d10
      State: Okay         
    Submirror 1: d11
      State: Okay         
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 20482875 blocks (9.8 GB)

d10: Submirror of d0
    State: Okay         
    Size: 20482875 blocks (9.8 GB)
    Stripe 0:
        Device   Start Block  Dbase        State Reloc Hot Spare
        c1d0s0          0     No            Okay   Yes 


d11: Submirror of d0
    State: Okay         
    Size: 20482875 blocks (9.8 GB)
    Stripe 0:
        Device   Start Block  Dbase        State Reloc Hot Spare
        c5d0s0          0     No            Okay   Yes 


Device Relocation Information:
Device   Reloc  Device ID
c1d0   Yes      id1,cmdk@AST3320620AS=____________3QF09GL1
c5d0   Yes      id1,cmdk@AST3320620AS=____________3QF0A1QD
# 

So here we have the mirror “d0” made up of the submirrors “d10” and “d11”. Each of these devices can be addressed in the file system as /dev/md/rdsk/d0, /dev/md/rdsk/d10 and /dev/md/rdsk/d11 respectively. The block devices are also available if you so desire. While being able to address the underlying devices that make up a mirror is interesting and potentially useful, it is only useful if you really know what you are doing.

Reading from the submirrors is OK. Writing, and that includes just mounting the file system, is not. So if the device is idle you can do:


# cmp /dev/md/rdsk/d10 /dev/md/rdsk/d11


#

Which, if it reports no differences1, gives you a feeling of confidence; although if you are this paranoid, and I am, then ZFS is a much better bet.


For example, if the mirror contains a file system then mounting one side of the mirror and making modifications is a really, really bad idea, even if the mirror itself is unmounted. Once you have made such a modification you would have to make sure the other side of the mirror had exactly the same change propagated to it at the block level. Realistically the only way to achieve that is to detach the other submirror and then reattach it so that it resyncs, along the lines sketched below. If you really know what you are doing there are tricks you could play, but I suspect those who really know what they are doing would not get into this mess in the first place.
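
The recovery is along these lines, using the metadevice names from the metastat output above (a sketch only, assuming d10 is the side that was modified behind the mirror's back):

# metadetach d0 d11     # detach the submirror that did not get the change
# metattach d0 d11      # reattach it; it is resynced from the remaining submirror
# metastat d0           # wait until the resync completes before trusting the mirror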



1 If it does not, look at how the mirror was constructed before you start to worry. If you did “metainit d0 -m d10 d11”, or have grown the metadevice, then the submirrors will never have been fully brought into sync, so only the blocks that have been written since the operation will compare correctly. Hence this is nothing to worry about. See, I told you you really do have to know what you are doing.

Wednesday Dec 27, 2006

When to run quotacheck?

Not quite as often as seeing someone run fsck on a live UFS file system and then regretting it, but often enough, someone will run quotacheck on a live file system and be surprised by the results. As usual the clue is in the manual page for quotacheck:

     quotacheck expects each file system to be checked to have  a
     quota  file  named  quotas in the root directory. If none is
     present, quotacheck will not check the file system.

     quotacheck accesses the character special device  in  calcu-
     lating  the  actual disk usage for each user. Thus, the file
     systems that are checked should be  quiescent  while  quota-
     check is running.


The first paragraph implies that the file system must be mounted (and it must be). The second implies that it must be quiescent.


So when can you run quotacheck?


In single-user mode. Mount the file system and then run it. If you are using UFS logging you should never need to run it at all, provided you manage your users correctly; that is to say, you create a user's quota before they can create any file in the file system. If you want to retrospectively add quotas then you have to drop to single-user mode, run quotacheck, then boot multi-user, as sketched below.
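
Something like this (a sketch only; the mount point /export/home is just an example, and it assumes a quotas file already exists in the root of the file system):

# init s                        # drop to single-user mode
# mount /export/home            # mount the file system to be checked
# quotacheck -v /export/home    # rebuild the quotas file from the actual usage
# quotaon /export/home          # turn quotas on
# init 3                        # back to multi-user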


Once you have quotas enabled and the system is up and running, the kernel keeps track of the quotas, so you don't need to check them; and, as in the fsck case, if you do check them you will just introduce a corruption.


Suddenly the ZFS model of a quota per file system and a file system per user seems like a much better way.
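
For comparison, that model is a couple of commands in ZFS (a sketch; the pool and user names are invented):

# zfs create tank/home/alice
# zfs set quota=10g tank/home/alice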



Wednesday Jun 07, 2006

Update to icheck.sh

I have updated my icheck.sh script so that it now finds fragments. If you ask it to look for a fragment it will find all the inodes that use the block containing that fragment.

# ~cg13442/lang/sh/icheck -d /dev/rdsk/c0t0d0s0  dd6d7 2> /tmp/err
dd6d7 is a fragment address. Rounding to block 16#dd6d0
inode 1ade4: file block: 16#0 device block: 16#dd6d7
inode 1adc8: file block: 16#0 device block: 16#dd6d0
# find / -xdev \( -inum $((16#1ade4)) -o -inum $(( 16#1adc8 )) \) -print
/usr/apache/htdocs/manual/mod/mod_speling.html.ja.jis
/etc/passwd
#

The script is here.


Monday Jun 05, 2006

Mapping disk blocks to UFS file blocks.

Ever since Solaris 2.0 people have been asking for a way to map from a block on a disk back to the file that contains that block. SunOS 4 and earlier had icheck(8), but that was never available in Solaris 2.0.

The answer that is usually given is short: “Use fsdb”. However, fsdb is slightly less than friendly, and in fact an exhaustive search of a file system for a particular block would be close to impossible to do by hand.

I was left thinking that this must be scriptable. Since my particular issue was on Solaris 8, I had the added constraint that it had to be a shell script written for one of the shells available on Solaris 8.

As an example of things you can do I have written a script that will drive fsdb and can be used to:

  1. Copy files out of unmounted file systems (with the caveat that they get padded to a whole number of blocks). I used this to test the script, comparing the source file with the copy. I have left it in for amusement.

  2. Find which inode and offset contains a particular disk block (blocks get specified in hex):

    # icheck  -d  /dev/rdsk/c0d0s6  007e1590 007dd2a0 008bb6c0
    inode 5c94: file block: 16#80b device block: 16#007e1590
    inode 5c94: file block: 16#1ffff device block: 16#007dd2a0
    inode 5c94: file block: 16#7ffffff device block: 16#008bb6c0
    #

    This search can be directly limited to a single inode using the -i option and an inode number (in hex).

  3. Print the extents of a file. Again this is just mildly amusing but shows how well or badly UFS is doing laying out your files.

    # icheck -d /dev/rdsk/c0t0d0s0 -x -i 186e
    file off 0 dev off 6684 len 1279
    file off 1279 dev off 6683 len 1
    #


The user interface could do with being tidied up, but my original goal has been satisfied.


The script itself is not for those with a weak stomach as it works around some bugs and features in fsdb. The script is here if you wish to see the full horror.



Tuesday Mar 07, 2006

When you have lost your data.....

Today was not the first time, and won't be the last, that someone contacted me after losing a load of data. A previous incident was when my partner was writing a letter on my Ultra 1 at home and my daughter decided that the system would run much better without power.

On rebooting, the file was zero length. The letter had taken days to craft, so simply starting again was really low on the list of priorities. The file system on which it had lived was UFS, I could access the system over the network (I was on a business trip at the time), and there were no backups...

The first thing in any situation like this is not to panic. You need to get the partition that used to contain the data quiesced so that no further damage is done. Unmount it if you can. Then take a backup of it using dd:

dd if=/dev/rdsk/c0t0d0s4 of=copy_of_slice bs=128k

Now you use copy_of_slice to go looking for the blocks. If the file is a text file then you can use strings and grep to search for blocks that may contain your data. Specifically:


strings -a -t d < copy_of_slice | grep "text in document"

This outputs the strings that contain “text in document” along with their byte offsets; you then use those offsets to read the blocks.


73152472 moz-abmdbdirectory://abook.mab
136142157 fc9roaming.default.files.abook.mab
136151743 7moz-abmdbdirectory://abook.mab
136151779 !6fb-moz-abmdbdirectory://abook.mab


I use a shell function like this for a file system with an 8K block size:


function readblock
{
        # $1 is a byte offset as reported by strings -t d; round it down
        # to an 8K block boundary and read just that one block.
        dd bs=8k count=1 iseek=$(($1 / 8192)) < slice7
}


The copy of the slice in this case was called slice7. Then, to get the blocks:


$ readblock 73152472


Then you have to use your imagination and skill to put the blocks back together. In my case the letter was recovered, sent, and had the desired outcome.


Today's example is not looking so good. Firstly, the victim had actually run suninstall over the drive, with no backup (stop giggling at the back), which relabelled the drive and then ran newfs on the partition. Then, when the dd was run, the output file was written onto the same disk, so if the label did not match, more damage was done. I might suggest that he run over the drive and then throw it into the pond, just to make life interesting. It's a pity, as since only the superblocks would have been written the chances of recovery were not that bad.


So to recap: don't get into this situation. Back up everything. Use ZFS, use snapshots, lots of them.


However if you have lost your data and want to stand any chance of getting it back:

  1. Don't Panic.

  2. Quiesce the file system. Powering off the system may well be your best option.

  3. Get a bit-for-bit copy of the disk that had the data. All slices. Do this while booted off release media (see the sketch after this list).

  4. Hope you are lucky.
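
Step 3 looks something like this (a sketch only; the device name and the destination path are invented, and slice 2 conventionally covers the whole disk so one copy captures every slice):

# dd if=/dev/rdsk/c0t0d0s2 of=/net/somehost/export/recovery/c0t0d0s2.img bs=128k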



Friday Apr 01, 2005

When to run fsck

Not when the file system is mounted!

I've been banging my head against this one, off and on, for a few weeks. I got an email from an engineer who was talking to a customer (who is, of course, always right) saying that when they ran fsck on a live file system it reported errors:

    # fsck /
    ** /dev/vx/rdsk/rootvol
    ** Currently Mounted on /
    ** Phase 1 - Check Blocks and Sizes
    ** Phase 2 - Check Pathnames
    ** Phase 3 - Check Connectivity
    UNREF DIRECTORY I=5522736 OWNER=root MODE=40755
    SIZE=512 MTIME=Mar 31 13:07 2005
    CLEAR? y

    ** Phase 4 - Check Reference Counts
    ** Phase 5 - Check Cyl groups

    67265 files, 1771351 used, 68625795 free (14451 frags, 8576418 blocks, 0.0% fragmentation)

    ***** FILE SYSTEM WAS MODIFIED *****

I kept telling them that running fsck on a live file system can, and probably will, generate these “errors”. The kernel's in-memory copy of the file system is correct and it will eventually bring the on-disk copy back into line. However, by answering yes they have now corrupted the on-disk copy of the file system, and to make things worse the kernel does not know this, so it may not run fsck when the system boots. The warnings section of the fsck and fsck_ufs manual pages gives you a hint that this is a bad thing to do.

The reason they were running fsck was to check the consistency of the file system prior to adding a patch. The right way to do that would be to run pkgchk.
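
That amounts to something as simple as this (a sketch; with no arguments pkgchk checks every installed package, and -n skips volatile and editable files, so no output means the installed files match their packages):

# pkgchk -n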

There are times when it is safe to run fsck on a live file system, but they are rare and involve lockfs. Before you do, make sure you really understand what you are doing; my bet is that if you do know, you won't really want to.
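
For the record, the sort of thing involved looks like this (a sketch only; the mount point and device are invented, and the file system stays write-locked, blocking all writes, for the duration of the check):

# lockfs -w /export/home        # write-lock and flush the file system
# fsck -n /dev/rdsk/c0t0d0s7    # read-only check of the now quiescent on-disk state
# lockfs -u /export/home        # release the lock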

I believe the message is now understood by all involved, but I'm trying to make sure by adding it to the blogosphere.

About

This is the old blog of Chris Gerhard. It has mostly moved to http://chrisgerhard.wordpress.com
