Wednesday Jan 06, 2010

ZFS resilver performance improved!

I'm not having much luck with the Western Digital 1TB disks in my home server.

That is to say I'm extremely pleased I have ZFS, as both have now failed and in doing so corrupted data, which ZFS detected (although the users noticed the problem first as the performance of the drive became so poor). One of the biggest irritations about replacing drives, apart from having to shut the system down as I don't have hot-swap hardware, is waiting for the pool to resilver. Previously this has taken in excess of 24 hours to do.

However, yesterday's resilver came after I had upgraded to build 130, which has some improvements to the resilver code:


: pearson FSS 1 $; zpool status tank
  pool: tank
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: resilver completed after 6h28m with 0 errors on Wed Jan  6 02:04:17 2010
config:

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            c21t0d0s7  ONLINE       0     0     0
            c20t1d0s7  ONLINE       0     0     0
          mirror-1     ONLINE       0     0     0
            c21t1d0    ONLINE       0     0     0  225G resilvered
            c20t0d0    ONLINE       0     0     0

errors: No known data errors
: pearson FSS 2 $; 
Only 6½ hours for 225G, which, while not close to the theoretical maximum, is way better than 24 hours, and the system was usable while this was going on.
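
As a back-of-the-envelope check (plain arithmetic, not a benchmark), 225G in 6 hours 28 minutes averages out at just under 10MB/s over the whole resilver; ksh93 will do the sum:

printf "%.1f MB/s\n" $(( (225.0 * 1024) / (6 * 3600 + 28 * 60) ))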

Sunday Jan 03, 2010

Automatic virus scanning with c-icap & ZFS

Now that I have OpenSolaris running on the home server I thought I would take advantage of the virus scanning capabilities using the clamav instance I have running. After downloading, compiling and installing c-icap I was able to get the service up and running quickly using the instructions here.

However using a simple test of trying to copy an entire home directory I would see regular errors of the form:

Jan  2 16:18:49 pearson vscand: [ID 940187 daemon.error] Error receiving data from Scan Engine: Error 0

These were accompanied by an error being returned to the application and the error count reported by vscanadm stats being incremented.

From the source it was clear that recv1 was returning 0, indicating that the virus scan engine had closed the connection. What was not clear was why.

So I ran this D script to see if what was in the buffer being read would give a clue:


root@pearson:/root# cat vscan.d 
pid$target::vs_icap_readline:entry
{
        self->buf = arg1;
        self->buflen = arg2;
}
syscall::recv:return /self->buf && arg1 == 0/
{
        this->b = copyin(self->buf, self->buflen);
        trace(stringof(this->b));
}
pid$target::vs_icap_readline:return
/self->buf/
{
        self->buf=0;
        self->buflen=0;
}
root@pearson:/root# 

root@pearson:/root# dtrace -s  vscan.d -p $(pgrep vscand)
dtrace: script 'vscan.d' matched 3 probes
CPU     ID                    FUNCTION:NAME
  1   4344                      recv:return 
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

The clue was that the error comes back on the very first byte being read. The virus scan engine is deliberately closing the connection after handling a request, which, since it had negotiated "keep-alive", it should not do.

The solution2 was to set the MaxKeepAliveRequests entry in the c-icap.conf file to -1 and thereby disable this feature.
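
For reference the change is a one line edit; the relevant entry in c-icap.conf (its location depends on the prefix c-icap was built with) ends up reading as below, and the c-icap daemon needs a restart to pick it up:

MaxKeepAliveRequests -1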

1Why is recv being used to read one byte at a time? Insane; a bug will be filed.

2It is in my opinion a bug that vscand can't cope gracefully with this. Another bug will be filed.

Tuesday May 26, 2009

Why everyone should be using ZFS

It is at times like these that I'm glad I use ZFS at home.


  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c20t0d0s7  ONLINE       6     0     4
            c21t0d0s7  ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c21t1d0    ONLINE       0     0     0
            c20t1d0    ONLINE       0     0     0

errors: No known data errors
: pearson FSS 14 $; 

The drive with the errors was also throwing up errors that iostat could report and, judging from its performance, was trying heroically to give me back data. However it had failed. Its performance was terrible and it failed to return the right data on 4 occasions. Any other file system would, if that was user data, have just delivered it to the user without warning. That bad data could then have propagated from there on, probably into my backups. There is certainly no good that could come from that. However ZFS detected and corrected the errors.


Now I have offlined the disk the performance of the system is better, but I have no redundancy until the new disk I have just ordered arrives. Time to check out Seagate's warranty return system.
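
For the record, taking the failing half of the mirror out of service is a one-liner, using the device name from the status output above:

pfexec zpool offline tank c20t0d0s7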

Sunday Apr 19, 2009

User and group quotas for ZFS!

This push will be very popular among those who are managing servers with thousands of users:

Repository: /export/onnv-gate
Total changesets: 1

Changeset: f41cf682d0d3

Comments:
PSARC/2009/204 ZFS user/group quotas & space accounting
6501037 want user/group quotas on ZFS
6830813 zfs list -t all fails assertion
6827260 assertion failed in arc_read(): hdr == pbuf->b_hdr
6815592 panic: No such hold X on refcount Y from zfs_znode_move
6759986 zfs list shows temporary %clone when doing online zfs recv

User quotas for ZFS have been the feature I have been asked about most when talking to customers. This probably reflects that most customers are simply blown away by the other features of ZFS and, if you have a large user base, the only missing feature was user quotas.
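
For anyone who wants to try it once they are running a build with this push, the quotas are just per-user and per-group properties on a file system; a quick sketch, with made-up dataset, user and group names:

pfexec zfs set userquota@cjg=10G tank/home
pfexec zfs set groupquota@staff=100G tank/home
zfs userspace tank/home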

Tuesday Apr 14, 2009

zfs list -d

I've just pushed the changes for zfs list that give it a -d option to limit the depth to which recursive listings will go. This is of most use when you wish to list the snapshots of a given data set and only the snapshots of that data set.

PSARC 2009/171 zfs list -d and zfs get -d
6762432 zfs list --depth

Before this you could achieve the same thing using a short pipeline which, while it produced the correct results, was horribly inefficient and very slow for datasets that had lots of descendants.

: v4u-1000c-gmp03.eu TS 6 $; zfs list -t snapshot rpool | grep '^rpool@'
rpool@spam                         0      -    64K  -
rpool@two                          0      -    64K  -
: v4u-1000c-gmp03.eu TS 7 $; zfs list -d 1 -t snapshot              
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool@spam      0      -    64K  -
rpool@two       0      -    64K  -
: v4u-1000c-gmp03.eu TS 8 $; 

It will allow the zfs-snapshot service to be much more efficient when it needs to list snapshots. The change will be in build 113.

Sunday Apr 05, 2009

Recovering our Windows PC

I had reason to discover whether my solution for backing up the Windows PC worked. Apparently the PC had not been working properly for a while but no one had mentioned that to me. The symptoms were:

  1. No menu bar at the bottom of the screen. It was almost like the screen was the wrong size but how it was changed is/was a mystery.

  2. It was claiming it needed to revalidate itself as the hardware had changed, which it categorically had not, and that I had 2 days to sort it out. Apparently this message had been around for a few days (weeks?) but was ignored.

Now I'm sure I could have had endless fun reading forums to find out how to fix these things, but it was Saturday night and I was going cycling in the morning. So time to boot Solaris and restore the backup. First I took a backup of what was on the disk, just in case I ever get a desire to relive the issue. Then I just needed one script to restore it over ssh. The script is:

: pearson FSS 14 $; cat /usr/local/sbin/xp_restore 
#!/bin/ksh 

exec dd of=/dev/rdsk/c0d0p1 bs=1k
: pearson FSS 15 $; 

and the command was:

$ ssh pc pfexec /usr/local/sbin/xp_restore < backup.dd

having chosen the desired snapshot. Obviously the command was added to /etc/security/exec_attr. Then just leave that running overnight. In the morning the system booted up just fine, complained about the virus definitions being out of date and various things needing updates, but it was all working. Alas, doing this before I went cycling made me late enough to miss the peloton, if it was there.
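
For anyone wondering, the exec_attr entry is a single line along the following lines; the profile name “XP Restore” is made up and the profile has to be one that is actually assigned to the user running the command:

XP Restore:suser:cmd:::/usr/local/sbin/xp_restore:uid=0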

Saturday Mar 28, 2009

snapshot on unlink?

This thread on OpenSolaris made me wonder how hard it would be to take a snapshot before any file is deleted. It turns out that using dtrace it is not hard at all. Using dtrace to monitor unlink and unlinkat calls and a short script to take the snapshots:

#!/bin/ksh93

# Walk up from the deleted file until a file system with a .zfs/snapshot
# directory is found and then take a snapshot, named after the timestamp of
# the unlink, by making a directory in .zfs/snapshot.
function snapshot
{
	eval $(print x=$2)

	until [[ "$x" == "/" || -d "$x/.zfs/snapshot" ]]
	do
		x="${x%/*}"
	done
	if [[ "$x" == "/" || "$x" == "/tmp" ]]
	then
		return
	fi
	if [[ -d "$x/.zfs/snapshot" ]]
	then
		print mkdir "$x/.zfs/snapshot/unlink_$1"
		pfexec mkdir "$x/.zfs/snapshot/unlink_$1"
	fi
}
# Build an absolute path for the deleted file from the root, cwd and path
# reported by dtrace.
function parse
{
	eval $(print x=$4)

	if [[ "${x%%/*}" == "" ]]
	then
		snapshot $1 "$2$4"
	else
		snapshot $1 "$2$3/$4"
	fi
}
# Watch unlink and unlinkat (via fsat) calls from ordinary users, stop() the
# calling process, take the snapshot and then let the process continue.
pfexec dtrace -wqn 'syscall::fsat:entry /pid != '$$' && uid > 100 && arg0 == 5/ {
	printf("%d %d \"%s\" \"%s\" \"%s\"\n",
	pid, walltimestamp, root, cwd, copyinstr(arg2)); stop()
}
syscall::unlink:entry /pid != '$$' && uid > 100 / {
	printf("%d %d \"%s\" \"%s\" \"%s\"\n",
	pid, walltimestamp, root, cwd, copyinstr(arg0)); stop()
}' | while read pid timestamp root cwd file
do
	print prun $pid
	parse $timestamp $root $cwd $file
	pfexec prun $pid
done

Now this is just a Saturday night proof of concept, and it should be noted that it has a significant performance impact and single-threads all calls to unlink.

Also you end up with lots of snapshots:

cjg@brompton:~$ zfs list -t snapshot -o name,used | grep unlink

rpool/export/home/cjg@unlink_1238270760978466613                           11.9M

rpool/export/home/cjg@unlink_1238275070771981963                             59K

rpool/export/home/cjg@unlink_1238275074501904526                             59K

rpool/export/home/cjg@unlink_1238275145860458143                             34K

rpool/export/home/cjg@unlink_1238275168440000379                            197K

rpool/export/home/cjg@unlink_1238275233978665556                            197K

rpool/export/home/cjg@unlink_1238275295387410635                            197K

rpool/export/home/cjg@unlink_1238275362536035217                            197K

rpool/export/home/cjg@unlink_1238275429554657197                            136K

rpool/export/home/cjg@unlink_1238275446884300017                            350K

rpool/export/home/cjg@unlink_1238275491543380576                            197K

rpool/export/home/cjg@unlink_1238275553842097361                            197K

rpool/export/home/cjg@unlink_1238275643490236001                             63K

rpool/export/home/cjg@unlink_1238275644670212158                             63K

rpool/export/home/cjg@unlink_1238275646030183268                               0

rpool/export/home/cjg@unlink_1238275647010165407                               0

rpool/export/home/cjg@unlink_1238275648040143427                             54K

rpool/export/home/cjg@unlink_1238275649030124929                             54K

rpool/export/home/cjg@unlink_1238275675679613928                            197K

rpool/export/home/cjg@unlink_1238275738608457151                            198K

rpool/export/home/cjg@unlink_1238275800827304353                           57.5K

rpool/export/home/cjg@unlink_1238275853116324001                           32.5K

rpool/export/home/cjg@unlink_1238275854186304490                           53.5K

rpool/export/home/cjg@unlink_1238275862146153573                            196K

rpool/export/home/cjg@unlink_1238275923255007891                           55.5K

rpool/export/home/cjg@unlink_1238275962114286151                           35.5K

rpool/export/home/cjg@unlink_1238275962994267852                           56.5K

rpool/export/home/cjg@unlink_1238275984723865944                           55.5K

rpool/export/home/cjg@unlink_1238275986483834569                             29K

rpool/export/home/cjg@unlink_1238276004103500867                             49K

rpool/export/home/cjg@unlink_1238276005213479906                             49K

rpool/export/home/cjg@unlink_1238276024853115037                           50.5K

rpool/export/home/cjg@unlink_1238276026423085669                           52.5K

rpool/export/home/cjg@unlink_1238276041792798946                           50.5K

rpool/export/home/cjg@unlink_1238276046332707732                           55.5K

rpool/export/home/cjg@unlink_1238276098621721894                             66K

rpool/export/home/cjg@unlink_1238276108811528303                           69.5K

rpool/export/home/cjg@unlink_1238276132861080236                             56K

rpool/export/home/cjg@unlink_1238276166070438484                             49K

rpool/export/home/cjg@unlink_1238276167190417567                             49K

rpool/export/home/cjg@unlink_1238276170930350786                             57K

rpool/export/home/cjg@unlink_1238276206569700134                           30.5K

rpool/export/home/cjg@unlink_1238276208519665843                           58.5K

rpool/export/home/cjg@unlink_1238276476484690821                             54K

rpool/export/home/cjg@unlink_1238276477974663478                             54K

rpool/export/home/cjg@unlink_1238276511584038137                           60.5K

rpool/export/home/cjg@unlink_1238276519053902818                             71K

rpool/export/home/cjg@unlink_1238276528213727766                             62K

rpool/export/home/cjg@unlink_1238276529883699491                             47K

rpool/export/home/cjg@unlink_1238276531683666535                           3.33M

rpool/export/home/cjg@unlink_1238276558063169299                           35.5K

rpool/export/home/cjg@unlink_1238276559223149116                           62.5K

rpool/export/home/cjg@unlink_1238276573552877191                           35.5K

rpool/export/home/cjg@unlink_1238276584602668975                           35.5K

rpool/export/home/cjg@unlink_1238276586002642752                             53K

rpool/export/home/cjg@unlink_1238276586522633206                             51K

rpool/export/home/cjg@unlink_1238276808718681998                            216K

rpool/export/home/cjg@unlink_1238276820958471430                           77.5K

rpool/export/home/cjg@unlink_1238276826718371992                             51K

rpool/export/home/cjg@unlink_1238276827908352138                             51K

rpool/export/home/cjg@unlink_1238276883227391747                            198K

rpool/export/home/cjg@unlink_1238276945366305295                           58.5K

rpool/export/home/cjg@unlink_1238276954766149887                           32.5K

rpool/export/home/cjg@unlink_1238276955946126421                           54.5K

rpool/export/home/cjg@unlink_1238276968985903108                           52.5K

rpool/export/home/cjg@unlink_1238276988865560952                             31K

rpool/export/home/cjg@unlink_1238277006915250722                           57.5K

rpool/export/home/cjg@unlink_1238277029624856958                             51K

rpool/export/home/cjg@unlink_1238277030754835625                             51K

rpool/export/home/cjg@unlink_1238277042004634457                           51.5K

rpool/export/home/cjg@unlink_1238277043934600972                             52K

rpool/export/home/cjg@unlink_1238277045124580763                             51K

rpool/export/home/cjg@unlink_1238277056554381122                             51K

rpool/export/home/cjg@unlink_1238277058274350998                             51K

rpool/export/home/cjg@unlink_1238277068944163541                             59K

rpool/export/home/cjg@unlink_1238277121423241127                           32.5K

rpool/export/home/cjg@unlink_1238277123353210283                           53.5K

rpool/export/home/cjg@unlink_1238277136532970668                           52.5K

rpool/export/home/cjg@unlink_1238277152942678490                               0

rpool/export/home/cjg@unlink_1238277173482320586                               0

rpool/export/home/cjg@unlink_1238277187222067194                             49K

rpool/export/home/cjg@unlink_1238277188902043005                             49K

rpool/export/home/cjg@unlink_1238277190362010483                             56K

rpool/export/home/cjg@unlink_1238277228691306147                           30.5K

rpool/export/home/cjg@unlink_1238277230021281988                           51.5K

rpool/export/home/cjg@unlink_1238277251960874811                             57K

rpool/export/home/cjg@unlink_1238277300159980679                           30.5K

rpool/export/home/cjg@unlink_1238277301769961639                             50K

rpool/export/home/cjg@unlink_1238277302279948212                             49K

rpool/export/home/cjg@unlink_1238277310639840621                             28K

rpool/export/home/cjg@unlink_1238277314109790784                           55.5K

rpool/export/home/cjg@unlink_1238277324429653135                             49K

rpool/export/home/cjg@unlink_1238277325639636996                             49K

rpool/export/home/cjg@unlink_1238277360029166691                            356K

rpool/export/home/cjg@unlink_1238277375738948709                           55.5K

rpool/export/home/cjg@unlink_1238277376798933629                             29K

rpool/export/home/cjg@unlink_1238277378458911557                             50K

rpool/export/home/cjg@unlink_1238277380098888676                             49K

rpool/export/home/cjg@unlink_1238277397738633771                             48K

rpool/export/home/cjg@unlink_1238277415098386055                             49K

rpool/export/home/cjg@unlink_1238277416258362893                             49K

rpool/export/home/cjg@unlink_1238277438388037804                             57K

rpool/export/home/cjg@unlink_1238277443337969269                           30.5K

rpool/export/home/cjg@unlink_1238277445587936426                           51.5K

rpool/export/home/cjg@unlink_1238277454527801430                           50.5K

rpool/export/home/cjg@unlink_1238277500967098623                            196K

rpool/export/home/cjg@unlink_1238277562866135282                           55.5K

rpool/export/home/cjg@unlink_1238277607205456578                             49K

rpool/export/home/cjg@unlink_1238277608135443640                             49K

rpool/export/home/cjg@unlink_1238277624875209357                             57K

rpool/export/home/cjg@unlink_1238277682774484369                           30.5K

rpool/export/home/cjg@unlink_1238277684324464523                             50K

rpool/export/home/cjg@unlink_1238277685634444004                             49K

rpool/export/home/cjg@unlink_1238277686834429223                           75.5K

rpool/export/home/cjg@unlink_1238277700074256500                             48K

rpool/export/home/cjg@unlink_1238277701924235244                             48K

rpool/export/home/cjg@unlink_1238277736473759068                           49.5K

rpool/export/home/cjg@unlink_1238277748313594650                           55.5K

rpool/export/home/cjg@unlink_1238277748413593612                             28K

rpool/export/home/cjg@unlink_1238277750343571890                             48K

rpool/export/home/cjg@unlink_1238277767513347930                           49.5K

rpool/export/home/cjg@unlink_1238277769183322087                             50K

rpool/export/home/cjg@unlink_1238277770343306935                             48K

rpool/export/home/cjg@unlink_1238277786193093885                             48K

rpool/export/home/cjg@unlink_1238277787293079433                             48K

rpool/export/home/cjg@unlink_1238277805362825259                           49.5K

rpool/export/home/cjg@unlink_1238277810602750426                            195K

rpool/export/home/cjg@unlink_1238277872911814531                            195K

rpool/export/home/cjg@unlink_1238277934680920214                            195K

rpool/export/home/cjg@unlink_1238277997220016825                            195K

rpool/export/home/cjg@unlink_1238278063868871589                           54.5K

rpool/export/home/cjg@unlink_1238278094728323253                             61K

rpool/export/home/cjg@unlink_1238278096268295499                             63K

rpool/export/home/cjg@unlink_1238278098518260168                             52K

rpool/export/home/cjg@unlink_1238278099658242516                             56K

rpool/export/home/cjg@unlink_1238278103948159937                             57K

rpool/export/home/cjg@unlink_1238278107688091854                             54K

rpool/export/home/cjg@unlink_1238278113907980286                             62K

rpool/export/home/cjg@unlink_1238278116267937390                             64K

rpool/export/home/cjg@unlink_1238278125757769238                            196K

rpool/export/home/cjg@unlink_1238278155387248061                            136K

rpool/export/home/cjg@unlink_1238278160547156524                            229K

rpool/export/home/cjg@unlink_1238278165047079863                            351K

rpool/export/home/cjg@unlink_1238278166797050407                            197K

rpool/export/home/cjg@unlink_1238278168907009714                             55K

rpool/export/home/cjg@unlink_1238278170666980686                            341K

rpool/export/home/cjg@unlink_1238278171616960684                           54.5K

rpool/export/home/cjg@unlink_1238278190336630319                            777K

rpool/export/home/cjg@unlink_1238278253245490904                            329K

rpool/export/home/cjg@unlink_1238278262235340449                            362K

rpool/export/home/cjg@unlink_1238278262915331213                            362K

rpool/export/home/cjg@unlink_1238278264915299508                            285K

rpool/export/home/cjg@unlink_1238278310694590970                             87K

rpool/export/home/cjg@unlink_1238278313294552482                             66K

rpool/export/home/cjg@unlink_1238278315014520386                             31K

rpool/export/home/cjg@unlink_1238278371773568934                            258K

rpool/export/home/cjg@unlink_1238278375673503109                            198K

rpool/export/home/cjg@unlink_1238278440802320314                            138K

rpool/export/home/cjg@unlink_1238278442492291542                           55.5K

rpool/export/home/cjg@unlink_1238278445312240229                           2.38M

rpool/export/home/cjg@unlink_1238278453582077088                            198K

rpool/export/home/cjg@unlink_1238278502461070222                            256K

rpool/export/home/cjg@unlink_1238278564359805760                            256K

rpool/export/home/cjg@unlink_1238278625738732194                           63.5K

rpool/export/home/cjg@unlink_1238278633428599541                           61.5K

rpool/export/home/cjg@unlink_1238278634568579678                            137K

rpool/export/home/cjg@unlink_1238278657838186760                            288K

rpool/export/home/cjg@unlink_1238278659768151784                            223K

rpool/export/home/cjg@unlink_1238278661518121640                            159K

rpool/export/home/cjg@unlink_1238278664378073421                            136K

rpool/export/home/cjg@unlink_1238278665908048641                            138K

rpool/export/home/cjg@unlink_1238278666968033048                            136K

rpool/export/home/cjg@unlink_1238278668887996115                            281K

rpool/export/home/cjg@unlink_1238278670307970765                            227K

rpool/export/home/cjg@unlink_1238278671897943665                            162K

rpool/export/home/cjg@unlink_1238278673197921775                            164K

rpool/export/home/cjg@unlink_1238278674027906895                            164K

rpool/export/home/cjg@unlink_1238278674657900961                            165K

rpool/export/home/cjg@unlink_1238278675657885128                            165K

rpool/export/home/cjg@unlink_1238278676647871187                            241K

rpool/export/home/cjg@unlink_1238278678347837775                            136K

rpool/export/home/cjg@unlink_1238278679597811093                            199K

rpool/export/home/cjg@unlink_1238278687297679327                            197K

rpool/export/home/cjg@unlink_1238278749616679679                            197K

rpool/export/home/cjg@unlink_1238278811875554411                           56.5K

cjg@brompton:~$ 

Good job that snapshots are cheap. I'm not going to be doing this all the time but it makes you think what could be done.
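
Clearing them out again is at least simple; something along these lines does it, though double check the grep pattern before running anything that calls zfs destroy in a loop:

zfs list -H -o name -t snapshot | grep '@unlink_' | while read snap
do
	pfexec zfs destroy "$snap"
done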

Friday Mar 27, 2009

zfs list webrev

I've just posted the webrev for review for an RFE to “zfs list”:

PSARC 2009/171 zfs list -d and zfs get -d
6762432 zfs list --depth

This will allow you to limit the depth to which a recursive listing of zfs file systems will go. This is particularly useful if you only want to list the snapshots of the current file system.

The webrev is here:

http://cr.opensolaris.org/~cjg/zfs_list/zfs_list-d-2/

Comments welcome.

Saturday Feb 21, 2009

[Open]Solaris logfiles and ZFS root

This week I had reason to want to see how often the script that controls the access hours of Sun Ray users actually did its work, so I went off to look in the messages files, only to discover that there were only four and they only went back to January 11.

: pearson FSS 22 $; ls -l mess*
-rw-r--r--   1 root     root       12396 Feb  8 23:58 messages
-rw-r--r--   1 root     root      134777 Feb  8 02:59 messages.0
-rw-r--r--   1 root     root       53690 Feb  1 02:06 messages.1
-rw-r--r--   1 root     root      163116 Jan 25 02:01 messages.2
-rw-r--r--   1 root     root       83470 Jan 18 00:21 messages.3
: pearson FSS 23 $; head -1 messages.3
Jan 11 05:29:14 pearson pcplusmp: [ID 444295 kern.info] pcplusmp: ide (ata) instance #1 vector 0xf ioapic 0x2 intin 0xf is bound to cpu 1
: pearson FSS 24 $; 

I am certain that the choice of only four log files was not a conscious decision I had made, but it did make me ponder whether logfile management should be revisited in the light of ZFS root, since clearly if you have snapshots firing the logs could go back a lot further:

: pearson FSS 40 $; head -1 $(ls -t /.zfs/snapshot/*/var/adm/message* | tail -1)
Dec 14 03:15:14 pearson time-slider-cleanup: [ID 702911 daemon.notice] No more daily snapshots left
: pearson FSS 41 $; 

It did not take long for this shell function to burst into life:

function search_log
{
typeset path
# Turn a relative path into an absolute one.
if [[ ${2#/} == $2 ]]
then
        path=${PWD}/$2
else
        path=$2
fi
# Search the live file and every snapshot copy of it; sort by month and
# drop the duplicate lines that appear in more than one snapshot.
cat $path /.zfs/snapshot/*$path | egrep $1 | sort -M | uniq
}

Not a generalized solution, but one that works when your root filesystem contains all your logs and, if you remember to escape any globbing on the command line, will search all the log files:

: pearson FSS 46 $; search_log block /var/adm/messages\* | wc
      51     688    4759
: pearson FSS 47 $; 

There are two ways to view this. Either it is great that the logs are kept and so I have all this historical data, or it is a pain as getting rid of log files becomes more of a chore. Indeed this is encouraging me to move all the logfiles into their own file systems so that the management of those logfiles is more granular.

At the very least it seems to me that OpenSolaris should sort out where its log files are going, stop putting the messages files in /var/adm, and move them to /var/log, which should then be its own file system.

Saturday Jan 03, 2009

Making the zfs snapshot service run faster

I've not been using Tim's auto-snapshot service on my home server, as once I had configured it to work on my server I noticed it had a large impact on the system:

: pearson FSS 15 $; time /lib/svc/method/zfs-auto-snapshot \
         svc:/system/filesystem/zfs/auto-snapshot:frequent

real    1m22.28s
user    0m9.88s
sys     0m33.75s
: pearson FSS 16 $;

The reason is twofold. First, reading all the properties from the pool takes time, and second, it destroys the unneeded snapshots as it takes new ones, something the service I had been using cheats on and does only very late at night. Looking at the script there are plenty of things that could be made faster, so I wrote a python version that could replace the cron job; the results, while an improvement, were disappointing:

: pearson FSS 16 $; time ./zfs.py \
         svc:/system/filesystem/zfs/auto-snapshot:frequent

real    0m47.19s
user    0m9.45s
sys     0m31.54s
: pearson FSS 17 $; 

Still too slow to actually use. The time was dominated by cases where the script could not use a recursive option to delete the snapshots, the problem being that there is no way to list all the snapshots of a filesystem or volume without also listing those of its descendants.


Consider this structure:

# zfs list -r -o name,com.sun:auto-snapshot tank
NAME                                  COM.SUN:AUTO-SNAPSHOT
tank                                  true
tank/backup                           false
tank/dump                             false
tank/fs                               true
tank/squid                            false
tank/tmp                              false

The problem here is that the script wants to snapshot and clean up “tank” but can't use recursion without also snapshotting all the other file systems that have the flag set to false, and set to false for very good reason. However, if I did not bother to snapshot “tank” then tank/fs could be managed recursively and there would be no need for special handling. The above list does not reflect all the file systems I have, but you get the picture. The result of making this change brings the timing for the service down to:

: pearson FSS 21 $; time ./zfs.py \
         svc:/system/filesystem/zfs/auto-snapshot:frequent

real    0m9.27s
user    0m2.43s
sys     0m4.66s
: pearson FSS 22 $; time /lib/svc/method/zfs-auto-snapshot \
         svc:/system/filesystem/zfs/auto-snapshot:frequent

real    0m12.85s
user    0m2.10s
sys     0m5.42s
: pearson FSS 23 $; 

While the python script still gets better results than the korn shell script, the korn shell script does not do so badly. However it still seems worthwhile spending the time to get the python script to handle all the features of the korn shell script. More later.
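
To illustrate what the recursion buys, here is a minimal sketch of the idea rather than the real zfs-auto-snapshot logic, with made-up snapshot names: because “tank” itself is no longer snapshotted, everything under tank/fs can be created and destroyed with one recursive command each instead of walking every descendant:

pfexec zfs snapshot -r tank/fs@frequent-2009-01-03-21:00
pfexec zfs destroy -r tank/fs@frequent-2009-01-03-20:00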

Sunday Nov 23, 2008

Two pools on one drive?

Now I'm committed to ZFS root I'm left with a dilemma. Given the four drives I have in the system, the amount of data I have and the fact that the drives are of different sizes, raidz2 is not an option even though it would give the greatest protection for the data, so the next best solution is some form of mirroring. Initially I simply had two pools, which offers good redundancy and allows ZFS root to work but gives suboptimal performance. If I could stripe the pool that would be better, but that does not work with ZFS root.

However, since I used to run with a future-proof Disk Suite, UFS-based root, I still have the space that used to contain the two UFS boot environments, into which I had intended to grow the pool once they were not needed. What if I did not grow the pool but instead put a second pool on that partition? Then I would have one pool, “rpool”, mirrored across part of the boot disks, and the data pool, “tank”, mirrored over the rest of the boot drives and striped across a second mirror consisting of the entire second pair of drives.

Clearly the solution is suboptimal but given the constraints of ZFS root and the hardware I have would this perform better?

I should point out that the system as it is does not perform badly, but I don't want to leave performance on the table if I don't have to. I'm not going to rush into this (that is, I've not already done it) since growing the pool is a one-way operation, there being no way to shrink it again, although at the moment I am minded to do it.

Comments welcome

Saturday Nov 22, 2008

Forced to upgrade

Build 103 and ZFS root have come to the home server. While I was travelling the system hit bug 6746456, which resulted in the system panicking every time it booted, so I was forced to return to build 100 and have now upgraded to build 103. Live upgrade using UFS would not work at all and, since I have the space, I've moved over to ZFS root. However the nautilus bug is still in build 103, so I'm either going to have to live with it, which is impossible, disable nautilus completely, or work to get the time slider feature disabled until it is usable. Disabling nautilus, while irritating, is effectively what I have had to do now, so that could be the medium-term solution.
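
If it does come to disabling time slider, the brute-force approach, and this is just a sketch since there may well be a neater way, is to disable the auto-snapshot service instances behind it:

for i in frequent hourly daily weekly monthly
do
	pfexec svcadm disable svc:/system/filesystem/zfs/auto-snapshot:$i
done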

The other news for the home server was the failure of the power supply. So it was goodbye to the small Antec case that used to house the server; since it did not really save any space a more traditional desk-side unit, which also allows up to six internal drives, has replaced it. Since ZFS root will not support booting from striped pools, the extra two drives I have form a second pool.

# zpool list
NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
pool2   294G  36.8G   257G    12%  ONLINE  -
tank    556G   307G   249G    55%  ONLINE  -
# 

The immediate effect of two pools is being able to have the Solaris image from which I upgraded on a different pair of disks from the ones being upgraded, with a dramatic performance boost. The other is that I can let the automatic snapshot service take control of the new pool rather than add it to my old snapshot policy. Early on I realised I needed to turn off snapshots on the swap volumes, which are on both pools (to get some striping):

# zfs set com.sun:auto-snapshot=false pool2/swap
# zfs set com.sun:auto-snapshot=false tank/swap
# 

should do it.

Friday Oct 24, 2008

Brief visit to build 100 at home.

I finally managed to upgrade the home server to build 100 by deleting all the zones and then upgrading. However a few minutes of running resulted in this bug:

6763600 nautilus becomes unusable on a system with 39000 snapshots

which is a pity as I was really excited about time slider. The positive part of this is that this is yet another feature that will be improved by users running OpenSolaris variants at home.

Alas, while I could live without nautilus (the only file chooser I recall on a vt220 was vsh and I never really got on with that) the other users at home cannot, so the system is back running build 96.

Monday Oct 06, 2008

Incremental back up of Windows XP to ZFS

I am forced to have a Windows system at home which thankfully only very occasionally gets used. However, even though everything that gets on it is virus scanned, all email is scanned before it gets near it and none of the users are administrators, I still like to keep it backed up.

Given I have a server on the network which has ZFS file systems with spare capacity, I decided that I could do this just using the dd(1) command, which I have written about before. Using that to copy the entire disk image to a file on ZFS allows me to back the system up. However, if I snapshot the backup file system and then back up again, every block gets rewritten and so takes up space on the server even if it has not changed (roll on dedup). To stop this I have a tiny program that mmap()s the entire backup file and then only updates the blocks that have changed.

I call it syncer for no good reason:

#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <stdio.h>
#include <sys/time.h>
/*
 * Build with:
 *              cc -m64 -o syncer syncer.c
 */

/*
 * Match this to the file system record size.
 */
#define BLOCK_SIZE (128 * 1024)
#define KILO 1024
#define MEG (KILO * KILO)
#define MSEC (1000LL)
#define NSEC (MSEC * MSEC)
#define USEC (NSEC * MSEC)

static long block_size;

char *
map_file(const char *file)
{
        int fd;
        char *addr;
        struct stat buf;

        if ((fd = open(file, O_RDWR)) == -1) {
                return (NULL);
        }

        if (fstat(fd, &buf) == -1) {
                close(fd);
                return (NULL);
        }

        block_size = buf.st_blksize;

        addr = mmap(0, buf.st_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return (addr == MAP_FAILED ? NULL : addr);
}
off64_t
read_whole(int fd, char *buf, int len)
{
        int count;
        int total = 0;

        while (total != len &&
                (count = read(fd, &buf[total], len - total)) > 0) {
                total+=count;
        }
        return (total);
}
static void
print_amount(char *str, off64_t value)
{
        if (value < KILO) {
                printf("%s %8lld ", str, value);
        } else if (value < MEG) {
                printf("%s %8lldK", str, value/(KILO));
        } else {
                printf("%s %8lldM", str, value/(MEG));
        }
}
int
main(int argc, char **argv)
{
        char *buf;
        off64_t offset = 0;
        off64_t update = 0;
        off64_t count;
        off64_t tcount = 0;
        char *addr;
        long bs;
        hrtime_t starttime;
        hrtime_t lasttime;

        if (argc == 1) {
                fprintf(stderr, "Usage: %s outfile\n", *argv);
                exit(1);
        }
        if ((addr = map_file(argv[1])) == NULL) {
                exit(1);
        }
        bs = block_size == 0 ? BLOCK_SIZE : block_size;
        if ((buf = malloc(block_size == 0 ? BLOCK_SIZE : block_size)) == NULL) {
                perror("malloc failed");
                exit(1);
        }

        print_amount("Block size:", bs);
        printf("\n");
        fflush(stdout);

        starttime = lasttime = gethrtime();
        while ((count = read_whole(0, buf, bs)) > 0) {
                hrtime_t thistime;
                if (memcmp(buf, addr+offset, count) != 0) {
                        memcpy(addr+offset, buf, count);
                        update+=count;
                }
                madvise(addr+offset, count, MADV_DONTNEED);
                offset+=count;
                madvise(addr+offset, bs, MADV_WILLNEED);
                thistime = gethrtime();
                /*
                 * Only update the output after a second so that it is readable.
                 */
                if (thistime - lasttime > USEC) {
                        print_amount("checked", offset);
                        printf(" %4d M/sec ", ((hrtime_t)tcount * USEC) /
                                (MEG * (thistime - lasttime)));
                        print_amount(" updated", update);
                        printf("\r");
                        fflush(stdout);
                        lasttime = thistime;
                        tcount = 0;
                } else {
                        tcount += count;
                }
        }
        printf("                                            \r");
        print_amount("Read: ", offset);
        printf(" %lld M/sec ", (offset * NSEC) /
                (MEG * ((gethrtime() - starttime)/MSEC)));
        print_amount("Updated:", update);
        printf("\n");
        /* If nothing is updated return false */
        exit(update == 0 ? 1 : 0);
}



Then a simple shell function to do the backup and then snapshot the file system:

function backuppc
{
	ssh -o Compression=no -c blowfish pc pfexec /usr/local/sbin/xp_backup | time ~/lang/c/syncer /tank/backup/pc/backup.dd && \
	pfexec /usr/sbin/zfs snapshot tank/backup/pc@$(date +%F)
}
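
The xp_backup script invoked on the remote side is not shown here; it is presumably just the mirror image of the xp_restore script shown in the “Recovering our Windows PC” entry above, reading the raw Windows partition rather than writing it. A guess at its contents, with the device name being specific to that machine:

#!/bin/ksh

exec dd if=/dev/rdsk/c0d0p1 bs=1k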

Running it I see that only 2.5G of data was actually written to disk, and yet thanks to ZFS I have a complete disk image and have not lost the previous disk images.


: pearson FSS 17 $; backuppc
665804+0 records in
665804+0 records out
Read:     20481M 9 M/sec Updated:     2584M 

real    35m50.00s
user    6m27.98s
sys     2m43.76s
: pearson FSS 18 $; 

Saturday Sep 13, 2008

Native CIFS and samba on the same system

I've been using samba at home for a while now but would like to migrate over to the new CIFS implementation provided by Solaris. Since there are some subtle differences in what each service provides* this means a slower migration.

Obviously you can't configure both services to run on the same system, so to get around this I am going to migrate all the SMB services into a zone running on the server and then allow the global zone to act as the native CIFS server.

So I configured a zone called, rather dully, “samba”, with loopback access to all the file systems that I share via SMB, and added the additional privilege “sys_smb” so that the daemons could bind to the SMB service port.

zonecfg:samba> set limitpriv=default,sys_smb
zonecfg:samba> end
The end command only makes sense in the resource scope.
zonecfg:samba> commit
zonecfg:samba> exit
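
The loopback file systems mentioned above are added in the same zonecfg session; each one is just an fs resource of type lofs along these lines, with a made-up path:

zonecfg:samba> add fs
zonecfg:samba:fs> set dir=/tank/media
zonecfg:samba:fs> set special=/tank/media
zonecfg:samba:fs> set type=lofs
zonecfg:samba:fs> end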

Now you can configure the zone in the usual way to run samba. I simply copied the smb.conf and smbpasswd files from the global zone using zcp.


Once that was done and samba enabled in SMF I could then enable the native CIFS server in the global zone and have the best of both worlds.
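
For the native service in the global zone both steps are one-liners; a sketch, with a made-up dataset and share name:

pfexec svcadm enable -r smb/server
pfexec zfs set sharesmb=name=media tank/media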



*) The principal difference I see is that the native SMB service does not cross file system mount points. So if you have a hierarchy of file systems you have to mount each one on the client. With samba you can just mount the root and it will see everything below.

Sunday Sep 07, 2008

Upgrading disks

Having run out of space in the root file systems and being close to full on the zpool, the final straw was being able to get two 750GB SATA drives for less than £100, that and knowing that snapshots no longer cause resilvering to restart, which greatly simplifies the data migration. So I'm replacing the existing drives with new ones. Since the enclosure I have can only hold three drives this involved a two-stage upgrade so that at no point was my data on fewer than two drives. The first stage was to install one drive and label it:

partition> print
Current partition table (unnamed):
Total disk cylinders available: 45597 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm   39383 - 41992       39.99GB    (2610/0/0)    83859300
  1 unassigned    wm       0                0         (0/0/0)              0
  2     backup    wu       0 - 45596      698.58GB    (45597/0/0) 1465031610
  3 unassigned    wm       0                0         (0/0/0)              0
  4 unassigned    wm   36773 - 39382       39.99GB    (2610/0/0)    83859300
  5 unassigned    wm   45594 - 45596       47.07MB    (3/0/0)          96390
  6 unassigned    wm   36379 - 36772        6.04GB    (394/0/0)     12659220
  7 unassigned    wm       3 - 36378      557.31GB    (36376/0/0) 1168760880
  8       boot    wu       0 -     0       15.69MB    (1/0/0)          32130
  9 alternates    wm       1 -     2       31.38MB    (2/0/0)          64260

partition>

These map to the partitions from the original set-up, only they are bigger. I'm confident that by the time the 40GB root partitions are too small I will have migrated to ZFS for root, so this looks like a good long-term solution.

pearson # dumpadm -d /dev/dsk/c2d0s6
      Dump content: kernel pages
       Dump device: /dev/dsk/c2d0s6 (dedicated)
Savecore directory: /var/crash/pearson
  Savecore enabled: yes
pearson # metadb -a -c 3 /dev/dsk/c2d0s5
pearson # egrep c2d0 /etc/lvm/md.tab
d12 1 1 /dev/dsk/c2d0s0
d42 1 1 /dev/dsk/c2d0s4
pearson # metainit d12
d12: Concat/Stripe is setup
pearson # metainit d42
d42: Concat/Stripe is setup
pearson # metattach d0 d12
d0: submirror d12 is attached
pearson # 

Now wait until the disk has completed resyncing. While you can do this in parallel, doing so causes the disk heads to move more so overall it is slower. Left to do just one partition at a time it is really quite quick:

                 extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
cmdk0   357.2    0.0 18321.8    0.0  2.6  1.1   10.4  52  58 
cmdk1     0.0  706.4    0.0 36147.4  1.0  0.5    2.2  23  27 
cmdk2   350.2    0.0 17929.6    0.0  0.4  0.3    2.1  12  15 
md1      70.0   71.0 35859.2 36371.5  0.0  1.0    7.1   0 100 
md3       0.0   71.0    0.0 36371.5  0.0  0.3    3.8   0  27 
md15     35.0    0.0 17929.6    0.0  0.0  0.6   16.5   0  58 
md18     35.0    0.0 17929.6    0.0  0.0  0.1    4.3   0  15 
pearson # metastat d0 
d0: Mirror
    Submirror 0: d10
      State: Okay         
    Submirror 1: d11
      State: Okay         
    Submirror 2: d12
      State: Resyncing    
    Resync in progress: 70 % done
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 20482875 blocks (9.8 GB)

d10: Submirror of d0
    State: Okay         
    Size: 20482875 blocks (9.8 GB)
    Stripe 0:
        Device   Start Block  Dbase        State Reloc Hot Spare
        c1d0s0          0     No            Okay   Yes 


d11: Submirror of d0
    State: Okay         
    Size: 20482875 blocks (9.8 GB)
    Stripe 0:
        Device   Start Block  Dbase        State Reloc Hot Spare
        c5d0s0          0     No            Okay   Yes 


d12: Submirror of d0
    State: Resyncing    
    Size: 83859300 blocks (39 GB)
    Stripe 0:
        Device   Start Block  Dbase        State Reloc Hot Spare
        c2d0s0          0     No            Okay   Yes 


Device Relocation Information:
Device   Reloc  Device ID
c1d0   Yes      id1,cmdk@AST3320620AS=____________3QF09GL1
c5d0   Yes      id1,cmdk@AST3320620AS=____________3QF0A1QD
c2d0   Yes      id1,cmdk@AST3750840AS=____________5QD36N5M
pearson # 

Once complete do the other root disk:

pearson # metattach d4 d42 
d4: submirror d42 is attached
pearson # 

Finally attach slice 7 to the zpool:

pearson # zpool attach -f tank c1d0s7 c2d0s7
pearson # zpool status
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.00% done, 252h52m to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0s7  ONLINE       0     0     0
            c5d0s7  ONLINE       0     0     0
            c2d0s7  ONLINE       0     0     0

errors: No known data errors
pearson # 

The initial estimate is more pessimistic than reality, but it still took over 11 hours to complete. The next thing was to shut the system down and replace one of the old drives with the second new one. Once this was done the final slices in use from the old drive could be detached and, in the case of the metadevices, cleared.
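
The detach and clear commands themselves are not shown above; for the record they are along these lines, using the submirror and slice names from my configuration, so treat them as illustrative:

pfexec zpool detach tank c1d0s7
pfexec metadetach d0 d10
pfexec metaclear d10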

: pearson FSS 4 $; zpool status
  pool: tank
 state: ONLINE
 scrub: scrub completed after 11h8m with 0 errors on Sat Sep  6 20:58:05 2008
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c5d0s7  ONLINE       0     0     0
            c2d0s7  ONLINE       0     0     0

errors: No known data errors
: pearson FSS 5 $; 
: pearson FSS 5 $; metastat
d6: Mirror
    Submirror 0: d62
      State: Okay         
    Submirror 1: d63
      State: Okay         
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 12659220 blocks (6.0 GB)

d62: Submirror of d6
    State: Okay         
    Size: 12659220 blocks (6.0 GB)
    Stripe 0:
        Device   Start Block  Dbase        State Reloc Hot Spare
        c5d0s6          0     No            Okay   Yes 


d63: Submirror of d6
    State: Okay         
    Size: 12659220 blocks (6.0 GB)
    Stripe 0:
        Device   Start Block  Dbase        State Reloc Hot Spare
        c2d0s6          0     No            Okay   Yes 


d4: Mirror
    Submirror 0: d42
      State: Okay         
    Submirror 1: d43
      State: Okay         
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 83859300 blocks (39 GB)

d42: Submirror of d4
    State: Okay         
    Size: 83859300 blocks (39 GB)
    Stripe 0:
        Device   Start Block  Dbase        State Reloc Hot Spare
        c5d0s4          0     No            Okay   Yes 


d43: Submirror of d4
    State: Okay         
    Size: 83859300 blocks (39 GB)
    Stripe 0:
        Device   Start Block  Dbase        State Reloc Hot Spare
        c2d0s4          0     No            Okay   Yes 


d0: Mirror
    Submirror 0: d12
      State: Okay         
    Submirror 1: d13
      State: Okay         
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 83859300 blocks (39 GB)

d12: Submirror of d0
    State: Okay         
    Size: 83859300 blocks (39 GB)
    Stripe 0:
        Device   Start Block  Dbase        State Reloc Hot Spare
        c5d0s0          0     No            Okay   Yes 


d13: Submirror of d0
    State: Okay         
    Size: 83859300 blocks (39 GB)
    Stripe 0:
        Device   Start Block  Dbase        State Reloc Hot Spare
        c2d0s0          0     No            Okay   Yes 


Device Relocation Information:
Device   Reloc  Device ID
c5d0   Yes      id1,cmdk@AST3750840AS=____________5QD36N5M
c2d0   Yes      id1,cmdk@AST3750840AS=____________5QD3EQEX
: pearson FSS 6 $; 

The old drive is still in the system but currently only has a metadb on it:

: pearson FSS 6 $; metadb -i
        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c1d0s5
     a    p  luo        8208            8192            /dev/dsk/c1d0s5
     a    p  luo        16400           8192            /dev/dsk/c1d0s5
     a    p  luo        16              8192            /dev/dsk/c5d0s5
     a    p  luo        8208            8192            /dev/dsk/c5d0s5
     a    p  luo        16400           8192            /dev/dsk/c5d0s5
     a       luo        16              8192            /dev/dsk/c2d0s5
     a       luo        8208            8192            /dev/dsk/c2d0s5
     a       luo        16400           8192            /dev/dsk/c2d0s5
 r - replica does not have device relocation information
 o - replica active prior to last mddb configuration change
 u - replica is up to date
 l - locator for this replica was read successfully
 c - replica's location was in /etc/lvm/mddb.cf
 p - replica's location was patched in kernel
 m - replica is master, this is replica selected as input
 t - tagged data is associated with the replica
 W - replica has device write errors
 a - replica is active, commits are occurring to this replica
 M - replica had problem with master blocks
 D - replica had problem with data blocks
 F - replica had format problems
 S - replica is too small to hold current data base
 R - replica had device read errors
 B - tagged data associated with the replica is not valid
: pearson FSS 7 $; 

I'm tempted to leave the third disk in the system so that the Disk Suite configuration will always have a quorum if a single drive fails. However, since the BIOS only seems to be able to boot from the first disk drive this may be pointless.

I'm now keenly interested in bug 6592835 “resilver needs to go faster” since, if a disk did fail, I don't fancy waiting more than 24 hours after I have sourced a new drive for the data to resync once the disks fill up. The Disk Suite devices managed to drive the disk at over 40Mb/sec while ZFS achieved 5Mb/sec.

Wednesday Jun 25, 2008

Why check the digest of files you copy and what to do when they don't match

I'm always copying data from home to work and, less often, from work to home. Mostly these are disk images. I always check the md5 sum just out of paranoia. It turns out you can't be paranoid enough! The thing to remember if the checksums don't match is not to copy the file again but to use rsync, which will bring over just the blocks that are corrupt.

: enoexec.eu FSS 43 $; scp thegerhards.com:/tank/tmp/diskimage.fat.bz2 .
diskimage.fat.bz2    100% |*****************************|  1825 MB 11:10:31    
: enoexec.eu FSS 44 $; digest -a md5 diskimage.fat.bz2    
674f69eec065da2b4d3da4bf45c7ae5f
: enoexec.eu FSS 45 $; ssh thegerhards.com digest -a md5 /tank/tmp/diskimage.fat.bz2
191f26762d5b48e0010a575b54746e80
: enoexec.eu FSS 46 $; ls -l diskimage.fat.bz2
-rw-r-----   1 cg13442  staff    1913779931 Jun 25 08:56 diskimage.fat.bz2
: enoexec.eu FSS 47 $; rsync thegerhards.com:/tank/tmp/diskimage.fat.bz2 diskimage.fat.bz2            
: enoexec.eu FSS 48 $; digest -a md5 diskimage.fat.bz2                        
191f26762d5b48e0010a575b54746e80
: enoexec.eu FSS 49 $; 

Since my home directory is now on ZFS and I snapshot every time my card gets inserted into the Sun Ray I can now take a look at what went wrong. Using my zfs_versions script I can get a list of the different versions of the file from all the snapshots:


: enoexec.eu FSS 56 $; digest -a md5 $( zfs_versions diskimage.fat.bz2 | nawk '{ print $NF }')
(/home/cg13442/.zfs/snapshot/user_snap_2008-06-25-05:51:57/diskimage.fat.bz2) = 0a193e0e80dbf83beabca12de09702a0
(/home/cg13442/.zfs/snapshot/user_snap_2008-06-25-05:54:44/diskimage.fat.bz2) = 7aa78dba6a7556fe10115aa5fc345bad
(/home/cg13442/.zfs/snapshot/user_snap_2008-06-25-07:05:34/diskimage.fat.bz2) = c6a77429920f258dfca1dbbd5018a69c
(/home/cg13442/.zfs/snapshot/user_snap_2008-06-25-09:06:39/diskimage.fat.bz2) = 674f69eec065da2b4d3da4bf45c7ae5f
(/home/cg13442/.zfs/snapshot/user_snap_2008-06-25-09:38:22/diskimage.fat.bz2) = 191f26762d5b48e0010a575b54746e80
: enoexec.eu FSS 57 $;


So the last two files in the list represent the corrupted file and the good file:

: enoexec.eu FSS 57 $; cmp -l /home/cg13442/.zfs/snapshot/user_snap_2008-06-2>
cmp -l /home/cg13442/.zfs/snapshot/user_snap_2008-06-25-09:06:39/diskimage.fat.bz2 /home/cg13442/.zfs/snapshot/user_snap_2008-06-25-09:38:22/diskimage.fat.bz2 | head -10                 
84262913   0 360
84262914   0  14
84262915   0 237
84262916   0  25
84262917   0 342
84262918   0 304
84262919   0  41
84262920   0  12
84262921   0 372
84262922   0  20
: enoexec.eu FSS 58 $;

and there appear to be blocks of zeros.

: enoexec.eu FSS 58 $; cmp -l /home/cg13442/.zfs/snapshot/user_snap_2008-06-2>
cmp -l /home/cg13442/.zfs/snapshot/user_snap_2008-06-25-09:06:39/diskimage.fat.bz2 /home/cg13442/.zfs/snapshot/user_snap_2008-06-25-09:38:22/diskimage.fat.bz2 | nawk '$2 != 0 { print $0 } $2 == 0 { count++ } END { printf("%x\n", count ) }'
23d8c
: enoexec.eu FSS 58 $;

or at least 0x23d8c (146,828) bytes were zero that should not have been. I need to see if I can reproduce this.

Anyway, the moral is: always check the md5 digest, and if it is wrong use rsync to correct it.
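I do this often enough that it is worth wrapping up. A small sketch of the check and repair, with the host and paths as placeholders; the -c flag makes rsync compare checksums rather than trusting size and timestamp, which is exactly what you want here:

#!/bin/ksh
# verify a copied file against the original and repair it with rsync if the digests differ
host=thegerhards.com                  # placeholder host
src=/tank/tmp/diskimage.fat.bz2       # placeholder remote path
dst=diskimage.fat.bz2                 # local copy

here=$(digest -a md5 "$dst")
there=$(ssh "$host" digest -a md5 "$src")
if [ "$here" != "$there" ]
then
        echo "digest mismatch: repairing $dst with rsync"
        rsync -c "$host:$src" "$dst"
fi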

Friday Jun 20, 2008

Pushing grub

After doing my second ZFS to ZFS live upgrade on a laptop I realise I will be starting to test grub's ability to handle lots of different boot targets in its boot menu:

[screenshot: the GRUB boot menu listing the growing collection of boot environments]
Already I can see that grub has a scrolling feature I had never seen before!
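For the curious, each boot environment gets an entry in the GRUB menu.lst that, from memory, looks roughly like the lines below; luactivate and bootadm generate these, so treat this as an illustration rather than something to type in by hand:

title zfs91
findroot (BE_zfs91,0,a)
kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS
module$ /platform/i86pc/$ISADIR/boot_archive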

Tuesday Jun 10, 2008

My first ZFS to ZFS live upgrade

My first live upgrade from ZFS to ZFS was as boring as you could wish for.


# luactivate zfs91
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE <zfs90>

Generating boot-sign for ABE <zfs91>
Generating partition and slice information for ABE <zfs91>
Boot menu exists.
Generating direct boot menu entries for PBE.
Generating xVM menu entries for PBE.
Generating direct boot menu entries for ABE.
Generating xVM menu entries for ABE.
No more bootadm entries. Deletion of bootadm entries is complete.
GRUB menu default setting is unaffected
Done eliding bootadm entries.

**********************************************************************

The target boot environment has been activated. It will be used when you 
reboot. NOTE: You MUST NOT USE the reboot, halt, or uadmin commands. You 
MUST USE either the init or the shutdown command when you reboot. If you 
do not use either init or shutdown, the system will not boot using the 
target BE.

**********************************************************************

In case of a failure while booting to the target BE, the following process 
needs to be followed to fallback to the currently working boot environment:

1. Boot from Solaris failsafe or boot in single user mode from the Solaris 
Install CD or Network.

2. Mount the Parent boot environment root slice to some directory (like 
/mnt). You can use the following command to mount:

     mount -Fzfs /dev/dsk/c0d0s0 /mnt

3. Run <luactivate> utility with out any arguments from the Parent boot 
environment root slice, as shown below:

     /mnt/sbin/luactivate

4. luactivate, activates the previous working boot environment and 
indicates the result.

5. Exit Single User mode and reboot the machine.

**********************************************************************

Modifying boot archive service
Activation of boot environment <zfs91> successful.
# 
init 6
#

See, all very dull. After it rebooted:

: pearson FSS 8 $; ssh sigma-wired
Last login: Tue Jun 10 12:51:59 2008 from pearson.thegerh
Sun Microsystems Inc.   SunOS 5.11      snv_91  January 2008
: sigma TS 1 $; su - kroot
Password: 
# lustatus
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
ufs90                      yes      no     no        yes    -         
zfs90                      yes      no     no        yes    -         
zfs91                      yes      yes    yes       no     -         
#

Although I'm not sure I like this:

# zfs list -r tank/ROOT
NAME                                  USED  AVAIL  REFER  MOUNTPOINT
tank/ROOT                            7.90G  8.81G    18K  /export/ROOT
tank/ROOT@zfs90                        17K      -    18K  -
tank/ROOT/zfs90                      4.94M  8.81G  5.37G  /.alt.tmp.b-uK.mnt/
tank/ROOT/zfs91-notyet               7.89G  8.81G  5.39G  /
tank/ROOT/zfs91-notyet@zfs90         70.5M      -  5.37G  -
tank/ROOT/zfs91-notyet@zfs91-notyet  63.7M      -  5.37G  -
# 

I have got used to renaming my existing BE to nvXX-notyet and then upgrading that. So with ZFS I created a BE called zfs91-notyet, upgraded that, and then renamed it back. It seems that renaming a BE does not rename the underlying filesystems. Easy to work around, but is it a bug?
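The workaround I have in mind, at least for a BE that is not currently booted, is simply to rename the dataset and its snapshot by hand and then make sure Live Upgrade's own ICF files under /etc/lu agree with the new name. A sketch only; I have not yet convinced myself it is entirely safe:

# rename the BE's dataset and its snapshot to match the new BE name
zfs rename tank/ROOT/zfs91-notyet tank/ROOT/zfs91
zfs rename tank/ROOT/zfs91@zfs91-notyet tank/ROOT/zfs91@zfs91
# then check the Live Upgrade records still point at the right dataset
grep zfs91 /etc/lu/ICF.*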

Tuesday May 27, 2008

Liveupgrade UFS -> ZFS

It took a bit of work but I managed to persuade my old laptop to live upgrade to Nevada build 90 with ZFS root. First I upgraded to build 90 on UFS and then created a BE on ZFS. The reason for the two-step approach was to reduce the risk a bit. Bear in mind this is all new in build 90 and I am not an expert on the inner workings of live upgrade, so there are no guarantees.
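The second step, creating the ZFS BE from the running UFS one, is just lucreate pointed at the pool; with the names I used it amounts to:

# create a new boot environment called zfs90 with its root in the pool tank
lucreate -n zfs90 -p tank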

The upgrade failed at the last minute with this error:

ERROR: File </boot/grub/menu.lst> not found in top level dataset for BE <zfs90>
ERROR: Failed to copy file </boot/grub/menu.lst> from top level dataset to BE <zfs90>
ERROR: Unable to delete GRUB menu entry for boot environment <zfs90>.
ERROR: Cannot make file systems for boot environment <zfs90>.

This bug has already been filed (6707013 “LU fail to migrate root file system from UFS to ZFS”).

However, lustatus said all was well, so I tried to activate it:

# lustatus
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
ufs90                      yes      yes    yes       no     -         
zfs90                      yes      no     no        yes    -         
# luactivate zfs90
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE <ufs90>
ERROR: No such file or directory: cannot stat </etc/lu/ICF.2>
ERROR: cannot use </etc/lu/ICF.2> as an icf file
ERROR: Unable to mount the boot environment <zfs90>.
#

No joy. Can I mount it?

# lumount -n zfs90
ERROR: No such file or directory: cannot open </etc/lu/ICF.2> mode <r>
ERROR: individual boot environment configuration file does not exist - the specified boot environment is not configured properly
ERROR: cannot access local configuration file for boot environment <zfs90>
ERROR: cannot determine file system configuration for boot environment <zfs90>
ERROR: No such file or directory: error unmounting <tank/ROOT/zfs90>
ERROR: cannot mount boot environment by name <zfs90>
# 

With nothing to lose I copied the ICF file for the UFS BE and edited it to look like what I suspected one for a ZFS BE would look like. I got lucky, as I was right!

# ls /etc/lu/ICF.1
/etc/lu/ICF.1
# cat  /etc/lu/ICF.1
ufs90:/:/dev/dsk/c0d0s7:ufs:19567170
# cp  /etc/lu/ICF.1  /etc/lu/ICF.2
# vi  /etc/lu/ICF.2
# cat /etc/lu/ICF.2

zfs90:/:tank/ROOT/zfs90:zfs:0
# lumount -n zfs90                
/.alt.zfs90
# df
/                  (/dev/dsk/c0d0s7   ): 1019832 blocks   740833 files
/devices           (/devices          ):       0 blocks        0 files
/dev               (/dev              ):       0 blocks        0 files
/system/contract   (ctfs              ):       0 blocks 2147483616 files
/proc              (proc              ):       0 blocks     9776 files
/etc/mnttab        (mnttab            ):       0 blocks        0 files
/etc/svc/volatile  (swap              ): 1099144 blocks   150523 files
/system/object     (objfs             ):       0 blocks 2147483395 files
/etc/dfs/sharetab  (sharefs           ):       0 blocks 2147483646 files
/dev/fd            (fd                ):       0 blocks        0 files
/tmp               (swap              ): 1099144 blocks   150523 files
/var/run           (swap              ): 1099144 blocks   150523 files
/tank              (tank              ):24284511 blocks 24284511 files
/tank/ROOT         (tank/ROOT         ):24284511 blocks 24284511 files
/lib/libc.so.1     (/usr/lib/libc/libc_hwcap1.so.1): 1019832 blocks   740833 files
/.alt.zfs90        (tank/ROOT/zfs90   ):24284511 blocks 24284511 files
/.alt.zfs90/var/run(swap              ): 1099144 blocks   150523 files
/.alt.zfs90/tmp    (swap              ): 1099144 blocks   150523 files
# luumount zfs90
# luactivate zfs90
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE <ufs90>
diff: /.alt.tmp.b-svc.mnt/etc/lu/synclist: No such file or directory

Generating boot-sign for ABE <zfs90>
ERROR: File </etc/bootsign> not found in top level dataset for BE <zfs90>
Generating partition and slice information for ABE <zfs90>
Boot menu exists.
Generating direct boot menu entries for PBE.
Generating xVM menu entries for PBE.
Generating direct boot menu entries for ABE.
Generating xVM menu entries for ABE.
No more bootadm entries. Deletion of bootadm entries is complete.
GRUB menu default setting is unaffected
Done eliding bootadm entries.

**********************************************************************

The target boot environment has been activated. It will be used when you 
reboot. NOTE: You MUST NOT USE the reboot, halt, or uadmin commands. You 
MUST USE either the init or the shutdown command when you reboot. If you 
do not use either init or shutdown, the system will not boot using the 
target BE.

**********************************************************************

In case of a failure while booting to the target BE, the following process 
needs to be followed to fallback to the currently working boot environment:

1. Boot from Solaris failsafe or boot in single user mode from the Solaris 
Install CD or Network.

2. Mount the Parent boot environment root slice to some directory (like 
/mnt). You can use the following command to mount:

     mount -Fufs /dev/dsk/c0d0s7 /mnt

3. Run <luactivate> utility with out any arguments from the Parent boot 
environment root slice, as shown below:

     /mnt/sbin/luactivate

4. luactivate, activates the previous working boot environment and 
indicates the result.

5. Exit Single User mode and reboot the machine.

**********************************************************************

Modifying boot archive service
Activation of boot environment <zfs90> successful.
#

Fixing the boot sign:

# file /etc/bootsign
/etc/bootsign:  ascii text
# cat  /etc/bootsign
BE_ufs86
BE_ufs90
# vi  /etc/bootsign
# lumount -n zfs90 /a
/a
# cat /a/etc/bootsign
cat: cannot open /a/etc/bootsign: No such file or directory
# cat /a/etc/bootsign
cat: cannot open /a/etc/bootsign: No such file or directory
# cp /etc/bootsign /a/etc
# vi  /a/etc/bootsign 
# cat /a/etc/bootsign
BE_zfs90
# 
# luumount /a
# luactivate ufs90
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE <ufs90>
Activating the current boot environment <ufs90> for next reboot.
The current boot environment <ufs90> has been activated for the next reboot.
# luactivate zfs90
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE <ufs90>
diff: /.alt.tmp.b-hNc.mnt/etc/lu/synclist: No such file or directory

Generating boot-sign for ABE <zfs90>
Generating partition and slice information for ABE <zfs90>
Boot menu exists.
Generating direct boot menu entries for PBE.
Generating xVM menu entries for PBE.
Generating direct boot menu entries for ABE.
Generating xVM menu entries for ABE.
No more bootadm entries. Deletion of bootadm entries is complete.
GRUB menu default setting is unaffected
Done eliding bootadm entries.

**********************************************************************

The target boot environment has been activated. It will be used when you 
reboot. NOTE: You MUST NOT USE the reboot, halt, or uadmin commands. You 
MUST USE either the init or the shutdown command when you reboot. If you 
do not use either init or shutdown, the system will not boot using the 
target BE.

**********************************************************************

In case of a failure while booting to the target BE, the following process 
needs to be followed to fallback to the currently working boot environment:

1. Boot from Solaris failsafe or boot in single user mode from the Solaris 
Install CD or Network.

2. Mount the Parent boot environment root slice to some directory (like 
/mnt). You can use the following command to mount:

     mount -Fufs /dev/dsk/c0d0s7 /mnt

3. Run <luactivate> utility with out any arguments from the Parent boot 
environment root slice, as shown below:

     /mnt/sbin/luactivate

4. luactivate, activates the previous working boot environment and 
indicates the result.

5. Exit Single User mode and reboot the machine.

**********************************************************************

Modifying boot archive service
Activation of boot environment <zfs90> successful.
# lumount -n zfs90 /a
/a
# cat /a/etc/bootsign
BE_zfs90
# luumount /a
# init 6

The system now booted off the ZFS pool. Once up I just had to see if I could create a second ZFS BE as a clone of the first and, if so, how fast it was.

# df /
/                  (tank/ROOT/zfs90   ):23834562 blocks 23834562 files

# lustatus
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
ufs90                      yes      no     no        yes    -         
zfs90                      yes      yes    yes       no     -         
# time lucreate -p tank -n zfs90.2
Checking GRUB menu...
System has findroot enabled GRUB
Analyzing system configuration.
Comparing source boot environment <zfs90> file systems with the file 
system(s) you specified for the new boot environment. Determining which 
file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Updating system configuration files.
Creating configuration for boot environment <zfs90.2>.
Source boot environment is <zfs90>.
Creating boot environment <zfs90.2>.
Cloning file systems from boot environment <zfs90> to create boot environment <zfs90.2>.
Creating snapshot for <tank/ROOT/zfs90> on <tank/ROOT/zfs90@zfs90.2>.
Creating clone for <tank/ROOT/zfs90@zfs90.2> on <tank/ROOT/zfs90.2>.
Setting canmount=noauto for </> in zone <global> on <tank/ROOT/zfs90.2>.
No entry for BE <zfs90.2> in GRUB menu
Population of boot environment <zfs90.2> successful.
Creation of boot environment <zfs90.2> successful.

real    0m38.40s
user    0m6.89s
sys     0m11.59s
# 

38 seconds to create a BE, something that would take over an hour with UFS.

I'm not brave (or foolish) enough to do the home server yet, so that is still on nv90 with UFS. When the bug is fixed I'll give it a go.

About

This is the old blog of Chris Gerhard. It has mostly moved to http://chrisgerhard.wordpress.com
