Friday Oct 09, 2009

Preparing for OpenSolaris @ home

Since the "nevada" builds of Solaris next are due to end soon and for some time the upgrade of my home server has involved more than a little bit of TLC to get it to work I will be moving to an OpenSolaris build just as soon as I can.

However, before I can do this I need to make sure I have all the software to provide home service. This is really a note to myself so I don't forget anything.

  • Exim Mail Transfer Agent (MTA). Since I use certain encryption routines, virus detection and spamassassin, I was unable to use the standard MTA, sendmail, when the system was originally built, and have been using exim from Blastwave. I hope to build and use exim without all the cruft that comes with the Blastwave package. So far this looks like it will be simple, as OpenSolaris now has OpenSSL.

  • An imapd. Currently I have a Blastwave version, but again I intend to build this from scratch; the addition of OpenSSL and libcrypto should make this easy.

  • Clamav. To protect any Windows systems, and generally to avoid passing viruses on to others, clamav has been scanning all incoming email. Again I will build this from scratch, as I already do.

  • Spamassassin. Again, I already build this for nevada, so building it for OpenSolaris will be easy.

  • Ddclient. Having dynamic DNS allows me to log in remotely and read email.

  • Squeezecenter. This is a big issue, and in the past it has proved hard to build thanks to all the perl dependencies. It is for that reason I will continue to run it in a zone, so that I don't have to trash the main system. Clearly, with all my digital music loaded into the Squeezecenter software, this has to work.

I'm going to see if I can jump through the legal hoops that will allow me to contribute the builds to the contrib repository via Source Juicer. However, as this is done in my spare time, I don't know whether the legal reviews will be funded.

Due to the way OpenSolaris is delivered I also need to be more careful about what I install, rather than being able to choose everything. First I need my list from my laptop. Then in addition to that I'll need:

  • Samba - pkg:/SUNWsmba

  • cups - pkg:/SUNWcups

  • OpenSSL - pkg:/SUNWopenssl

Oh and I'll need the Sun Ray server software.
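Assuming the package names above are right (they are from my own notes, not verified against the repository, so check them before trusting them), pulling the extras in should just be a few pkg invocations:

pfexec pkg install SUNWsmba      # Samba
pfexec pkg install SUNWcups      # CUPS printing
pfexec pkg install SUNWopenssl   # OpenSSL, needed for the exim and imapd builds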

Tuesday Jan 27, 2009

New version of scsi.d required for build 106

This version supports some more filters. Specifically you can now specify these new options:

  • MIN_BLOCK only report on IO to blocks greater than or equal to this value.

  • MAX_BLOCK only report on IO to blocks less than or equal to this value.

This is most useful for limiting your trace to particular block ranges, be they a file system or, as in the case that prompted me to add this, the disk label, to see who was trampling on it.
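If you know the block range a file system occupies you can bracket it with both defines; a sketch, assuming scsi.d is happy with the two -D options on the same command line (the block numbers here are made up for illustration):

pfexec /usr/sbin/dtrace -Cs scsi.d -D MIN_BLOCK=0x10000 -D MAX_BLOCK=0x20000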

In this contrived example it was format:

pfexec /usr/sbin/dtrace -o /tmp/dt.$$ -Cs  scsi.d -D MAX_BLOCK=3 
<SNIP>
00058.529467684 glm0:-> 0x0a  WRITE(6) address 01:00, lba 0x000000, len 0x000001
, control 0x00 timeout 60 CDBP 60012635560 1 format(3985) cdb(6) 0a0000000100
00058.542945891 glm0:<- 0x0a  WRITE(6) address 01:00, lba 0x000000, len 0x000001
, control 0x00 timeout 60 CDBP 60012635560, reason 0x0 (COMPLETED) pkt_state 0x1
f state 0x0 Success Time 13604us

While this answered my question, there are neater ways of answering it just by using the io provider:

: s4u-nv-gmp03.eu TS 68 $; pfexec /usr/sbin/dtrace -n 'io:::start / args[0]->b_blkno < 3 && args[0]->b_flags & B_WRITE / { printf("%s %s %d %d", execname, args[1]->dev_statname, args[0]->b_blkno, args[0]->b_bcount) }'
dtrace: description 'io:::start ' matched 6 probes
CPU     ID                    FUNCTION:NAME
  0    629             default_physio:start format sd0 0 512
  0    629             default_physio:start format sd0 0 512
  0    629             default_physio:start format sd0 0 512
  0    629             default_physio:start format sd0 0 512
  0    629             default_physio:start format sd0 0 512
  0    629             default_physio:start format sd0 0 512

Also, build 106 of nevada has changed the structure definition for scsi_address, and in doing so breaks scsi.d, which has intimate knowledge of scsi_address structures. I have a solution that you can download, but in writing it I also filed this bug:

679803 dtrace suffers from macro recursion when including scsi_address.h

which scsi.d has to work around. When that bug is resolved the workaround may have to be revisited.

All versions of scsi.d are available here, and this specific version, version 1.16, here.

Thank you to Artem Kachitchkine for bringing the changes to scsi_address.h and their effects on scsi.d to my attention.

Thursday Aug 28, 2008

It scrubbed up good

Since the home server has been snapping regularly I have had to choose between snapshots and scrubbing, and I chose snapshots. User error is more likely than hardware failure, and scrubbing is really about seeing errors sooner so that you don't get an unrecoverable failure due to having two problems at once. However, I would rather not have to choose.

So I was particularly pleased to see that build 94 contains the fix for this bug:

6343667 scrub/resilver has to start over when a snapshot is taken
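With that fix in, a regular scrub can go back into root's crontab alongside the snapshots; a minimal sketch, the schedule being just an example:

# crontab entry: scrub the pool every Sunday at 03:00
0 3 * * 0 /usr/sbin/zpool scrub tank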

So today the home server had its first scrub in years, and it scrubbed up well:

: pearson FSS 5 $; pfexec zpool status
  pool: tank
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: scrub completed after 12h42m with 0 errors on Thu Aug 28 20:12:36 2008
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0s7  ONLINE       0     0     0
            c5d0s7  ONLINE       0     0     0

errors: No known data errors
: pearson FSS 6 $; 

When I upgrade the pool, once the other live upgrade boot environment can support the new pool version, there is the promise of an even faster scrub. Not that this one was a problem: it completed during the day, while I was also backing up the pool using zfs_backup.
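For reference, the upgrade itself, when I get to it, should be no more than this (run only once no boot environment needs the old pool version):

zpool upgrade -v            # show the pool versions this build supports
pfexec zpool upgrade tank   # move the pool to the newest supported version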

Tuesday Jun 24, 2008

Good Morning Build 92

Our Sun Ray upgrade strategy hiccupped and now has both SPARC systems running the same release, which, when that release is nv92, is nothing to complain about:

: enoexec.eu FSS 1 $; uname -a
SunOS enoexec 5.11 snv_92 sun4v sparc SUNW,SPARC-Enterprise-T5220
: enoexec.eu FSS 2 $;

The reason for the hiccup was twofold. Once we had established that the T5220 was a perfect Sun Ray server, and had managed to find and file some really nasty bugs precisely because we were using it, a far-sighted director agreed to fund one long term. This left the old Sun Fire system looking like a very large bit of tin, burning lots of power and taking up a lot of space while providing only one Sun Ray server. So that has now been replaced with a V890:

: estale.eu FSS 1 $; uname -a
SunOS estale 5.11 snv_92 sun4u sparc SUNW,Sun-Fire-V890
: estale.eu FSS 2 $; 

Since this was fresh hardware it was freshly installed and was to be the build 92 server, while the T5220 served build 91 and would at some point serve build 94. That was until we diagnosed that we were hitting a bug on the T5220 which made it stall, sometimes for minutes, and which is fixed in build 92; so we have both systems running build 92.

Saturday Jun 21, 2008

Return of automatic status setting in IM

At last the rest of the bits of “gaim” that disappeared from Solaris when it moved to be “pidgin” have returned in Nevada build 92. I'm talking about “purple-remote” which is the program that replaces “gaim-remote” and thus allows me once again to set my away message using “utaction” so when I disconnect from my Sun Ray session my IM status is automatically set as well.

If you take the script that I wrote last time and do a global edit changing “gaim-remote” to “purple-remote” it will work. Something I realise now, but did not then, is that you only need one utaction command to handle both connection and disconnection, so this will do it:

utaction -d "purple-remote 'setstatus?status=away&message=Away from Sun Ray'" -c "${HOME}/bin/sh/ut-where"
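For completeness, the -c side only has to put the status back; a hypothetical reconstruction, since I have not reproduced the real ut-where script here:

#!/bin/sh
# ut-where (hypothetical): run by utaction on Sun Ray connect; clear the away status
purple-remote 'setstatus?status=available'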


Friday Jun 20, 2008

Pushing grub

After doing my second ZFS to ZFS live upgrade on a laptop I realise I will be starting to test grub's ability to handle lots of different boot targets in its boot menu:

[Screenshot: the GRUB boot menu, now listing a growing number of boot environments.]

Already I can see that grub has a scrolling feature I had never seen before!

Wednesday Jun 18, 2008

Loading the kernel takes a lot longer

An interesting change when installing snv_91 over the net compared with earlier releases: there is a very considerable delay with the “spinning bar” running here:


Sun Fire V440, No Keyboard
Copyright 1998-2004 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.16.1, 16384 MB memory installed, Serial #55495765.
Ethernet address 0:3:ba:4e:cc:55, Host ID: 834ecc55.



Rebooting with command: boot net                                      
Boot device: /pci@1c,600000/network@2  File and args: 
/pci@1c,600000/network@2: 100 Mbps full duplex link up
Timeout waiting for ARP/RARP packet
4000 /pci@1c,600000/network@2: 100 Mbps full duplex link up
/

Previously you would get the SunOS version message quite quickly.


SunOS Release 5.11 Version snv_91 64-bit

Now it takes many minutes, more than five with the 100Mbps link, to load over NFS. So be patient!

Tuesday Jun 10, 2008

My first ZFS to ZFS live upgrade

My first live upgrade from ZFS to ZFS was as boring as you could wish for.


# luactivate zfs91
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE <zfs90>

Generating boot-sign for ABE <zfs91>
Generating partition and slice information for ABE <zfs91>
Boot menu exists.
Generating direct boot menu entries for PBE.
Generating xVM menu entries for PBE.
Generating direct boot menu entries for ABE.
Generating xVM menu entries for ABE.
No more bootadm entries. Deletion of bootadm entries is complete.
GRUB menu default setting is unaffected
Done eliding bootadm entries.

**********************************************************************

The target boot environment has been activated. It will be used when you 
reboot. NOTE: You MUST NOT USE the reboot, halt, or uadmin commands. You 
MUST USE either the init or the shutdown command when you reboot. If you 
do not use either init or shutdown, the system will not boot using the 
target BE.

**********************************************************************

In case of a failure while booting to the target BE, the following process 
needs to be followed to fallback to the currently working boot environment:

1. Boot from Solaris failsafe or boot in single user mode from the Solaris 
Install CD or Network.

2. Mount the Parent boot environment root slice to some directory (like 
/mnt). You can use the following command to mount:

     mount -Fzfs /dev/dsk/c0d0s0 /mnt

3. Run <luactivate> utility with out any arguments from the Parent boot 
environment root slice, as shown below:

     /mnt/sbin/luactivate

4. luactivate, activates the previous working boot environment and 
indicates the result.

5. Exit Single User mode and reboot the machine.

**********************************************************************

Modifying boot archive service
Activation of boot environment <zfs91> successful.
# 
init 6
#

See, all very dull. After it rebooted:

: pearson FSS 8 $; ssh sigma-wired
Last login: Tue Jun 10 12:51:59 2008 from pearson.thegerh
Sun Microsystems Inc.   SunOS 5.11      snv_91  January 2008
: sigma TS 1 $; su - kroot
Password: 
# lustatus
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
ufs90                      yes      no     no        yes    -         
zfs90                      yes      no     no        yes    -         
zfs91                      yes      yes    yes       no     -         
#

Although I'm not sure I like this:

# zfs list -r tank/ROOT
NAME                                  USED  AVAIL  REFER  MOUNTPOINT
tank/ROOT                            7.90G  8.81G    18K  /export/ROOT
tank/ROOT@zfs90                        17K      -    18K  -
tank/ROOT/zfs90                      4.94M  8.81G  5.37G  /.alt.tmp.b-uK.mnt/
tank/ROOT/zfs91-notyet               7.89G  8.81G  5.39G  /
tank/ROOT/zfs91-notyet@zfs90         70.5M      -  5.37G  -
tank/ROOT/zfs91-notyet@zfs91-notyet  63.7M      -  5.37G  -
# 

I have got used to renaming my existing BE to nvXX-notyet and then upgrading that. So with ZFS I created a BE called zfs91-notyet, upgraded it and then renamed it back. It seems that renaming a BE does not rename the underlying file systems. Easy to work around, but is it a bug?
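The workaround is just a rename of the datasets themselves; a sketch, assuming you do it while booted from another BE rather than renaming the root dataset of the one you are running (whether live upgrade then keeps its own records straight is exactly the open question):

pfexec zfs rename tank/ROOT/zfs91-notyet tank/ROOT/zfs91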

Tuesday May 27, 2008

Liveupgrade UFS -> ZFS

It took a bit of work, but I managed to persuade my old laptop to live upgrade to nevada build 90 with ZFS root. First I upgraded to build 90 on UFS and then created a BE on ZFS. The reason for the two-step approach was to reduce the risk a bit. Bear in mind this is all new in build 90 and I am not an expert on the inner workings of live upgrade, so there are no guarantees.

The upgrade failed at the last minute with this error:

ERROR: File </boot/grub/menu.lst> not found in top level dataset for BE <zfs90>
ERROR: Failed to copy file </boot/grub/menu.lst> from top level dataset to BE <zfs90>
ERROR: Unable to delete GRUB menu entry for boot environment <zfs90>.
ERROR: Cannot make file systems for boot environment <zfs90>.

This bug has already been filed (6707013 LU fail to migrate root file system from UFS to ZFS).

However, lustatus said all was well, so I tried to activate it:

# lustatus
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
ufs90                      yes      yes    yes       no     -         
zfs90                      yes      no     no        yes    -         
# luactivate zfs90
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE <ufs90>
ERROR: No such file or directory: cannot stat </etc/lu/ICF.2>
ERROR: cannot use </etc/lu/ICF.2> as an icf file
ERROR: Unable to mount the boot environment <zfs90>.
#

No joy. Can I mount it?

# lumount -n zfs90
ERROR: No such file or directory: cannot open </etc/lu/ICF.2> mode <r>
ERROR: individual boot environment configuration file does not exist - the specified boot environment is not configured properly
ERROR: cannot access local configuration file for boot environment <zfs90>
ERROR: cannot determine file system configuration for boot environment <zfs90>
ERROR: No such file or directory: error unmounting <tank/ROOT/zfs90>
ERROR: cannot mount boot environment by name <zfs90>
# 

With nothing to lose I copied the ICF file for the UFS BE and edited it to look like what I suspected one for a ZFS BE would look like. I got lucky, as I was right!

# ls /etc/lu/ICF.1
/etc/lu/ICF.1
# cat  /etc/lu/ICF.1
ufs90:/:/dev/dsk/c0d0s7:ufs:19567170
# cp  /etc/lu/ICF.1  /etc/lu/ICF.2
# vi  /etc/lu/ICF.2
# cat /etc/lu/ICF.2
zfs90:/:tank/ROOT/zfs90:zfs:0
# lumount -n zfs90                
/.alt.zfs90
# df
/                  (/dev/dsk/c0d0s7   ): 1019832 blocks   740833 files
/devices           (/devices          ):       0 blocks        0 files
/dev               (/dev              ):       0 blocks        0 files
/system/contract   (ctfs              ):       0 blocks 2147483616 files
/proc              (proc              ):       0 blocks     9776 files
/etc/mnttab        (mnttab            ):       0 blocks        0 files
/etc/svc/volatile  (swap              ): 1099144 blocks   150523 files
/system/object     (objfs             ):       0 blocks 2147483395 files
/etc/dfs/sharetab  (sharefs           ):       0 blocks 2147483646 files
/dev/fd            (fd                ):       0 blocks        0 files
/tmp               (swap              ): 1099144 blocks   150523 files
/var/run           (swap              ): 1099144 blocks   150523 files
/tank              (tank              ):24284511 blocks 24284511 files
/tank/ROOT         (tank/ROOT         ):24284511 blocks 24284511 files
/lib/libc.so.1     (/usr/lib/libc/libc_hwcap1.so.1): 1019832 blocks   740833 files
/.alt.zfs90        (tank/ROOT/zfs90   ):24284511 blocks 24284511 files
/.alt.zfs90/var/run(swap              ): 1099144 blocks   150523 files
/.alt.zfs90/tmp    (swap              ): 1099144 blocks   150523 files
# luumount zfs90
# luactivate zfs90
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE <ufs90>
diff: /.alt.tmp.b-svc.mnt/etc/lu/synclist: No such file or directory

Generating boot-sign for ABE <zfs90>
ERROR: File </etc/bootsign> not found in top level dataset for BE <zfs90>
Generating partition and slice information for ABE <zfs90>
Boot menu exists.
Generating direct boot menu entries for PBE.
Generating xVM menu entries for PBE.
Generating direct boot menu entries for ABE.
Generating xVM menu entries for ABE.
No more bootadm entries. Deletion of bootadm entries is complete.
GRUB menu default setting is unaffected
Done eliding bootadm entries.

**********************************************************************

The target boot environment has been activated. It will be used when you 
reboot. NOTE: You MUST NOT USE the reboot, halt, or uadmin commands. You 
MUST USE either the init or the shutdown command when you reboot. If you 
do not use either init or shutdown, the system will not boot using the 
target BE.

**********************************************************************

In case of a failure while booting to the target BE, the following process 
needs to be followed to fallback to the currently working boot environment:

1. Boot from Solaris failsafe or boot in single user mode from the Solaris 
Install CD or Network.

2. Mount the Parent boot environment root slice to some directory (like 
/mnt). You can use the following command to mount:

     mount -Fufs /dev/dsk/c0d0s7 /mnt

3. Run <luactivate> utility with out any arguments from the Parent boot 
environment root slice, as shown below:

     /mnt/sbin/luactivate

4. luactivate, activates the previous working boot environment and 
indicates the result.

5. Exit Single User mode and reboot the machine.

**********************************************************************

Modifying boot archive service
Activation of boot environment <zfs90> successful.
#

Fixing boot sign

# file /etc/bootsign
/etc/bootsign:  ascii text
# cat  /etc/bootsign
BE_ufs86
BE_ufs90
# vi  /etc/bootsign
# lumount -n zfs90 /a
/a
# cat /a/etc/bootsign
cat: cannot open /a/etc/bootsign: No such file or directory
# cat /a/etc/bootsign
cat: cannot open /a/etc/bootsign: No such file or directory
# cp /etc/bootsign /a/etc
# vi  /a/etc/bootsign 
# cat /a/etc/bootsign
BE_zfs90
# 
# luumount /a
# luactivate ufs90
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE <ufs90>
Activating the current boot environment <ufs90> for next reboot.
The current boot environment <ufs90> has been activated for the next reboot.
# luactivate zfs90
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE <ufs90>
diff: /.alt.tmp.b-hNc.mnt/etc/lu/synclist: No such file or directory

Generating boot-sign for ABE <zfs90>
Generating partition and slice information for ABE <zfs90>
Boot menu exists.
Generating direct boot menu entries for PBE.
Generating xVM menu entries for PBE.
Generating direct boot menu entries for ABE.
Generating xVM menu entries for ABE.
No more bootadm entries. Deletion of bootadm entries is complete.
GRUB menu default setting is unaffected
Done eliding bootadm entries.

**********************************************************************

The target boot environment has been activated. It will be used when you 
reboot. NOTE: You MUST NOT USE the reboot, halt, or uadmin commands. You 
MUST USE either the init or the shutdown command when you reboot. If you 
do not use either init or shutdown, the system will not boot using the 
target BE.

**********************************************************************

In case of a failure while booting to the target BE, the following process 
needs to be followed to fallback to the currently working boot environment:

1. Boot from Solaris failsafe or boot in single user mode from the Solaris 
Install CD or Network.

2. Mount the Parent boot environment root slice to some directory (like 
/mnt). You can use the following command to mount:

     mount -Fufs /dev/dsk/c0d0s7 /mnt

3. Run <luactivate> utility with out any arguments from the Parent boot 
environment root slice, as shown below:

     /mnt/sbin/luactivate

4. luactivate, activates the previous working boot environment and 
indicates the result.

5. Exit Single User mode and reboot the machine.

**********************************************************************

Modifying boot archive service
Activation of boot environment <zfs90> successful.
# lumount -n zfs90 /a
/a
# cat /a/etc/bootsign
BE_zfs90
# luumount /a
# init 6

The system now booted off the ZFS pool. Once up, I just had to see if I could create a second ZFS BE as a clone of the first and, if so, how fast this was.

# df /
/                  (tank/ROOT/zfs90   ):23834562 blocks 23834562 files

# lustatus
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
ufs90                      yes      no     no        yes    -         
zfs90                      yes      yes    yes       no     -         
# time lucreate -p tank -n zfs90.2
Checking GRUB menu...
System has findroot enabled GRUB
Analyzing system configuration.
Comparing source boot environment <zfs90> file systems with the file 
system(s) you specified for the new boot environment. Determining which 
file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Updating system configuration files.
Creating configuration for boot environment <zfs90.2>.
Source boot environment is <zfs90>.
Creating boot environment <zfs90.2>.
Cloning file systems from boot environment <zfs90> to create boot environment <zfs90.2>.
Creating snapshot for <tank/ROOT/zfs90> on <tank/ROOT/zfs90@zfs90.2>.
Creating clone for <tank/ROOT/zfs90@zfs90.2> on <tank/ROOT/zfs90.2>.
Setting canmount=noauto for </> in zone <global> on <tank/ROOT/zfs90.2>.
No entry for BE <zfs90.2> in GRUB menu
Population of boot environment <zfs90.2> successful.
Creation of boot environment <zfs90.2> successful.

real    0m38.40s
user    0m6.89s
sys     0m11.59s
# 

38 seconds to create a BE, something that would take over an hour with UFS.

I'm not brave enough to do the home server yet, so that is on nv90 with UFS. When the bug is fixed I'll give it a go.

Friday Apr 25, 2008

Why I run the latest Solaris build on my laptop

While on my holidays I was blissfully disconnected from the internet, but still had to have my laptop on hand to empty my camera. This allowed me to trip over a bug, 6691387, which I filed on my return and which is already destined to be fixed in build 89. Result.

Tuesday Apr 22, 2008

Good Morning Build 87

On arriving back from my well-earned holiday (yes, I had a very relaxing time, thank you) I see that not only is there a nice new Solaris release on our server, but the server itself is no longer a power-hungry Sun Fire box but an eco-friendly Niagara-based T5220, and today is Earth Day. It's almost like we planned it!


: enoexec.eu FSS 1 $; uname -a
SunOS enoexec 5.11 snv_87 sun4v sparc SUNW,SPARC-Enterprise-T5220
: enoexec.eu FSS 2 $; 


We have actually been here before with the Huron server, as we tried one out for a few weeks around the time of build 83. I left the blogging of it to a colleague, as it was not 100% successful, or was 100% successful, depending on your point of view. The system would crash regularly, and finally a bug was filed and rapidly fixed (unfortunately it is not published, as it contains the way to panic an unpatched system running nevada and so is marked as a “security” issue, which I think is a bit harsh). So if you are using a system to shake out bugs, which is why I use a nevada Sun Ray server at work, then this was 100% successful.

The performance of the system is easily good enough. That is to say, the only clue I had that something was “up” was that the performance meter I run (I know I should not, but I do; get over it) looks “inverted”. We are not making a dent in the threads.

Friday Apr 11, 2008

We can be quick

Yesterday I managed to get the fix for 6686086 put back into the nevada gate and, as the bug reflects, the fix will be in build 88. It shows that it is possible to get things done quickly if there is the will: 24 hours from filing the bug to the fix being delivered. Yes, the fix is trivial, but the “paper work” still has to be done.

Thank you to the reviewers and Evaluator for being prompt.

Sunday Apr 06, 2008

What does the home server do?

I was recently asked what the home server serves. So here is the list:

  1. NAS server. NFS and CIFS (via Samba). There is a single Windows system in the house, which is increasingly rarely switched on, and NFS for the two laptops that frequent the network. All supported via ZFS on two 400GB drives with literally thousands of snapshots, 44170 at the last count (see the one-liner after this list). Space is beginning to get short thanks to the 10-megapixel SLR camera, so in the not too distant future a disk upgrade will be required.

  2. Sun Ray server. There are (currently) three Sun Rays. One acts as a photo frame and has no keyboard or mouse. The other two provide real interactive use. I can foresee a situation where we have two more Sun Rays.

  3. Email server. SMTP and IMAP via exim and imapd respectively. Clearly this implies spamassassin and an antivirus scanner, clamAV.

  4. SlimServer. I've just run up a SlimServer to get better access to internet radio stations. Having a radio player that I can hook up to the hi-fi that is not DAB, i.e. crap1, would be good. I feel a Squeezebox coming soon.
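The snapshot count in item 1, by the way, comes from nothing cleverer than this one-liner:

zfs list -H -t snapshot | wc -l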

Prior to the CPU upgrade the system would struggle to cope just occasionally, and every time I ran up VirtualBox, even when using the Fair Share Scheduler. Since the upgrade it has not had any problems with having us all using it.




1It is nice to see that I am not alone in realising DAB is crap.

Monday Mar 03, 2008

Good Morning Build 84

Always nice to “arrive” at work and find your server running a shiny new build:

: enoexec.eu FSS 1 $; uname -a
SunOS enoexec 5.11 snv_84 sun4u sparc SUNW,Sun-Fire
: enoexec.eu FSS 2 $; 

Even as I type this I notice that there are still issues in the world of gnome, with gnome-panel dumping core:

Mar  3 08:27:16 enoexec genunix: [ID 603404 kern.notice] NOTICE: core_log: gnome-panel[11701] core dumped: /var/cores/usr/bin/gnome-panel.global.14442
: enoexec.eu FSS 7 $; pstack core-enoexec-gnome-panel.11701
core 'core-enoexec-gnome-panel.11701' of 11701:	gnome-panel --sm-config-prefix /gnome-panel-GZaWDp/ --sm-client-id 118
 00000000 ???????? (847f98, 3, a87e20, 0, a48cd8, 0)
 bb7a2ac4 invoke_notifies (a7a0c0, 3, 8ce808, 0, 148a0, bb7b7300) + 68
 bb7a2b50 emit_events_in_idle (a40f18, bb7b7ac0, bb7b7300, 7c0, 14814, 400) + 68
 c3dc0010 g_main_dispatch (127f00, 0, 0, 0, c3e63eb4, 127f08) + 1e4
 c3dc16b8 g_main_context_dispatch (1, 1, c3e65ea4, 2, c3e65ea8, 127f00) + c8
 c3dc1bd8 g_main_context_iterate (1, 1, 1, 127f00, 1f, 1f) + 49c
 c3dc24dc g_main_loop_run (5a9390, 0, 0, f9fc8, 5a9398, 1) + 3e4
 c05a6b34 gtk_main (0, 0, 0, be5c, c07f1538, 5a9390) + d8
 00037e70 main     (7, 163c58, 80000, 803e0, 803f8, 8040c) + 190
 00033740 _start   (0, 0, 0, 0, 0, 0) + 108
: enoexec.eu FSS 8 $; 

Time to find or file a bug... found it: 6666675. Given that number it would be rude not to check out bug 6666666; alas, it turns out to be an “internal” bug and, not surprisingly, has no special significance. It should have been the uber bug of doom, bwahahahahah.

Saturday Feb 16, 2008

Upgrade to build 83 fails.

On upgrading the home server to nv83 the system would not make it to milestone multi-user. Instead it would fail and eventually time out, leaving a number of critical services not running, specifically dhcp and the graphical login, rendering all the Sun Rays, and also the console, pretty much useless. After a short amount of digging I tried to revert to build 82, but the luactivate command hung running “metastat -i”, which was stuck here:

1687:   /usr/sbin/metastat -i
 fef04647 pollsys  (806df78, 1, 8046608, 0)
 feeb5ea2 poll     (806df78, 1, 5265c00) + 52
 fed13c43 read_vc  (806f258, 8082038, 2328) + 117
 fed2b7b8 fill_input_buf (806fe18, 0) + 44
 fed2b864 get_input_bytes (806fe18, 80466d4, 4, 0) + 88
 fed2b8d7 set_input_fragment (806fe18) + 33
 fed2b26b xdrrec_getbytes (806f2a0, 804672c, 4) + 7b
 fed2b109 xdrrec_getint32 (806f2a0, 804674c) + 69
 fed2b1b1 xdrrec_getlong (806f2a0, 80467d0) + 15
 fed2a188 xdr_u_int (806f2a0, 80467d0) + 30
 fed1a31d xdr_replymsg (806f2a0, 80467d0) + 3b9
 fed1317c clnt_vc_call (80671a8, 6, fee11010, 8046880, fee11064, 804688c) + 238
 fee10096 mdrpc_getset_2 (8046880, 804688c, 80671a8) + 3e
 fedb5819 clnt_getset (fee42e89, 0, 9, 8046980, 80469c0) + d1
 feddc7c9 getsetbynum (9, 80469c0) + 39
 fedcb007 metasetnosetname (9, 80469c0) + 43
 fedf6549 meta_smf_getmask (8047490, 80472bc, feffb7cc, 0, 0, 0) + 159
 080534ec main     (2, 8047300, 804730c) + 3a8
 08052ce6 _start   (2, 80474f8, 804750b, 0, 804750e, 8047537) + 7a

Scanning http://bugs.opensolaris.org, this looks like 6656879, so I'm back on build 82 and will be watching that bug carefully. Once again I'm thankful for live upgrade making the return trip to build 82 just a case of rebooting.

In other news, I had reason to be thankful for the hourly snapshots that the server takes, which allowed me to recover the “.mozilla” directory for one of the users after they pressed the wrong button when asked about firefox not having exited cleanly.

Saturday Feb 09, 2008

cron code delivered.

At last I have handed over the cron changes to support different timezones to Darren, who is sponsoring the effort. I've learned a lot in the process of trying to do this work from “outside” Sun. Mostly that the time required for even a very small project like this is considerable, and that there are times when you can't just put it down because you are busy. This makes it very difficult when working in your own “spare” time and can lead to some spectacularly late nights. The other problems were around keeping a build system running at home: the sometimes long gaps between working on this resulted in considerable effort to keep up with the various flag days. I also had some tangles with mercurial that did not help.

The ARC case was quite painless even if there were elements of Bike Shed Syndrome in it with real dangers of even greater feature creep. Having actually experienced ARCs internally I was probably better prepared for this than a real external engineer.

I got some really great feedback during the code reviews, which has made for a better end result.

Now I'm just sitting back and waiting.

Friday Feb 08, 2008

Good Morning Build 81 - part 2

The responsible engineer for the portfs bug 6659309 sent me some new binaries, and now our system is back running build 82, but with a new portfs module. Just to be 100% certain the bug is fixed I ran this D script, which shows that we would have panicked without the patch:

: estale.eu FSS 18 $; pfexec /usr/sbin/dtrace -n 'fbt::port_pfp_setup:entry {
	self->pfs = 1;
}
fbt::port_alloc_event_local:return / self->pfs == 1 && arg1 == 0 /  {
	self->pfs = 2;
}
fbt::port_pfp_setup:return /arg1 != 0 && self->pfs == 2/ {
	printf("port_pfp_setup failed %d: We would have crashed!", arg1);
}
fbt::port_pfp_setup:return /self->pfs / {
	self->pfs = 0
}'
dtrace: description 'fbt::port_pfp_setup:entry ' matched 4 probes
CPU     ID                    FUNCTION:NAME
  0  45054            port_pfp_setup:return port_pfp_setup failed 22: We would have crashed!
  0  45054            port_pfp_setup:return port_pfp_setup failed 22: We would have crashed!

So that pretty much tells us the bug is fixed. Well that and the fact that the system is not crashing!


However metacity is. I wonder if the two are related. Time to investigate this core dump:

: estale.eu FSS 11 $; pstack  core-estale-metacity.9140
core 'core-estale-metacity.9140' of 9140:       metacity --sm-save-file 1187603434-7917-1684782535.ms
 c40cab10 _lwp_kill (6, 0, 5, 6, ffffffff, 6) + 8
 c4054b44 abort    (1, 1, 6, c4153940, fbc5c, 0) + 108
 c3e4a550 g_logv   (ba58c, 6, 5, c3ee5404, 4, c3ee3404) + 484
 c3e4a57c g_log    (ba58c, 4, ba598, ba5c4, 1a4, ba7d0) + 1c
 00073f5c meta_window_new_with_attrs (19fa88, 1180001, ba400, ffbfe56c, fb7188, 0) + 3d8
 00073b5c meta_window_new (19fa88, 1180001, 0, daddcafe, 12, fe8) + 78
 00034780 event_callback (ffbfea28, 19fa88, 0, 33400, 14, 1180001) + f74
 00071d64 filter_func (0, ba000, 243068, 1212d0, 3380c, 243068) + 54
 c3f4d308 gdk_event_apply_filters (71d10, 11f1630, 0, 243048, 317c98, ffbfea28) + 24
 c3f4def8 gdk_event_translate (1af020, ffbfea28, ffbfea28, 0, 11f1630, 1a4e3c0) + 44
 c3f4fbc4 _gdk_events_queue (1af020, 0, 5, 0, 1b2228, ffbfea28) + ac
 c3f4fda8 gdk_event_dispatch (197a50, 0, 0, c3f9c8c0, 4cb54, 1af020) + 40
 c3e3ffc4 g_main_dispatch (171f00, 0, 0, 0, c3ee3404, 171f08) + 1e4
 c3e4166c g_main_context_dispatch (1, 1, c3ee53fc, 2, c3ee5400, 171f00) + c8
 c3e41b8c g_main_context_iterate (1, 1, 1, 171f00, 27, 28) + 49c
 c3e42490 g_main_loop_run (113dc0, 0, 0, 141fc8, 113dc8, 1) + 3e4
 000495f4 main     (b2000, b1c00, b1c00, 1, 48c00, 49400) + 4c0
 0002b160 _start   (0, 0, 0, 0, 0, 0) + 108
: estale.eu FSS 12 $; 
: estale.eu FSS 13 $; mdb core-estale-metacity.9140
Loading modules: [ libumem.so.1 libc.so.1 libuutil.so.1 ld.so.1 ]
> ba58c/s
0xba58c:        metacity
> ba598/s
0xba598:        file %s: line %d: assertion failed: (%s)
> ba5c4/s
0xba5c4:        window.c
> ba7d0/s
0xba7d0:        window->screen
> 

Tuesday Feb 05, 2008

Good Morning Build 81, or not.

I did not even get a chance to log in to the Sun Ray server running build 82 before it had crashed twice, so all was not well. A bit of digging and it was looking like a problem somewhere in portfs, with kmem corruption. Since the problem was easily reproducible (boot the system, log in and use it for a few hours) I got the lab staff to set kmem_flags to 0xf in /etc/system and boot again.
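For reference, that is a one-line change to /etc/system followed by a reboot:

* enable full kmem debugging: audit, test, redzone and contents flags
set kmem_flags=0xf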

Sure enough this morning there were two more crash dumps with variations of this in the message buffers:

kernel memory allocator: 
duplicate free: buffer freed twice
buffer=60063bfed60  bufctl=300f08886b8  cache: kmem_alloc_32
previous transaction on buffer 60063bfed60:
thread=300f43dac60  time=T-0.000269600  slab=300f08761e0  cache: kmem_alloc_32
kmem_cache_free+30
port_pcache_remove_fop+44
port_pfp_setup+198
port_associate_fop+2b8
portfs+2c8

panic[cpu512]/thread=300f43dac60: 
kernel heap corruption detected

> $c
vpanic(12ac480, 5, 2c8, 1, 18de000, 12ac400)
kmem_error+0x4e8(18de000, 3000005ae08, 60063bfed60, 12ac400, 12ac478, 
2afdfbc8220)
port_associate_fop+0x408(16, 7, 4a330, 16, 4a330, 2a10424d968)
portfs+0x2c8(1, 0, 7, 2a0, 0, 4a330)
syscall_trap32+0xcc(1, a, 7, 4a330, 10000006, 4a330)
> 

Looking at the code it appears that if port_pfp_setup encounters an error it frees some kernel memory twice. Specifically, it frees the memory pointed to by the cname local variable in port_associate_fop twice, hence the random panics. The diffs for the fix are:


*** port_fop.c  Fri Oct 26 08:58:01 2007
--- /tmp/cg13442/port_fop.c     Tue Feb  5 14:04:21 2008
***************
*** 1306,1311 ****
--- 1306,1312 ----
                if (error = port_pfp_setup(&pfp, pp, vp, pfcp, object,
                    events, user, cname, clen, dvp)) {
                        mutex_exit(&pfcp->pfc_lock);
+                       cname = NULL;
                        goto errout;
                }

I have just filed this bug:

6659309: port_associate_fop frees a buffer twice if port_pfp_setup returns an error.

What I don't know is why we suddenly started seeing the bug. Is it that build 82 exercises event ports more, or has the bug been revealed by some other change? Either way it makes me nervous for my home server running, you guessed it, build 82! At least next time someone asks why we bother running a Sun Ray server on the latest greatest nevada bits I have a pre-prepared place to send them. It is here.

Sunday Jan 27, 2008

Good Morning Build 81

After a short delay (things are very busy at the moment) build 81 has hit the Sun Ray server:

: enoexec.eu FSS 1 $; uname -a
SunOS enoexec 5.11 snv_81 sun4u sparc SUNW,Sun-Fire
: enoexec.eu FSS 2 $; 

All seems well so far. Again StarOffice decided to take me through the “Do you agree with the T&Cs and have you registered?” dialogue; it is irritating that it can't remember, but not critical.

Monday Dec 10, 2007

Good Morning Build 79

Another fortnight, another build hits our Sun Ray server:


: estale.eu FSS 1 $; uname -a
SunOS estale 5.11 snv_79 sun4u sparc SUNW,Sun-Fire
: estale.eu FSS 2 $; 


All seems well so far.

About

This is the old blog of Chris Gerhard. It has mostly moved to http://chrisgerhard.wordpress.com
