Tuesday Jan 26, 2010

Workaround for runaway metacity

Sun Ray on OpenSolaris build 131 requires the same workarounds I previously mentioned.

There is one more that helps with both 130 and 131. With the new gdm set up the login screen now runs "metacity" and occasionally this can get into a loop just consuming CPU. The trigger is that metacity has been sent a signal to terminate but then tries to be a bit too clever and goes into the loop. I've filed this bug so that it can be fixed.

Happily once again you can work around this with a bit of dtrace:

#!/usr/sbin/dtrace -qws

proc:::signal-handle
/ execname == "metacity" && args[0] == 15 / {
        system("logger -t metacity.d -p daemon.notice killing metacity[%d]", pid); 
        raise(9)
}

Sunday Jan 17, 2010

More Sun Ray on OpenSolaris build 130 workarounds

One more thing for the Sun Ray on build 130. Whether this is the last remains to be seen.

Now that gdm is run via smf gnome-session is now run via ctrun(1) so that it gets it's own contract and therefore any problems it has do not result in gdm being restarted.

However the Sun Ray sessions are not started that way. Hence I was seeing all users logged out if just one session had a problem:

So a rather trivial failure such as this:

Jan 16 11:10:47 pearson genunix: [ID 603404 kern.notice] NOTICE: core_log: metacity[7448] core dumped: /var/cores/core.metacity.7448 

would result in gdm restarted:


[ Jan 16 00:06:08 Method "start" exited with status 0. ]
[ Jan 16 11:10:47 Stopping because process dumped core. ]
[ Jan 16 11:10:47 Executing stop method ("/lib/svc/method/svc-gdm stop"). ]
[ Jan 16 11:10:47 Method "stop" exited with status 0. ]

which in turn means all the users were logged out. Ooops.

The solution was simple but like the previous workarounds leaves your warranty in tatters!

# mv /usr/bin/gnome-session /usr/bin/gnome-session.orig
# cat > /usr/bin/gnome-session << EOF
> #!/bin/ksh -p
> exec ctrun -l child \\$0.orig \\$@
> EOF
# chmod 755 /usr/bin/gnome-session

This results in all your gnome sessions having their own contract as their parent is ctrun:

: pearson FSS 33 $; ptree $(pgrep -o -u gdm gnome-session)
22433 /usr/sbin/gdm-binary
  22440 /usr/lib/gdm-simple-slave --display-id /org/gnome/DisplayManager/Displa
    22965 ctrun -l child /usr/bin/gnome-session.orig --autostart=/usr/share/gdm
      22967 /usr/bin/gnome-session.orig --autostart=/usr/share/gdm/autostart/Lo
        23062 /usr/lib/gdm-simple-greeter
        23063 gnome-power-manager
: pearson FSS 34 $; 

and means that any failures are now ring-fenced to just that session.

Monday Jan 11, 2010

More Sun Ray on OpenSolaris build 130

As I have previously mentioned I have Sun Ray "working" on OpenSolaris build 130 at home. There are some minor tweaks required to get things working close to perfectly.

If you are doing this you are already running OpenSolaris build 130 and Sun Ray which is completely unsupported. These changes are also completely unsupported. There was not warranty but if there was one you will void it.

First take a back up. Since you are running OpenSolaris and therefore have ZFS take snapshot of the file system that contains /opt/SUNWut and also use beadm to create a snapshot of the boot environment.

Now to get to a point where you can login on a Sun Ray DTU you need to do this:

ln -s /usr/lib/xorg/libXfont.so.1 /opt/SUNWut/lib
ln -s /usr/lib/xorg/libfontenc.so.1 /opt/SUNWut/lib
rm /usr/lib/xorg/modules/extensions/GL
ln -s ../../../../../var/run/opengl/server \\
/usr/lib/xorg/modules/extensions/GL
mkdir /etc/opt/SUNWut/X11
echo "catalogue:/etc/X11/fontpath.d" > /etc/opt/SUNWut/X11/fontpath
usermod -d /var/lib/gdm gdm

However the utwho command won't work and if you want to use utaction as root you need to follow the instructions in my last post.

Now utwho is extremely useful and for me a requirement as it is used by my access hours script so I wanted to get that working. As with the issues with utaction the first problem is that the script that sets this up expects to run as root but now everything is running as the user "gdm". Again the solution is RBAC.


Follow the instructions from my last post to set up a GDM profile and make the user gdm use it. Then add these lines to /etc/security/exec_attr:

GDM:solaris:cmd:::/etc/opt/SUNWut/gdm/SunRayInit/Default:uid=0
GDM:solaris:cmd:::/etc/opt/SUNWut/gdm/SunRayPostSession/Default:uid=0

and then edit the two file listed above to add in the bold lines below. The example is /etc/opt/SUNWut/gdm/SunRayInit/Default:

#!/bin/ksh
# iterate over the helpers directory
#
# ident "@(#)InitDefault.sh     1.5 09/07/31 SMI"
#
# Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
# Use is subject to license terms.
#

PATH=/usr/bin/X11:/usr/X11R6/bin:/opt/X11R6/bin:/usr/bin:$PATH

if [[ "$_" != "/usr/bin/pfexec" ]] && [[ -x /usr/bin/pfexec ]]

then

        exec /usr/bin/pfexec $0 $@

fi


for i in /etc/opt/SUNWut/gdm/SunRayInit/helpers/\*
do
        if [ -x $i ]; then
            . $i
        fi
done

exit 0

Finally, and quite whether this is required I'm not sure, but the reset-dpy script will not work properly either so make these changes to fix it:


\*\*\* /opt/SUNWut/.zfs/snapshot/month_2009-12-01-01:02/lib/xmgr/gdm/reset-dpy     Tue Oct 20 01:32:31 2009
--- ./reset-dpy Mon Jan 11 13:59:30 2010
\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
\*\*\* 65,70 \*\*\*\*
--- 65,71 ----
        dpys=`gdmdynamic -l | /bin/sed -e 's/:\\([\^,]\*\\),[\^;]\*/\\1/g' -e 's/;/ /g' `
        for dpy in $dpys
        do
+         dpy=${dpy#:}
            if [[ $dpy -ge $MINDISP && $dpy -le $MAXDISP ]]; then
            rm "$PRESESSION_DIR/:$dpy"
              rm "$POSTSESSION_DIR/:$dpy"
root@pearson:/etc/opt/SUNWut/xmgr#       

Now all will work:

: pearson FSS 2 $; utwho -c 
 12.0 Payflex.500a094f00130100         user2      192.168.1.228   P8.00144f57a46f
: pearson FSS 3 $; utwho
 12 Payflex.500a094f00130100             user2     
 14 Payflex.500a094d00130100             user1     
 18 Payflex.500a094c00130100             user3     
 19 Payflex.500a094e00130100             user4     
: pearson FSS 4 $; 

However you have voided your warranty!

Update: 12/1/2010 These problems should be fixed in build 132 so the workarounds should not be needed then.

Sunday Jan 10, 2010

Sometimes being on the bleeding edge you get cut.

While Sun Ray and OpenSolaris build 130 are functional they are not happy. In particular the changes to gdm have resulted in many of the functions of the Sun Ray software not working. Small things like utwho(1) no longer work.

More importantly the login scripts running as the user "gdm" have stopped my scripts that adjust the user shares and stop firefox when users disconnect from working. Since this results in the system being 100% busy all the time an urgent workaround was required.

The workaround uses lots of undocumented features, so I don't expect it to keep working long term but at least it will keep me going until the next upgrade.

The problem of not running as root is trivially solved by using RBAC and then calling utaction via pfexec(1) adding these lines to each of the files:

root@pearson:/root# egrep /etc/security/\*_attr
/etc/security/exec_attr:GDM:solaris:cmd:::/opt/SUNWut/bin/utaction:uid=0
/etc/security/prof_attr:GDM:::Do not assign to users. Profile for GDM so it can run utaction as root:help=Utaction.help
root@pearson:/root# 

Then using usermod to add the GDM profile to the gdm user:

root@pearson:/root# usermod -P GDM gdm

The now the utaction you call from your PostLogin script will be run as root. However instead of passing in the user name, which when the PostLogin script runs you don't know, pass in the name of the Sun Ray session_proc file and read the UID out of there. I have:


function read_session_proc
{
        typeset IFS="="
        typeset key val
        while read key val
        do
                if [[ "$key"="uid" ]]
                then
                        typeset IFS=:
                        typeset u spam
                        getent passwd $val | read u spam
                        print $u
                        break
                fi
        done
}
if [[ "${1#/}" != $1 ]] && [[ -f $1 ]]
then
        USER=$(read_session_proc < $1)
else
        USER=$1
fi

In the adjust shares scripts and this in the PostLogin script (/etc/opt/SUNWut/gdm/SunRayPostLogin/Default):

#!/bin/sh
#
# ident "@(#)PostLoginDefault.sh        1.1 04/05/06 SMI"
#
AD=/usr/local/sbin/adjustshares.workaround
.130

d=${DISPLAY#\*:}
d=${d%.\*}
LOGNAME=/tmp/SUNWut/session_proc/$d
/usr/bin/ctrun -l child -i none /usr/bin/pfexec /opt/SUNWut/bin/utaction -i -c "$AD $LOGNAME 50" -d "$AD $LOGNAME" &


Update: I have added the ctrun otherwise if any of the actions called by utaction dump core then everyone gets logged out. Clearly the core dumps need to be resolved but there is no reason to log everyone out.

Wednesday Jan 06, 2010

ZFS resliver performance improved!

I'm not being lucky with the Western Digital 1Tb disks in my home server.

That is to say I'm extremely pleased I have ZFS as both have now failed and in doing so corrupted data which ZFS has detected (although the users detected the problem first as the performance of the drive became so poor).One of the biggest irritations about replacing drives, apart from having to shut the system down as I don't have hot swap hardware is waiting for the pool to resliver. Previously this has taken in excess of 24 hours to do.

However yesterday's resilver was after I had upgraded to build 130 which has some improvements to the resilver code:


: pearson FSS 1 $; zpool status tank
  pool: tank
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: resilver completed after 6h28m with 0 errors on Wed Jan  6 02:04:17 2010
config:

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            c21t0d0s7  ONLINE       0     0     0
            c20t1d0s7  ONLINE       0     0     0
          mirror-1     ONLINE       0     0     0
            c21t1d0    ONLINE       0     0     0  225G resilvered
            c20t0d0    ONLINE       0     0     0

errors: No known data errors
: pearson FSS 2 $; 
Only 6 ½ hours for 225G which while not close to the theoretical maximum is way better than 24 hours and the system was usable while this was going on.

Sunday Jan 03, 2010

Automatic virus scanning with c-icap & ZFS

Now that I have OpenSolaris running on the home server I thought I would take advantage of the virus scanning capabilities using the clamav instance I have running. After downloading, compiling and installing c-icap I was able to get the service up and running quickly using the instructions here.

However using a simple test of trying to copy an entire home directory I would see regular errors of the form:

Jan  2 16:18:49 pearson vscand: [ID 940187 daemon.error] Error receiving data from Scan Engine: Error 0

Which were accompanied by a an error to the application and the error count to vscanadm stats.

From the source it was clear that the recv1 was returning 0, indicating the stream to the virus scan engine had closed the connection. What was not clear was why?

So I ran this D to see if what was in the buffer being read would give a clue:


root@pearson:/root# cat vscan.d 
pid$target::vs_icap_readline:entry
{
        self->buf = arg1;
        self->buflen = arg2;
}
syscall::recv:return /self->buf && arg1 == 0/
{
        this->b = copyin(self->buf, self->buflen);
        trace(stringof(this->b));
}
pid$target::vs_icap_readline:return
/self->buf/
{
        self->buf=0;
        self->buflen=0;
}
root@pearson:/root# 

root@pearson:/root# dtrace -s  vscan.d -p $(pgrep vscand)
dtrace: script 'vscan.d' matched 3 probes
CPU     ID                    FUNCTION:NAME
  1   4344                      recv:return 
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

The clue was that the error comes back on the very first byte being read. The viruse scan engine is deliberately closing the connection after handling a request which since it had negotiated "keep-alive" it should not.

The solution2 was to set the MaxKeepAliveRequests entry in the c-icap.conf file to be -1 and therefore disable this feature.

1Why is recv being used to read one byte at a time? Insane, a bug will be filed.

2It is in my opinion a bug that the vscand can't cope gracefully with this. Another bug will be filed.

About

This is the old blog of Chris Gerhard. It has mostly moved to http://chrisgerhard.wordpress.com

Search

Categories
Archives
« January 2010 »
MonTueWedThuFriSatSun
    
1
2
4
5
7
8
9
12
13
14
15
16
18
19
20
21
22
23
24
25
27
28
29
30
31
       
Today