Wednesday May 24, 2006

Why have I got two automountd processes?

I've been working for ages on how to resolve an issue reported as automountd hangs when using executeable automount maps. This is logged as bug 4522909. The problem that when the automountd attempts to do a mount, it triggers a lookup on the mountpoint. This is done by another thread in the automountd. While we're waiting for that to complete we call auto_wait4mount which in turn blocks all signals by calling signintr(). This also makes the thread unstoppable by incrementing lwp_nostop. As this is an executeable map, the "other thread" has to fork1() in order to run the map. This in turn tries to STOP all threads in the process to get them in to a known state before forking the 1 thread we care about. As the mount thread is unstoppable this never completes. The fix was to allow a thread to be stopped, even if lwp_nostop is set, if the thread stopping it is in the same process. However this has show up a couple of mistaken assumptions in NFS land which mean that more work is needed there to allow an RFS call to be restarted (6306343.) This is obviously a bit of a pain as the changes required to fix 6306343 may be quite large and require a change that is too risky for a patch so an alternative approach is needed. After a good amount of discussion we concluded the lowest risk solution was to create a door server for the automountd process to talk to. This door server would handle all requests to fork the process to do an exec. Obviously as the lookup is handled in the main automountd process the deadlock is avoided. Now if you read the door_call man page it talks extensively about attaching to the door_server file descriptor as there is an assumption that the fd is somewhere in the file name space. If it is then anyone with sufficient privilege can write to it and you can end up with all sorts of rubbish written in. However if you fork() a child inherits the fs'd of the parent so the simple sollution is to have the automountd process set up the door_server() itself before it becomes multithreaded and then create a child to behave like the old automountd did. But with calls to the door_server to get the fork()/exec() stuff to work. Hence you get two automountd processes. Unfortunately this is all in the patch releases as it isn't needed for OpenSolaris, so you'll never see my excellent code, but I still thought it was worth writing it up. I'll fill in the patch versions as they come out So far we have Solaris 8 108994-56 SunOS 5.8_x86: LDAP2 client, libc, libthread and libnsl libraries patch 108993-56 SunOS 5.8: LDAP2 client, libc, libthread and libnsl libraries patch Solaris 9 117468-12 SunOS 5.9_x86: nfs patch 113318-26 SunOS 5.9: nfs patch Solaris 10 T118833-18 SunOS 5.10: Kernel Update T118855-15 SunOS 5.10_x86: Kernel Update Technorati Tag:
Technorati Tag:

Tuesday Oct 18, 2005

So what the heck is anonymous memory

As part of some work I've been doing I've had to talk a lot about anonymous memory. So I thought I'd write it down while I remember what its all about. anonymous memory is memory that is managed by segvn but is not really directly associated with a file. It's used for things like process stacks, heap, or COW (copy on write) pages. A good example of this is if you fork a process. All the addresses in the second process actually map back to the same bits of physical memory (the same pages). However if youre child process was then to do something different with the memory (eg. the child went off and manipulated an array in memory) the VM subsytem would copy those pages and change the mappings in the child process to point to the new pages. This new memory would be anonymous memory, and the child process would merrily make the changes to the array, unaware it now had new "physical" memory it was talking to. In a bit more detail, you've probably heard people talking about "anon_maps" and "vnode, offsets". Or at least if you're interested in VM and have been trawling the code you probably have. So I'll try and put a simplistic view of what they mean. Typically when you mmap a file, you are give it a vnode, offset and length. Segvn manages mmaped files and stores this in its private data structure. So when you walk through the address space of the process looking for a virtual address, you'll see that the segment contains that address and be able to map it to the vnode and where in that file using the offset. As mentioned above anonymous memory is not associated with files, but with perhaps swap (you'd get to uses anon memory if you mmaped /dev/zero BTW). So the purpose of anonymous memory or the anon layer is to fake up a vnode and offset for segvn to find the data on a swap device (actually by going through swapfs which I haven't yet looked at). The core of the anon layer is the anon_map structure. This is stored in the segvn_data structure and points you to the anon_hdr for this segment. Each anon_hdr is linked to an array on anon structures which are the swapfs implementation of how to find a vnode and offset on a swap file system (or it might be on a real swap device if you bypass swapfs, I think I'll need another blog entry for swapfs after this). So after some swapfs magic we can find the backing store for this anonmap. You might have noticed a bit if handwaiving in the middle there. What was that about an array of anon structures? Why do we need one of them? Well each anon strucure represents a page of memory. Our segment may be more than one page in size (try pmap $$ and you may well see some anon segments of more than 8K on sparc) so would need more than one of these anon structure to describe it. So we have this array. Technorati Tag:
Technorati Tag:

Wednesday Jun 29, 2005

mutex_enter() in an interrupt handler?

I've always assumed that interrupt handlers were absoluteley there to do the minimum work and get out as soon as possible. A question came to me today that challenged that view. The question was "Can I mutex_enter the pidlock while in an interrupt context?" My instant reaction was no way, you should never block in an interrupt context. However it seems I was wrong. Take a look at fas_intr(). You can see virtually the first thing it does is grab a mutex. This got me thinking. There is a fundemental difference between grabbing a lock in an interrupt handler that is private to that code and using a lock which is in general use throughout the kernel. Take this example. Something is holding the pidlock. But for whatever reason (it's timeslice has run out) it has be scheduled off CPU. So we have our foo card installed in our Solaris machine and an interrupt comes in for it. foointr() has been written to grab the mutex for the pidlock to make sure it has a stable view of the process table. So it tries to grab the pidlock and it has to block till the owner frees it. Could be ages, and the interupt wouldn't be serviced in a timely manner. I'd like to say I dredged through the code and found all this out by reading the code, but infact I just read the manual page. This says that you can acquire a mutex, kernel or high level interupt context. However there is a restriction that the mutex must have been initialised as type DRIVER_MUTEX, and have the ddi_iblock_cookie (returned from ddi_get_iblock_cookie(9f)) as the arg to mutex_init(9f). So there you have it. you can use a mutex in a high level interupt handler, just not ones like pidlock Technorati Tag:
Technorati Tag:

Thursday Jun 23, 2005

Debugging automounter problems

I've been working on a fix to an automounter problem for some time. So I'm just testing the final fix and sanity testing the binaries on my laptop. Suddenly my home directory doesn't get mounted. This is of course very strange. It's been working on that laptop for ages so I think it must be my fix.
So naturally the first thing I do is take my binaries off and try again. Few that didn't work either so at least my fix is good. Trouble is I still have an autmounter that doesn't work.
Some things do work however. eg.
$ cd /net//
But somethings don't
$ cd /work/cwb
ksh: /work/cwb: permision denied
$ cd /home/cwb
ksh: /home/cwb: not found
This was rather strange.
So, you may or may not already know this, but there is a way of seeing the diagnostic output from automountd by running the command
$ ls -l /net/=9
/net/=9: No such file or directory
Well actually you can substitute the 9 for another number corresponding to a debug level. Traditionally this sent output to the console window. So I dilligentley start up a console terminal window and it stays resolutely blank. After a little investigation it appears that in converting autofs to SMF, these messages are now sent to the smf log
svcs -l \\\*autofs\\\*
fmri         svc:/system/filesystem/autofs:default
name         Automounter
enabled      true
state        online
next_state   none
state_time   Fri 17 Jun 2005 09:45:35 BST
logfile      /var/svc/log/system-filesystem-autofs:default.log
restarter    svc:/system/svc/restarter:default
contract_id  47 
dependency   require_all/none svc:/system/filesystem/local (online)
dependency   require_all/restart svc:/milestone/name-services (online)
dependency   optional_all/none svc:/network/nfs/client (online)
So I take a look in /var/svc/log/system-filesystem-autofs:default.log and low and behold we have the answer.
t1      Automountd: trace level = 9
t1        do_lookup1: action=2 wildcard=FALSE error=2
t1      LOOKUP REPLY    : status=2
t10     LOOKUP REQUEST: Wed Jun 22 17:50:02 2005
t10       name=cwb[] map=auto_home opts=nobrowse path=/home direct=0
t10       PUSH /etc/auto_home
t10             Executable map: map=/etc/auto_home key=cwb
t10             read_execout: forking .....
t10             read_execout: map=/etc/auto_home key=cwb line=
t10       POP /etc/auto_home
t10       do_lookup1: action=2 wildcard=FALSE error=2
t10     LOOKUP REPLY    : status=2
t10     LOOKUP REQUEST: Wed Jun 22 17:50:02 2005
t10       name=cwb[] map=auto_home opts=nobrowse path=/home direct=0
t10       PUSH /etc/auto_home
t10             Executable map: map=/etc/auto_home key=cwb
t10             read_execout: forking .....
t10             read_execout: map=/etc/auto_home key=cwb line=
t10       POP /etc/auto_home
t10       do_lookup1: action=2 wildcard=FALSE error=2
t10     LOOKUP REPLY    : status=2
Note the PUSH of auto_home. In that section it appears that the map is executeable. But in this case auto_home is just a normal file
$ cat /etc/auto_home
# Home directory map for automounter
cwb     deadarm:/export/home/&
So clearly something has happened to make the automounter believe this is an executeable map (see man automount(1m) for more details of automount maps and executeable ones). Check the permissions on /etc/auto_\*
$ ls -l /etc/auto_\*
-rwxr-xr-x   1 root     root         339 Apr  9 18:11 /etc/auto_direct
-rwxr-xr-x   1 root     bin           66 Jun 23 14:44 /etc/auto_home
-rwxr-xr-x   1 root     bin          128 Jun 22 17:55 /etc/auto_master
So somehow I've managed to get execute permission on these files, so the automounter tries to run them. Hence I get no mounts. Clearing the exec bit
$ chmod 644 /etc/auto_\*
And it all returns to normal. Now make sure I turn off the debugging so as not to fill up my log files :)
$ ls -l /net/=0
/net/=0: No such file or directory
Putting the debug messages in the the smf log file is great. It makes it much easier to debug remotely and you don't have to rty and capture the console output as you're testing. My laptop is now working fine, and my fix is still good. Of course I still don't know why my map files had the mode changed. I suspect I mistyped. I'll pkgchk the system later to see if there are any other strange permissions I've set... Technorati Tag:
Technorati Tag:
Technorati Tag:
Technorati Tag:
Technorati Tag:

Wednesday Jun 22, 2005

UK OpenSolaris User group

On Monday we had the first meeting of the UK OpenSolaris user group. It was interesting to meet some of the other people (ie. non Sun people) who want to get OpenSolaris working. There was a presentation from Simon Phipps about why OpenSource is necessary and good, which was interesting. We also had a free form discussion about anything. The vast majority of this was covering Licensing which is something I have no experise, or to be honest interest in. However there were two really interesting technical points.
1. How will I know exactly which source made up my [Open]Solaris system: I put Open in brackets because this also pertains to comercial solaris and is a problem we've had inside sun for many years. We know what shipped as base Solaris 10 for example, but then we ask customers to apply patches (or they choose to because they have a problem). These patches are generated from what is called the "Patch Gate", which is branched off the main gate at FCS (at which point the main gate is frozen). All patches are generated from this and then the target binaries are put in to the patch and applied to the system. Copies of the source are also take and put in the "Source Patch" which is only available internally. Obviously we can keep a record of the SCCS versions of the code we put in to the patches so that we can go back later and find out. But these are not readily available from the binaries. This would be a good thing to be able to find out from a running system and is probably something we should put in to OpenSolaris to allow users to identify exactley what is on their system.
2. Can we mount ext[23]fs on Solaris: The answer to this is no. At the moment we have no implementation of the ext[23] file systems. It would be useful for migrating people to Solaris, however as a company our development is targetting ZFS. That's not to say it shouldn't be done, so put it up for discussion and see who bites. Try posting to the disucssion forum Or perhaps the file systems one I know that is primarily for ufs, but theres some good filesystems people in that comunity.
I'm now looking forward to the next meeting. Hopefully we will get even more people there and can have some targetted technical discussions
Technorati Tag:
Technorati Tag:
Technorati Tag:
Technorati Tag:
Technorati Tag:

Tuesday Jun 14, 2005

How does my signal handler get to run

With OpenSolaris a reality, I can talk about some of the areas I've had to trouble shoot in the past. I started trying to write blogs and soon found I needed to reference the source code to illustrate the point. This is one area I've had to look at many times when tracking down problems. I've pieced together my understanding from reading the source code, however while writing this blog entry I've used a simple program and some "D" to illustrate what's going on.

Here's the simple program to copy some memory around, receive an alarm signal and run a signal handler.

$  cat alarm.c
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>

void alarm_handler();

        char \*src, \*dest;
        sigset_t set;
        struct sigaction sigact;

        sigact.sa_handler = alarm_handler;
        sigaddset(&sigact.sa_mask, SIGALRM);

        sigact.sa_flags = SA_SIGINFO;

        src = malloc (4096);
        dest = malloc (4096);
        memset(src, '@', 4096);
        while(1) {
                memcpy (src, dest, 4096);

void alarm_handler()
        char txt[]="In alarm handler";

        write (1, txt, sizeof(txt));

memset and memcpy are implemented by using the block load and store functions which use the floating point registers. This was used to emulate a customer's problem.
We know that the signal is actually delivered by the kernel when it is safe to do so, which is usually when you transition from kernel to userland or back (ie entering or leaving a system call). The act of sending a signal simply marks the t_astflag in the thread structure for trap to process later, so we can write a little D to trace anything from trap_cleanup (where the astflag is checked). We also need to check whether we should be checking for a signal at this time (this was new to me as well)
$ cat trap.d
#!/usr/sbin/dtrace -Fs
        curthread->_tu._ts._t_astflag == 1 &&
        curthread->_tu._ts._t_sig_check == 1 &&
        execname == "alarm"
        trace (timestamp);
        self->trace = 1;

        self->trace == 1
        trace (timestamp);
        self->trace = 0;

        self->trace == 1
        trace (timestamp);

        self->trace == 1
        trace (timestamp);

So we run these two together so we can see what happens. I'll annotate this in-line

$ trap.d
dtrace: script 'trap.d' matched 46375 probes
  1  -> trap_cleanup                             94182830223096
  1   | trap_cleanup:entry                       94182830229763
  1    -> sigcheck                               94182830231096
  1    <- sigcheck                               94182830232763

sigcheck is actually part of the ISSIG_PENDING macro which is used to establish if there is a signal there and if we are allowed to do anything with it.
  1    -> issig                                  94182830234346
  1    <- issig                                  94182830235763
  1    -> issig_forreal                          94182830237763
  1      -> schedctl_finish_sigblock             94182830239763
  1      <- schedctl_finish_sigblock             94182830241263
  1      -> fsig                                 94182830244096
  1        -> sigdiffset                         94182830246013
  1        <- sigdiffset                         94182830247346
  1      <- fsig                                 94182830249263
  1      -> fsig                                 94182830250263
  1        -> sigdiffset                         94182830251096
  1        <- sigdiffset                         94182830251930
  1        -> lowbit                             94182830253430
  1        <- lowbit                             94182830254513
  1      <- fsig                                 94182830255263
  1      -> sigdeq                               94182830257513
  1      <- sigdeq                               94182830259430
  1      -> isjobstop                            94182830261013
  1      <- isjobstop                            94182830262430
  1      -> undo_watch_step                      94182830264346
  1      <- undo_watch_step                      94182830266096
  1    <- issig_forreal                          94182830267180
All of this is issig_forreal() a function which extracts the next pending signal and puts it in to p_cursig. We then call psig ()
  1    -> psig                                   94182830268763
  1      -> schedctl_finish_sigblock             94182830269680
  1      <- schedctl_finish_sigblock             94182830270596
Blocks all signals if requested by the userlevel code setting tdp->sc_shared->sc_sigblock

  1      -> sigorset                             94182830273596
  1      <- sigorset                             94182830275013

sendsig() is platform dependant. This is for the sparc version so we go in to sendsig32()
  1      -> sendsig32                            94182830277430
  1        -> flush_user_windows_to_stack        94182830279263
  1        <- flush_user_windows_to_stack        94182830280346
Make sure the stack has all been flushed to the stack so we can safely change the stack pointer. We're going to have to contruct a stack frame for our signal handler so the current registers have to have somewhere to go on the stack. So we need to know how big a stack frame is, we do this by incrementing minstacksz. First of all the floating point registers if the fpu is enabled then the general registers, then the extra register set.
  1        -> xregs_getsize                      94182830282596
  1        <- xregs_getsize                      94182830283930
Watch point handling
  1        -> watch_disable_addr                 94182830285846
  1          -> avl_numnodes                     94182830287096
  1          <- avl_numnodes                     94182830288263
  1        <- watch_disable_addr                 94182830289346
From here we're copying the context (all the register information) in to the area on the stack we've just set up.
  1        -> kmem_alloc                         94182830290513
  1          -> kmem_cache_alloc                 94182830291846
  1          <- kmem_cache_alloc                 94182830293430
  1        <- kmem_alloc                         94182830294680
  1        -> savecontext32                      94182830296430
  1          -> flush_user_windows_to_stack      94182830298680
  1          <- flush_user_windows_to_stack      94182830299513
  1          -> getgregs32                       94182830301763
  1            -> mkpsr                          94182830304680
  1            <- mkpsr                          94182830306096
  1          <- getgregs32                       94182830307596
  1          -> getfpregs                        94182830309430
  1            -> lwp_getdatamodel               94182830310513
  1            <- lwp_getdatamodel               94182830311430
  1          <- getfpregs                        94182830315096
  1          -> fpuregset_nto32                  94182830316513
  1          <- fpuregset_nto32                  94182830318680
  1        <- savecontext32                      94182830320013
  1        -> xregs_setptr32                     94182830321846
  1        <- xregs_setptr32                     94182830323013
  1        -> kmem_alloc                         94182830324013
  1          -> kmem_cache_alloc                 94182830325096
  1          <- kmem_cache_alloc                 94182830326263
  1        <- kmem_alloc                         94182830327180
  1        -> xregs_get                          94182830328846
  1          -> xregs_getgregs                   94182830331263
  1          <- xregs_getgregs                   94182830333013
  1          -> xregs_getgfiller                 94182830334763
  1          <- xregs_getgfiller                 94182830336180
  1        <- xregs_get                          94182830337346
  1        -> xregs_getfpregs                    94182830339346
  1          -> xregs_getfpfiller                94182830341513
  1          <- xregs_getfpfiller                94182830343346
  1        <- xregs_getfpregs                    94182830344846
  1        -> kmem_free                          94182830347013
  1        <- kmem_free                          94182830347930
  1        -> kmem_cache_free                    94182830348930
  1        <- kmem_cache_free                    94182830349846
  1        -> kmem_free                          94182830352263
  1        <- kmem_free                          94182830353013
  1        -> kmem_cache_free                    94182830353930
  1        <- kmem_cache_free                    94182830354846
  1        -> watch_disable_addr                 94182830356513
  1          -> avl_numnodes                     94182830357430
  1          <- avl_numnodes                     94182830358430
  1        <- watch_disable_addr                 94182830359263
  1        -> ucopy                              94182830361013
  1        <- ucopy                              94182830362680
  1      <- sendsig32                            94182830364680
Interestingly we see nothing in the dtrace output for the crucial bit of the code that sets the registers up to run the signal handler. This is because it is simply done by manipulating the register within sendsig32. Here is a snippit of the code that does this
1451        /\*
1452         \* Set up user registers for execution of signal handler.
1453         \*/
1454        rp->r_sp = (uintptr_t)fp;
1455        rp->r_pc = (uintptr_t)hdlr;
1456        rp->r_npc = (uintptr_t)hdlr + 4;
1457        /\* make sure %asi is ASI_PNF \*/
1458        rp->r_tstate &= ~((uint64_t)TSTATE_ASI_MASK << TSTATE_ASI_SHIFT);
1459        rp->r_tstate |= ((uint64_t)ASI_PNF << TSTATE_ASI_SHIFT);
1460        rp->r_o0 = sig;
1461        rp->r_o1 = (uintptr_t)sip_addr;
1462        rp->r_o2 = (uintptr_t)&fp->uc;

  1    <- psig                                   94182830366263
  1  <- trap_cleanup                             94182830368013
So once we return to userland the registers we're using are the ones set at the end of sendsig32. As a result we start running in the signal handler. Simple really. With dtrace it is very simple to see the flow of code through a function. Combined with the opensolaris source browser you can quickly learn about new areas. Essential when tracking down bugs and customer problems.
Technorati Tag:
Technorati Tag:
Technorati Tag:

Chris W Beal-Oracle


« June 2016