dsvclockd(1M): Using Doors to Implement Inter-Process Readers/Writer Locks

For a long-time and rabid Solaris developer like me, the most personally satisfying part of the OpenSolaris launch is being able to finally share some of the creative (and borderline insane) solutions we've devised to real-world problems.

Here, I cover perhaps the most perverse use of Doors in Solaris: as the basis for a robust, multi-threaded, inter-process, inter-machine, readers/writer lock implementation. So, dust off that Morrison Hotel LP, and kill the lights ...

Mojo Rising

One groovy flavor of interprocess communication in Solaris is doors, which first appeared in Sun Labs' Spring OS, was implemented in Solaris 2.5 by Jim Voll, and rose to fame as the mechanism that made our nscd scream. My exposure to doors goes back to 1996, when I stumbled on the nascent facility while trussing [1] getent to root-cause a name-service problem -- the truss was quickly followed by a man door_call, which returned a curious manpage containing a grave warning and a fitting note:

  DESCRIPTION

     This family of system calls provide a new flavor of interprocess
     communication between client and server processes.  The doors mechanism
     is not yet available for public consumption.
 
  WARNING

     Please do not attempt to reverse-engineer the interface and program to
     it. If you do, your program will almost certainly fail to run on future
     versions of Solaris, and may even be broken by a patch. This document
     does not constitute an API.  Doors may not exist or may have a completely
     different set of semantics in a future release.

  NOTES

     This manual page is here solely for the benefit of anyone who noticed
     door_call() in truss(1) output and thought, "Gee, I wonder what that
     does..."
Of course, despite the warning, doors did continue to exist and these days the fundamentals of doors are well-documented, both in our own manpages and in the late Richard Stevens's Unix Network Programming: Volume Two. In fact, if you are not familiar with the basics of doors, I'd suggest at least browsing the door_create() and door_call() manpages before proceeding.
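
If you'd rather see code than manpages before continuing, here's a minimal, purely illustrative door server and client -- the names and rendezvous path are mine, not anything from the Solaris sources, and error handling is omitted:

   /*
    * A minimal "hello" door, purely for illustration: made-up names,
    * made-up rendezvous path, no error handling.
    */
   #include <door.h>
   #include <fcntl.h>
   #include <stdio.h>
   #include <string.h>
   #include <stropts.h>
   #include <sys/types.h>
   #include <unistd.h>

   #define HELLO_DOOR      "/tmp/hello_door"

   /* Server procedure: runs in a door server thread on each door_call(). */
   static void
   hello_servproc(void *cookie, char *argp, size_t arg_size, door_desc_t *dp,
       uint_t n_desc)
   {
           char reply[64];

           (void) snprintf(reply, sizeof (reply), "hello, %s!", argp);

           /* Control (and the reply) returns to the blocked door_call()er. */
           (void) door_return(reply, strlen(reply) + 1, NULL, 0);
   }

   /* Server side: create the door and attach it to the filesystem. */
   static int
   hello_serve(void)
   {
           int did = door_create(hello_servproc, NULL, 0);

           (void) close(open(HELLO_DOOR, O_CREAT | O_RDWR, 0644));
           if (fattach(did, HELLO_DOOR) == -1)
                   return (-1);
           return (did);           /* a real server would then e.g. pause() */
   }

   /* Client side: open the attached door and call through it. */
   static int
   hello_call(const char *name, char *buf, size_t buflen)
   {
           door_arg_t      args;
           int             doorfd = open(HELLO_DOOR, O_RDONLY);

           args.data_ptr   = (char *)name;
           args.data_size  = strlen(name) + 1;
           args.desc_ptr   = NULL;
           args.desc_num   = 0;
           args.rbuf       = buf;          /* reply lands here ... */
           args.rsize      = buflen;       /* ... if it fits */

           return (door_call(doorfd, &args));  /* blocks until door_return() */
   }

The property that matters for the rest of this entry is that door_call() blocks until the server procedure calls door_return() -- which, as we'll see, is exactly the shape of a lock acquisition.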

Synchronization Gone Wild

The original DHCP server, which appeared in Solaris 2.6, was heavily based on CMU's bootpd server and was, among other things, dreadfully slow. As such, not too long ago we embarked on a project to kick off the training wheels and generally make it scale to an enterprise-level workload. The result was a heavily multithreaded DHCP server and a pluggable datastore infrastructure (see dhcp_modules(5)) -- this infrastructure is summarized in Dave Miner's DHCP Server Tour blog entry, and the implementation of SUNWbinfiles is worthy of an extended future blog entry as well.

One significant problem that arose during the implementation was how to synchronize access to the underlying DHCP data (e.g., lease information). Specifically, to ensure correctness, the multithreaded DHCP server and the DHCP administrative tools needed to synchronize with one another before accessing or modifying the underlying DHCP data.

Our first thought was to use cross-process (USYNC_PROCESS) readers-writer locks (see rwlock_init()); a sketch of that approach follows the list below. However:

  • USYNC_PROCESS readers-writer locks are not robust. This means that if a process unexpectedly terminates with the lock held, all other processes may end up stuck forever -- this is clearly unacceptable [2].
  • USYNC_PROCESS readers-writer locks require setting up a shared memory region to store the locks in each process. Coordinating the management of this region can prove subtle, not to mention that some administrators have been known to lower the maximum shared memory size to unusably low levels. In addition, at the time it was unclear whether the use of shared memory was kosher from the native layer of Java applications (the DHCP administrative tools are all in Java).
  • USYNC_PROCESS readers-writer locks could not help us solve a related thorny issue: coordination with processes on other machines. Specifically, a little-known but supported configuration was to place the DHCP data on an NFS server and allow multiple DHCP servers (and multiple sets of tools) to access or modify that data. As luck would have it, this particular configuration was used extensively by Sun's own IT department [3].
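
For concreteness, the approach we rejected would have looked roughly like the sketch below -- a readers/writer lock living in a file-backed shared mapping (hypothetical names, no error handling). Note that the first objection is invisible in the code, which is exactly the problem: if a process dies between rw_wrlock() and rw_unlock(), everyone else blocks forever.

   #include <fcntl.h>
   #include <synch.h>
   #include <sys/mman.h>
   #include <unistd.h>

   /* Map (and, in exactly one process, initialize) a shared rwlock. */
   static rwlock_t *
   map_shared_rwlock(const char *path, int creator)
   {
           rwlock_t        *rwlp;
           int             fd = open(path, O_RDWR | O_CREAT, 0644);

           (void) ftruncate(fd, sizeof (rwlock_t));
           rwlp = mmap(NULL, sizeof (rwlock_t), PROT_READ | PROT_WRITE,
               MAP_SHARED, fd, 0);
           (void) close(fd);

           if (creator)
                   (void) rwlock_init(rwlp, USYNC_PROCESS, NULL);
           return (rwlp);
   }

   /* Each process then brackets its table accesses like so ... */
   static void
   read_table(rwlock_t *rwlp)
   {
           (void) rw_rdlock(rwlp);         /* shared (read) access */
           /* ... read the DHCP table; die here and writers wait forever ... */
           (void) rw_unlock(rwlp);
   }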

Another thought was to use the file-locking APIs, but lockf() does not provide shared locks, and flock() is not MT-safe (as I found out the hard way). Furthermore, not all of the DHCP data requiring synchronization was file-based -- and while that does not preclude using files strictly as a locking mechanism, it does make it awkward.

The Birth of dsvclockd

Speeding home through a particularly dismal New England winter night, I hit upon a promising and perverse solution: the door_call() routine is synchronous, so why couldn't door_call() itself be used to provide our desired rw_rdlock()/rw_wrlock() semantics?

A few (mostly sleepless) days later, I had a functioning prototype -- the premise was actually quite simple: a new daemon (later dubbed dsvclockd) arbitrated access to each of the underlying DHCP data tables [4]. Thus, for a random application to request read or write access to a given DHCP table, it need only issue a door_call() to dsvclockd, and pass it the following information:

  • The name of the DHCP table to access.
  • Whether it needed read (shared) or write (exclusive) access.
  • Whether it was willing to block if the DHCP table was not available for immediate access.
  • Whether it needed inter-machine synchronization [3].
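
These four pieces of information map more or less directly onto the request structure that crosses the door. Paraphrasing loosely -- the lrq_* field names match the dsvcd_lock() excerpt below, but the types, sizes, and common header shown here are my own approximations rather than the real protocol header:

   typedef enum { DSVCD_RDLOCK, DSVCD_WRLOCK, DSVCD_NOLOCK } dsvcd_locktype_t;

   typedef struct {
           int             rq_version;         /* DSVCD_DOOR_VERSION */
           int             rq_reqtype;         /* DSVCD_LOCK or DSVCD_UNLOCK */
   } dsvcd_request_t;

   typedef struct {
           dsvcd_request_t         lrq_request;    /* common request header */
           dsvcd_locktype_t        lrq_locktype;   /* read or write access? */
           int                     lrq_nonblock;   /* fail rather than block? */
           int                     lrq_crosshost;  /* inter-machine synchronization? */
           int                     lrq_conver;     /* container format version */
           char                    lrq_loctoken[64];   /* where the container lives */
           char                    lrq_conname[64];    /* which container (table) */
   } dsvcd_lock_request_t;
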
Of course, this is all hidden behind a couple of functions inside libdhcpsvc, dsvcd_rdlock() and dsvcd_wrlock(), which are simple wrappers around a shared routine called dsvcd_lock():
   static int
   dsvcd_rdlock(dsvc_synch_t *sp, void **unlock_cookiep)
   {
           return (dsvcd_lock(sp, DSVCD_RDLOCK, unlock_cookiep));
   }

   static int
   dsvcd_wrlock(dsvc_synch_t *sp, void **unlock_cookiep)
   {
           return (dsvcd_lock(sp, DSVCD_WRLOCK, unlock_cookiep));
   }
Thus, DHCP table consumers can simply call dsvcd_rdlock() or dsvcd_wrlock() [5] to acquire the appropriate lock, completely unaware that doors are being used under the hood to coordinate their requests -- all they know is that if these routines return successfully, they have the requested access.
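
In other words, from a consumer's point of view the whole affair collapses to the familiar lock/operate/unlock bracket -- a paraphrased sketch, with the dsvc_synch_t setup and error handling elided (dsvcd_unlock() is covered below):

   void    *cookie;

   if (dsvcd_wrlock(sp, &cookie) == DSVC_SUCCESS) {
           /* ... modify the DHCP table with exclusive access ... */
           (void) dsvcd_unlock(sp, cookie);
   }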

Of course, dsvcd_lock() does the actual grunt work of marshalling the needed arguments into a structure passed across the door, and issuing the door_call() to dsvclockd:

   static int
   dsvcd_lock(dsvc_synch_t *sp, dsvcd_locktype_t locktype, void **unlock_cookiep)
   {
           door_arg_t              args;
           dsvcd_lock_request_t    request;
           dsvcd_reply_t           reply;
           door_desc_t             *descp;
           int                     unlockfd;
           int                     i;
           dsvcd_synch_t           *dsp = sp->s_data;

           if (dsp->s_lockfd == -1)
                   return (DSVC_NO_LOCKMGR);

           request.lrq_request.rq_version  = DSVCD_DOOR_VERSION;
           request.lrq_request.rq_reqtype  = DSVCD_LOCK;
           request.lrq_locktype            = locktype;
           request.lrq_nonblock            = sp->s_nonblock;
           request.lrq_crosshost           = dsp->s_crosshost;
           request.lrq_conver              = sp->s_datastore->d_conver;

           (void) strlcpy(request.lrq_loctoken, sp->s_loctoken,
               sizeof (request.lrq_loctoken));
           (void) strlcpy(request.lrq_conname, sp->s_conname,
               sizeof (request.lrq_conname));

           args.data_ptr   = (char *)&request;
           args.data_size  = sizeof (dsvcd_lock_request_t);
           args.desc_ptr   = NULL;
           args.desc_num   = 0;
           args.rbuf       = (char *)&reply;
           args.rsize      = sizeof (dsvcd_reply_t);

           if (door_call(dsp->s_lockfd, &args) == -1) {
                   /*
                    * If the lock manager went away, we'll get back EBADF.
                    */
                   return (errno == EBADF ? DSVC_NO_LOCKMGR : DSVC_SYNCH_ERR);
           }
On the dsvclockd side, the request comes into svc_lock(), which sanity-checks the arguments, looks up the appropriate DHCP table, and then calls either cn_rdlock() or cn_wrlock() [6] to perform the actual locking -- here's the tail end of svc_lock(), which acquires the lock and sends a reply back:
        /*
         * Acquire the actual read or write lock on the container.
         */
        dhcpmsg(MSG_DEBUG, "tid %d: %s locking %s", thr_self(),
            lreq->lrq_locktype == DSVCD_RDLOCK ? "read" : "write", cn->cn_id);

        if (lreq->lrq_locktype == DSVCD_RDLOCK)
                reply.rp_retval = cn_rdlock(cn, lreq->lrq_nonblock);
        else if (lreq->lrq_locktype == DSVCD_WRLOCK)
                reply.rp_retval = cn_wrlock(cn, lreq->lrq_nonblock);

        dhcpmsg(MSG_DEBUG, "tid %d: %s %s lock operation: %s", thr_self(),
            cn->cn_id, lreq->lrq_locktype == DSVCD_RDLOCK ? "read" : "write",
            dhcpsvc_errmsg(reply.rp_retval));

        ds_release_container(ds, cn);
        if (reply.rp_retval != DSVC_SUCCESS) {
                ud_destroy(ud, B_FALSE);
                (void) close(door_desc.d_data.d_desc.d_descriptor);
                (void) door_return((char *)&reply, sizeof (reply), NULL, 0);
                return;
        }

        while (door_return((char *)&reply, sizeof (reply), &door_desc, 1)
            == -1 && errno == EMFILE) {
                /* ... */
        }

Doors Through Doors: Implementing Unlock

Those of you keeping score at home have no doubt noticed a large hole in the preceding discussion: how does one unlock a previously locked DHCP table? While we could just implement an unlock operation on the same door used for locking, doors affords us a far more elegant -- and, as we will soon see, downright robust -- solution.

In particular, one little-used but powerful feature of doors is the ability to pass file descriptors through them. This enables a door server to act as an authentication mechanism, granting file descriptors to privileged resources once the door client has flashed the appropriate credentials. In our case, it also can be used to ensure that only door clients that have pending locks on a given DHCP table can unlock it -- and that a given lock can never be unlocked more than once by accident.
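
If you've never passed a descriptor through a door, the server side looks roughly like the contrived sketch below (made-up names; the credential check and error handling are waved away). On the client side, the passed descriptor simply shows up in the desc_ptr array of the door_arg_t once door_call() returns.

   #include <door.h>
   #include <fcntl.h>
   #include <sys/types.h>

   static void
   grant_servproc(void *cookie, char *argp, size_t arg_size, door_desc_t *dp,
       uint_t n_desc)
   {
           door_desc_t     ddesc;

           /* ... verify the caller (e.g., via door_ucred(3C)) first ... */

           ddesc.d_attributes = DOOR_DESCRIPTOR | DOOR_RELEASE;
           ddesc.d_data.d_desc.d_descriptor = open("/some/privileged/file",
               O_RDONLY);

           /* The open descriptor materializes in the calling process. */
           (void) door_return(NULL, 0, &ddesc, 1);
   }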

Specifically, as part of performing the lock operation, dsvclockd allocates another door -- the "unlock" door -- whose descriptor is returned to the client through the original door along with the lock reply. Thus, there is an open unlock door for every reader or writer currently holding a lock. This descriptor can be seen in the final door_return() above (door_desc).

On the client-side, this descriptor eventually bubbles up through dsvcd_lock(), and is then hidden inside the unlock_cookiep that is returned to the dsvcd_rdlock() or dsvcd_wrlock() caller. The dsvcd_unlock() routine converts this cookie back into a descriptor and issues the door_call():

  static int
  dsvcd_unlock(dsvc_synch_t *sp, void *unlock_cookie)
  {
          door_arg_t              args;
          dsvcd_unlock_request_t  request;
          dsvcd_reply_t           reply;
          int                     unlockfd = (int)unlock_cookie;

          request.urq_request.rq_version = DSVCD_DOOR_VERSION;
          request.urq_request.rq_reqtype = DSVCD_UNLOCK;

          args.data_ptr   = (char *)&request;
          args.data_size  = sizeof (dsvcd_unlock_request_t);
          args.desc_ptr   = NULL;
          args.desc_num   = 0;
          args.rbuf       = (char *)&reply;
          args.rsize      = sizeof (dsvcd_reply_t);

          if (door_call(unlockfd, &args) == -1) {
                  /*
                   * If the lock manager went away while we had a lock
                   * checked out, regard that as a synchronization error --
                   * it should never happen under correct operation.
                   */
                  return (DSVC_SYNCH_ERR);
          }

DOOR_UNREF_MULTI: Robust Unlocking

So that's certainly elegant, but what about the robustness I promised? Recall that one of our original objections to USYNC_PROCESS readers/writer locks was that if an application holding a lock crashed, other applications were effectively locked out. As it turns out, we can build on the descriptor-based unlocking model to solve this problem, too.

Specifically, doors provides a special mechanism whereby a door's server procedure is automatically invoked by the kernel when only one reference to the door remains (see DOOR_UNREF and DOOR_UNREF_MULTI in door_create()). Since each lock has its own unlock door, we can use this to be notified automatically if a client holding a lock crashes. Here's the logic in dsvclockd's svc_unlock() associated with this case -- note that since even clients that unlock gracefully eventually close their unlock descriptors, we have to be careful to disambiguate a crash from normal operation.

        /*
         * First handle the case where the lock owner has closed the unlock
         * descriptor, either because they have unlocked the lock and are
         * thus done using the descriptor, or because they crashed.  In the
         * second case, print a message.
         */
        if (req == DOOR_UNREF_DATA) {
                /*
                 * The last reference is ours; we can free the descriptor.
                 */
                (void) mutex_unlock(&ud->ud_lock);
                ud_destroy(ud, B_TRUE);

                /*
                 * Normal case: the caller is closing the unlock descriptor
                 * on a lock they've already unlocked -- just return.
                 */
                if (cn == NULL) {
                        (void) door_return(NULL, 0, NULL, 0);
                        return;
                }

                /*
                 * Error case: the caller has crashed while holding the
                 * unlock descriptor (or is otherwise in violation of
                 * protocol).  Since all datastores are required to be
                 * robust even if unexpected termination occurs, we assume
                 * the container is not corrupt, even if the process
                 * crashed with the write lock held.
                 */
                switch (cn_locktype(cn)) {
                case DSVCD_RDLOCK:
                        dhcpmsg(MSG_WARNING, "process exited while reading "
                            "`%s'; unlocking", cn->cn_id);
                        (void) cn_unlock(cn);
                        break;

                case DSVCD_WRLOCK:
                        dhcpmsg(MSG_WARNING, "process exited while writing "
                            "`%s'; unlocking", cn->cn_id);
                        dhcpmsg(MSG_WARNING, "note that this write operation "
                            "may or may not have succeeded");
                        (void) cn_unlock(cn);
                        break;

                case DSVCD_NOLOCK:
                        dhcpmsg(MSG_CRIT, "unreferenced unheld lock");
                        break;
                }

                (void) door_return(NULL, 0, NULL, 0);
                return;
        }
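
For those who haven't played with the unref mechanism directly: a door created with DOOR_UNREF_MULTI has its server procedure invoked with the special argument DOOR_UNREF_DATA each time the door's reference count drops to one (that is, when only the server's own reference remains). Stripped to its essentials -- and with made-up names -- it looks like this:

   #include <door.h>
   #include <sys/types.h>

   static void
   unref_servproc(void *cookie, char *argp, size_t arg_size, door_desc_t *dp,
       uint_t n_desc)
   {
           if (argp == DOOR_UNREF_DATA) {
                   /* The last client reference just went away. */
                   /* ... clean up (e.g., unlock for a crashed client) ... */
                   (void) door_return(NULL, 0, NULL, 0);
                   return;
           }

           /* ... normal (graceful) request processing ... */
           (void) door_return(NULL, 0, NULL, 0);
   }

   /* ... elsewhere: did = door_create(unref_servproc, NULL, DOOR_UNREF_MULTI); */
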
Finally, let's take a look at the code that allocates the unlock descriptor in dsvclockd -- I've left it for last because, as the comments make clear, it's particularly subtle:
        /*
         * We need another door descriptor which is passed back with the
         * request.  This descriptor is used when the caller wants to
         * gracefully unlock or when the caller terminates abnormally.
         */
        ud = ud_create(cn, &reply.rp_retval);
        if (ud == NULL) {
                ds_release_container(ds, cn);
                (void) door_return((char *)&reply, sizeof (reply), NULL, 0);
                return;
        }

        /*
         * We pass a duped door descriptor with the DOOR_RELEASE flag set
         * instead of just passing the descriptor itself to handle the case
         * where the client has gone away before we door_return().  Since
         * we duped, the door descriptor itself will have a refcount of 2
         * when we go to pass it to the client; if the client does not
         * exist, the DOOR_RELEASE will drop the count from 2 to 1 which
         * will cause a DOOR_UNREF_DATA call.
         *
         * In the regular (non-error) case, the door_return() will handoff
         * the descriptor to the client, bumping the refcount to 3, and
         * then the DOOR_RELEASE will drop the count to 2.  If the client
         * terminates abnormally after this point, the count will drop from
         * 2 to 1 which will cause a DOOR_UNREF_DATA call.  If the client
         * unlocks gracefully, the refcount will still be 2 when the unlock
         * door server procedure is called, and the unlock procedure will
         * unlock the lock and note that the lock has been unlocked (so
         * that we know the DOOR_UNREF_DATA call generated from the client
         * subsequently closing the unlock descriptor is benign).
         *
         * Note that a DOOR_UNREF_DATA call will be generated *any time*
         * the refcount goes from 2 to 1 -- even if *we* cause it to
         * happen, which by default will happen in some of the error logic
         * below (when we close the duped descriptor).  To prevent this
         * scenario, we tell ud_destroy() *not* to cache the unlock
         * descriptor, which forces it to blow away the descriptor using
         * door_revoke(), making the close() that follows benign.
         */
        door_desc.d_attributes = DOOR_DESCRIPTOR|DOOR_RELEASE;
        door_desc.d_data.d_desc.d_descriptor = dup(ud->ud_fd);
        if (door_desc.d_data.d_desc.d_descriptor == -1) {
                dhcpmsg(MSG_ERR, "cannot dup unlock door; denying %s "
                    "lock request", cn->cn_id);
                ud_destroy(ud, B_TRUE);
                ds_release_container(ds, cn);
                reply.rp_retval = DSVC_NO_RESOURCES;
                (void) door_return((char *)&reply, sizeof (reply), NULL, 0);
                return;
        }
If you've been inspired to implement something similar, pay particular attention to the final paragraph of the block comment above -- it took me days to track down that bloody bug!

The End

The dsvclockd daemon integrated into Solaris 9, has been in widespread use by both internal and external customers ever since, and has been incident-free to this point [7].

While far from a comprehensive tour of dsvclockd, I hope this overview has illustrated the power of the Solaris doors IPC mechanism, and more importantly has caused you to think about how to use (or abuse!) Solaris doors to solve a thorny inter-process problem or two of your own.

Footnotes:

[1] The use of truss as a verb follows the spirit Roger Faulkner (the author of truss) intended. To wit, he once admitted that one of the primary reasons for naming the tool "truss" was that one is rarely in a good frame of mind when using it -- thus, technicalities aside, truss should be considered a four-letter word, and "oh, truss it!" a cathartic confession.
[2] In the meantime, a USYNC_PROCESS_ROBUST mutex has been added to Solaris. This could probably in turn be used to implement a USYNC_PROCESS_ROBUST readers/writer lock -- but the other issues still remain.
[3] The discussion of how dsvclockd implements inter-machine synchronization will have to wait for another time -- though brave readers are directed to the block comment atop container.c, which starts unabashedly with:

	/*
	 * Container locking code -- warning: serious pain ahead.
	 */

[4] These DHCP tables are known as "containers" in the source code -- a term that sadly has since become extremely overloaded, especially due to Zones.
[5] In fact, there is an additional abstraction layer that allows pluggable locking strategies to be added -- but we have yet to implement any additional strategies. If I had to do it over again, I'd torch that layer.
[6] Of course, the devil's in the details, and the implementation of these routines is proof -- in addition to containing the aforementioned host-locking code, there are some other parlor tricks of note, such as the use of stack-local condition variables in cn_wait_for_lock(). You may also be curious why there is not a USYNC_THREAD readers/writer lock at the core of dsvclockd's implementation -- the problem is that we cannot guarantee that the same thread that write-locked a given DHCP table will be the one that unlocks it (since a random door server worker thread is tasked with the unlock operation).
[7] Though there have been several general doors bugfixes applied to it -- see 4786633 and 6223254. Also, the development of dsvclockd uncovered several critical doors bugs -- see 4321534 and 4349741.
