Friday Sep 19, 2008

"bug buddy" is no friend of mine

In academia, when the name of a subject matter contains the word "science," it's a fair bet that there's little actual science involved, if any at all. It looks like a similar principle is involved with software: if it calls itself someone's "buddy," it's a fair bet that it's unfriendly.

This certainly applies to GNOME's "bug buddy." This utility intervenes on crash of a GNOME application, and tries its level best to format up a problem report. Unfortunately, its best just isn't anywhere near good enough in most cases, and by design it prevents a core dump from happening. That latter attribute means that it actually \*prevents\* someone from fixing the problem, as core dumps contain far more information than just the stack trace that "bug buddy" would have logged on its best day, and thus it annoys me to no end.

After much tinkering, I've found a way to turn this bit of nonsense off in Solaris when logging in via dtlogin. The environment for GNOME is set by /usr/dt/config/Xinitrc.jds. Fortunately, like all good X applications, it looks at /etc/dt/config first, so we can use this to set a new variable.

Create /etc/dt/config/Xinitrc.jds with this for contents:

#!/bin/sh
export GNOME_DISABLE_CRASH_DIALOG=1
exec /usr/dt/config/Xinitrc.jds "$@"
Mark the file as executable (chmod +x), and the problem is solved.

For gdm, it's sufficient to add this to /etc/profile:

GNOME_DISABLE_CRASH_DIALOG=1
export GNOME_DISABLE_CRASH_DIALOG
Unlike dtlogin, gdm reads /etc/profile by default.

Tuesday Jun 14, 2005

Non-Root Builds

In ancient times,1 it was necessary to be root in order to build ON (the "operating system and networking" part of Solaris). Every once in a while, this would have tragic consequences, such as vaporizing an important build server with one stroke of "rm -rf /" invoked by setting $ROOT to "/". But mostly it served as an annoyance factor: you couldn't delete your own workspaces (because they'd have UID 0 files in them), you had to give out the keys to the farm to all your build server users, you couldn't necessarily figure out who was using the machine, and so on.

As part of Solaris 10, I removed this restriction. Much remains to be done, as some still needlessly cling to the root-to-build lifestyle, but many use this new feature. This is the story of how it was done.

Build history

The old build system was centered around the concept of the "proto area." This is pointed to by $ROOT, and is typically set to "proto/root_$MACH/" in the current workspace.

The proto area is intended to be almost an exact image of the system being constructed. Each "make install" rule copies its resulting binaries and configuration files to the final location they'd have on the installed system, treating "$ROOT" as the root directory. There's thus "$ROOT/etc", "$ROOT/lib", "$ROOT/usr/bin", and so on.

The "almost" part is that we also deliver a few internal test objects, internal libraries, and other detritus to the proto area, so it's more like a superset of the final system ... minus the things that come from other consolidations.

This proto area is used for multiple purposes:

  • It is used during the build itself, with "-I" (include path) and "-L" (library path) set to point into this area. This way, we can build our libraries and executables using the bits in the workspace rather than whatever (usually older) bits happen to be on the build server.

  • It is used to simplify the construction of System V packages, which are used to deliver the software. The pathnames used in the prototype(4) files can be set to the actual installed location, rather than having to use "=" everywhere.

  • It is used to construct cpio archives (called "BFU images") that can be quickly installed on top of test systems. See the bfu shell script for details.

  • Some developers copy things directly out of $ROOT onto a running system during development, and having the objects in the right part of the tree makes that simpler. (Caution: don't try this at home, kids. We're what you'd call "experts.")

  • Finally, many developers expect to be able to do "cd $ROOT" and see the files that would be installed on the system before actually doing an install, in order to verify that their changes worked as intended.

In order to do all that, though, the developer needs to be root when he2 invokes "make install" in order to create the usual root-owned files, set-uid binaries, and other things that will appear on the system.

Alternatives

Various proposals were considered when figuring out how to eliminate the use of root during the build process. A few of those were:

  • Use RBAC so that ordinary developers can invoke "make," and then the installer itself (blessed by the RBAC profile) automatically runs as root. This fails as a complete solution because there's still a UID root process running, and the user who invoked it can still torch the build system with an errant directory path. It's also complicated because the build server administrator will need to add explicit permissions for these users

  • Create a new filesystem that ordinary developer can mount over $ROOT, and that has no restrictions on setting UIDs and the like. This solves the RBAC problem, but creates two new ones: ordinary users could possibly use this to side-step build system security, and it's an awfully complicated solution to a simple problem.

  • Create special install tools that all Makefiles must use, and that record file owner/group information in a separate hidden file. Then run a new tool (as root) over that saved information and fix up the proto area to have the desired owners and groups. This is potentially troublesome for build performance, and still doesn't entirely get rid of root.

An additional, hidden design flaw exists in the old build system and all of the above answers. When the developer installs a file in the proto area, he must make sure that it gets installed with the right UID, GID, and mode bits, usually by manipulating Makefile variables. But he also must add that very same information into the packaging prototype files. If the two are out of sync, then needless (and sometimes hard to understand) build failures result. As a basic rule, copying the same information to two different places is just bad design.

Solution

Instead, I chose to go a different route:

  1. When the build process is run as root, we will install into the proto area as we always have in the past, with all the right user IDs and other bits. This is for those "legacy" developers who insist on running the build as root and getting the same results.

  2. When it's run as a non-root UID, things would "mostly" still be installed with the right modes (permissions), but with the user and group set to the developer's default.

  3. To create BFU (cpio) archives, I wrote a new utility called cpiotranslate. This takes a cpio data stream on standard input and copies it to standard output after modifying the owner, group, and mode in the archive headers according to the data read from the packaging database.

  4. Because a few developers like to use the new tools to build very old workspaces, I included a new option "o" in NIGHTLY_OPTIONS that restores the old build behavior by disabling cpiotranslate. This shouldn't be used on workspaces that are based on Solaris 10 or newer.

Packaging, of course, never cared about the owner, group, or mode of the files in the proto area, because it has its own internal information for that.

This solution means that the files in the proto area no longer have the right owner/group set on them. Developers who insist on seeing this will have to take an additional step after building. For this purpose, I modified protocmp, which is ordinarily used by the build process to validate the file modes in the proto area, so that it can restore the correct owner/group/mode to proto area files as recorded in the packaging database. To invoke this functionality (you must be root), run protocmp -GU -d $SRC/pkgdefs $ROOT. This uses essentially the same logic as cpiotranslate, just applied to files rather than a cpio stream.

The cpiotranslate utility plugs into the existing mkbfu and makebfu pair. Neither of these is really intended for ordinary use. Instead, nightly normally invokes them when needed during the build process. The upper-half makebfu script sets up some environment variables and then invokes mkbfu like this:

        mkbfu   -f "cpiotranslate -e $pkg/$exc $bpkgargs $pkg" \\
                $zflag $ROOT $CPIODIR
else
        mkbfu $zflag $ROOT $CPIODIR

The first case is for non-root builds, and the second is for traditional root builds. The "-f" argument specifies a "filter" that is inserted into the usual "find | cpio | gzip" string when building the archives.

An interesting side-effect of the non-root build process is that several directories and files had to be made owner-readable and owner-writable. We'd previously relied on root's ability to access directories and files regardless of mode in order to build the proto area; e.g., creating a directory with mode 0555 and then copying in a file. This doesn't work for ordinary users. Fortunately, setting mode bits 0300 is essentially a no-op for root. The ARC documentation to describe these changes included:

Name				Old	New	Note	Stability
----------------------------	----	----	----	------------------
usr/bin/ct			4111	4511	A	Stable
usr/bin/cu			4111	4511	A	Stable
usr/bin/uucp			4111	4511	A	Standard (SCD-2.0)
usr/bin/uuglist			4111	4511	A	Stable
usr/bin/uuname			4111	4511	A	Stable
usr/bin/uustat			4111	4511	A	Standard (SCD-2.0)
usr/bin/uux			4111	4511	A	Standard (SCD-2.0)
usr/lib/pt_chmod		4111	4511	A	Project Private
usr/lib/spell			555	755	B	Stable
usr/lib/uucp/bnuconvert		110	510	A	Project Private
usr/lib/uucp/remote.unknown	4111	4511	A	Project Private
usr/lib/uucp/uucheck		110	510	A	Stable
usr/lib/uucp/uucico		4111	4511	A	Stable
usr/lib/uucp/uucleanup		110	510	A	Stable
usr/lib/uucp/uusched		4111	4511	A	Stable
usr/lib/uucp/uuxqt		4111	4511	A	Stable
usr/sadm/install/scripts	555	755	B	Sun Private
var/spool/cron/crontabs/adm	400	600	C	Evolving
var/spool/cron/crontabs/root	400	600	C	Evolving
var/spool/cron/crontabs/sys	400	600	C	Evolving

The changes break down into three categories, indicated by the "Note"
field above.  These are:

  A - executables to which the "read by owner" 0400 permission has
      been added.  The reason this is done is so that the package
      builds, which read these files from the proto area, will work
      when the build process isn't running as root.  The current
      package build process assumes that the invoker is root, so that
      even 'unreadable' files are readable.  Since the contents of the
      executables are already visible in packages and patches, it's
      assumed that allowing the file owner to read the file contents
      is not a security risk.

  B - directories to which the "write by owner" 0200 permission has
      been added.  The build process assumes that the invoker is root,
      and thus installing files into unwritable directories is
      possible.  In both cases above, the actual directory owner is
      root, so adding write permission (which root already enjoys)
      changes nothing about the overall access model.

  C - files to which the "write by owner" 0200 permission has been
      added.  In most cases, the build process does not require that
      files in the proto area be writable.  However, in the case of
      these three files, the build process incrementally manipulates
      these files (">> file") during the build.  It thus relies on
      root's ability to write to unwritable files, and won't work when
      the invoker is not root.  Since these files are always owned by
      root, this adds no new permissions, and no bypass of
      cron.{allow,deny} is possible.

The one slightly controversial part of this change is 'C'.  This
change means that a foolhardy administrator running as 'root' who runs
'vi' on one of the /var/spool/cron/crontabs files (instead of doing
"crontab -e" as documented) will see no error when doing ":w", where
he once would get a "read only file" warning.  Of course, that same
administrator running as root could always have typed ":w!" to
overwrite the file, so the additional exposure seems slight.

One drawback to the whole non-root plan is that, as many people start using the non-root build system, root builds (which still rely on the duplicated information for file ownership) will start rotting on the vine. They won't be routinely tested, and the (now unnecessary) Makefile goop that makes them work will likely be broken by accident by people who test their changes with only a non-root build. This effect is minimized by continuing to set the mode in the proto area, even though it's not needed, but can't easily be eliminated.

The potential for silent discrepancies means that the option to build as root does need to be removed from the system. Someday.

Aftermath

No good deed should go unpunished. As expected, there was a fair amount of fallout from these changes. In order to test, I did root and non-root builds with various settings of "nightly" flags to produce likely build scenarios. I caught many potential problems and fixed them, but I didn't catch all of them.

The changes broke the obscure "realmode" build, as documented in 4801654. The "realmode" builds were used to build special x86 boot bits.

The problem reported in 4864878 was due to the stderr handling in the BFU build scripts. It was at least in part an existing problem, but non-root build made it worse.

And 4858563, which broke old-style root builds, was a foolish typo on my part.

As a possible warning to others, once you touch one of these sorts of things, other folks seem to assume you "own" it and can debug their problems.


Footnotes

1. This refers to Solaris 9 and before.

2. I use the standard English masculine gender pronoun to refer to persons of unknown gender because all of the alternatives are just plain awkward. Much like this footnote. Unless otherwise noted, no actual chauvinism is intended. (Thanks, Mr. Knuth!)


Technorati Tag:
Technorati Tag:

Close Hang (or 4028137)

Every once in a while, a bug sticks with you long enough that you remember the ID number without having to think about it, and you find yourself mentioning the number in conversation as though it were universally known. 4028137 is one of those bugs. Like a public works project gone horribly wrong, many people had contributed to this bug over many years without actually coming much closer to a real solution. This entry describes how we got a solution.

The cause of the problem

This problem has been around since we switched over to a System V code base, and it's deceptively simple. Serial ports (especially those used for console lines) seem to "hang" in a state where the process using the port cannot be killed; it becomes <defunct> and hangs around forever. Worse, the serial port itself cannot be used again until the system is rebooted. Other processes trying to open the port become wedged in the specfs layer. Those innocent victims are the first thing the user notices, and they look like this in mdb(1):

cv_wait_sig(0x61f83db8,0x61f83db0,0x62a26860,0x0,0x1,0x66817710)
spec_open(0x61f83dc0,0x2202,0x64402d20,0x61f83dc4,0x1d,0x740000) + 158
cnopen(?) + 7c
dev_open(0x3190192c,0x2202,0x2,0x64402d20,0x10439000,0x61f83dc4)
spec_open(0x61f82aa0,0x2202,0x64402d20,0x61f82aa4,0x0,0x0) + 5c0
vn_open(0x0,0x0,0x2302,0x3f01b6,0x31901a74,0x0) + 2dc
copen(0x38de0,0x2302,0x1b6,0x37400,0xef7a3680,0x196e8) + 84

That's spec_open attempting to do LOCK_CSP_SIG(). It's stuck because we have SLOCKED set, which means that there's another thread somewhere in open or close on this same stream. Quickly scanning through the ::threadlist output for spec_open and spec_close, we see this culprit:

cv_wait_sig+0x160 (0x61022dd2,0x61022db0,0x61015048,0x0,0x60fbf5d8,0x60ff7c40)
qwait_sig+0x180 (0x61022dc2,0x40,0xc0006,0x61022dd2,0x61022db0,0x1)
ldtermclose+0x264 (0x61022d00,0x4,0x60d48ee0,0x60aaaf48,0x60d48f4c,0x80000)
qdetach+0x12c (0x61022d00,0x1,0x3,0x6049d9e0,0x1046c1b8,0x61022d58)
strclose+0x4b0 (0x6102219e,0x610215b0,0x61021578,0x61022d00,0x61022164,0x610215b4)
device_close+0x6c (0x60e8e464,0x3,0x6049d9e0,0x4,0x60e8e4f4,0x740001)
spec_close+0xd8 (0x60e8e464,0x3,0x60e8e564,0x60e8e56c,0x60e8e564,0x6049d9e0)
closef+0xa8 (0x60b4e428,0x0,0x60ff7c40,0x1,0x3,0x60e8e464)
closeall+0x4c (0x1,0x0,0x61015614,0x60b4e428,0x4,0x6100b420)
exit+0x134 (0x1,0x0,0x60fbf5d8,0x610151f0,0xef7948a8,0x0)
syscall_trap+0x194 (0x0,0x0,0xef791000,0x0,0xef7948a8,0xffffffff)

So how does this happen? In mdb we see that the ldterm(7M) STREAMS module is stuck waiting on a reply from the driver that never arrives. Turning to the code, the ldtermclose routine sends down a TCSBRK ioctl to the driver in order to drain the remaining output data before allowing the close to complete. A quick look at the driver shows that it's stuck on flow control and can't drain the data. This code before the fix looked like this (eliding a few lines of noise):

	/\*
	 \* Wait for output to drain completely from the modules and
	 \* driver below us before completing the close.  Doing so
	 \* guarantees that the proper tty semantics -- in particular,
	 \* flow control processing -- still apply for the remaining
	 \* pending output.
	 \*
	 \* We do this by sending a TCSBRK ioctl downstream and waiting
	 \* for the driver's response.  (When issued with a nonzero
	 \* argument, this ioctl does nothing other than wait for
	 \* output to drain, which is precisely what we want.)
	 \*/
	/\*
	 \* Set up the ioctl.
	 \*/
[...]
	iocb = (struct iocblk \*)bp->b_rptr;
	iocb->ioc_count = sizeof (int);

	\*(int \*)datap->b_rptr = 1;	/\* arbitrary nonzero value \*/
	datap->b_wptr += sizeof (int);
	bp->b_cont = datap;

	/\*
	 \* Fire it off.
	 \*/
	tp->t_state |= TS_IOCWAIT;
	tp->t_iocid = iocb->ioc_id;
	(void) putq(WR(q), bp);

	/\*
	 \* Now wait for it.  Let our read queue put routine wake us
	 \* up when it arrives.
	 \*/
	while (tp->t_state & TS_IOCWAIT) {
		if (!qwait_sig(q))
			break;
	}

So why does ldterm do this? The code first appears in the 1.11 version of the driver from 1990, and it seems the real answer is lost to time, but the likely reason is that some serial drivers written back then may have discarded data on close, and that's not what was wanted. Rather than fix all of the drivers to drain when expected, someone added this ioctl in the ldterm path as a "global" fix.

That's not the whole story, though. Ldterm is stuck in qwait_sig(7F), which should have allowed us to send a signal to wake the process back up. Why isn't that working? Looking further up the stack, we see exit calling closeall which gets us here. The problem is that the closeall invocation happens after exit has disabled all signal reception. We just can't wake up on signals here.

It turns out that both of these are "right." A serial stream must, according to the standards, drain the output data when close is called, and must behave as though close were called on all open descriptors when exit is called. But it's also true that once exit is called, the process is torn down, and there's no legitimate way to deliver any signals that might interrupt the stream-close handling because there's no user left to which to send them.

Towards a solution

From here, the fix isn't clear. Over the years, several possibilities were proposed, including:

  • Just remove the TCSBRK from ldterm. While the use of TCSBRK here was likely flawed from the start (the standards don't say that you must have the ldterm module pushed in order for drain-on-close to work right), removing this is not a fix. This doesn't work because some serial drivers do correctly drain data during close(9E) handling, and those will be just as stuck during exit as ldterm once was. Worse, any broken drivers that fail to drain data as expected will now have latent bugs exposed to the user.

  • Move the call to closeall before the place where signals are disabled. This doesn't work because it means that an application could unexpectedly get a signal invocation after having called exit, and after all atexit routines have returned, which violates the expected semantics.

  • "Fix" cv_wait_sig so that it wakes up on a signal even if there's no user thread available. This could possibly be made to work but has at least one ugly (and decidedly non-UNIX) side-effect: the user would have to issue SIGKILL multiple times to get it to work.

  • Don't drain the data at exit time, and instead just flush it as implied by the standards. This doesn't work at all. A simple cat /etc/motd > /dev/cua/a would never work right: when cat exited, its stdout would be closed, and the output would be discarded.

  • Keep the driver structures allocated and draining data even though the user is allowed to complete the close. This won't work because it means that we'll be lying to the application, and telling it that we've transmitted data that may never actually make it out onto the wire.

In short, none of these is really right, and all cause other problems. We need a better answer. There are too many constraints here to achieve perfection, so we need to shoot for the least awful behavior. The answer I concocted looks like this:

  1. When the driver's close(9E) routine is called by a user thread invoking close(2), the user's thread should wait forever for data to drain. If a signal is received, though, we can discard the output data and finish the close. (Note that the user can himself use TCSBRK with a non-zero argument to drain the output stream without losing any enqueued data on signal reception.)
  2. When it's called by exit(2), the thread should wait as long as we're still making forward progress. If a "long time" (described below) passes without transmitting data, then we just assume we'll never finish, and we can give up.
  3. Ldterm needs to figure out what sort of underlying driver is present. If it's one that's known to handle close properly (as documented above), then it should just terminate like any good STREAMS module. If not, then it must resort to the legacy TCSBRK behavior.
  4. A set of special conditions exist under which draining output is just impossible, and we should always discard instead without waiting at all. These need to be handled in each serial driver:
    • User sent TIOCSBRK without TIOCCBRK before closing. We'll be in break state "forever," so no more output is possible.
    • User sent TCXONC with argument 0 ("suspend output") before closing.
    • User set the O_NDELAY or O_NONBLOCK flags on the descriptor being closed. We cannot block to drain the data.

This gives us the chance to allow for reasonable behavior out of well-designed applications (those that carefully close descriptors when done with them and use the existing termio(7I) ioctls to good effect), but prevents things such as a ham-handed administrator using SIGKILL on tip(1) from rendering serial ports permanently unusable.

To do all this, we need two new features: a new DDI function (ddi_can_receive_sig(9F)) to determine if it's possible to receive signals in the current context (i.e., we're not being called from process exit) and a new private message (MC_POSIXQUERY, using an ioctl-formatted M_CTL just like the existing MC_CANONQUERY).

The new close logic can be seen in (among other serial drivers) asyclose and async_progress_check.

static void
async_progress_check(void \*arg)
{
	struct asyncline \*async = arg;
	struct asycom	 \*asy = async->async_common;
	mblk_t \*bp;

	/\*
	 \* We define "progress" as either waiting on a timed break or delay, or
	 \* having had at least one transmitter interrupt.  If none of these are
	 \* true, then just terminate the output and wake up that close thread.
	 \*/
	mutex_enter(&asy->asy_excl);
	mutex_enter(&asy->asy_excl_hi);
	if (!(async->async_flags & (ASYNC_BREAK|ASYNC_DELAY|ASYNC_PROGRESS))) {
		async->async_ocnt = 0;
		async->async_flags &= ~ASYNC_BUSY;
		async->async_timer = 0;
		bp = async->async_xmitblk;
		async->async_xmitblk = NULL;
		mutex_exit(&asy->asy_excl_hi);
		if (bp != NULL)
			freeb(bp);
		/\*
		 \* Since this timer is running, we know that we're in exit(2).
		 \* That means that the user can't possibly be waiting on any
		 \* valid ioctl(2) completion anymore, and we should just flush
		 \* everything.
		 \*/
		flushq(async->async_ttycommon.t_writeq, FLUSHALL);
		cv_broadcast(&async->async_flags_cv);
	} else {
		async->async_flags &= ~ASYNC_PROGRESS;
		async->async_timer = timeout(async_progress_check, async,
		    drv_usectohz(asy_drain_check));
		mutex_exit(&asy->asy_excl_hi);
	}
	mutex_exit(&asy->asy_excl);
}
[...]
/\*
 \* Close routine.
 \*/
/\*ARGSUSED2\*/
static int
asyclose(queue_t \*q, int flag, cred_t \*credp)
{
[...]
	/\*
	 \* There are two flavors of break -- timed (M_BREAK or TCSBRK) and
	 \* untimed (TIOCSBRK).  For the timed case, these are enqueued on our
	 \* write queue and there's a timer running, so we don't have to worry
	 \* about them.  For the untimed case, though, the user obviously made a
	 \* mistake, because these are handled immediately.  We'll terminate the
	 \* break now and honor his implicit request by discarding the rest of
	 \* the data.
	 \*/
	if (async->async_flags & ASYNC_OUT_SUSPEND) {
		if (async->async_utbrktid != 0) {
			(void) untimeout(async->async_utbrktid);
			async->async_utbrktid = 0;
		}
		mutex_enter(&asy->asy_excl_hi);
		lcr = ddi_io_get8(asy->asy_iohandle, asy->asy_ioaddr + LCR);
		ddi_io_put8(asy->asy_iohandle,
		    asy->asy_ioaddr + LCR, (lcr & ~SETBREAK));
		mutex_exit(&asy->asy_excl_hi);
		async->async_flags &= ~ASYNC_OUT_SUSPEND;
		goto nodrain;
	}

	/\*
	 \* If the user told us not to delay the close ("non-blocking"), then
	 \* don't bother trying to drain.
	 \*
	 \* If the user did M_STOP (ASYNC_STOPPED), there's no hope of ever
	 \* getting an M_START (since these messages aren't enqueued), and the
	 \* only other way to clear the stop condition is by loss of DCD, which
	 \* would discard the queue data.  Thus, we drop the output data if
	 \* ASYNC_STOPPED is set.
	 \*/
	if ((flag & (FNDELAY|FNONBLOCK)) ||
	    (async->async_flags & ASYNC_STOPPED)) {
		goto nodrain;
	}

	/\*
	 \* If there's any pending output, then we have to try to drain it.
	 \* There are two main cases to be handled:
	 \*	- called by close(2): need to drain until done or until
	 \*	  a signal is received.  No timeout.
	 \*	- called by exit(2): need to drain while making progress
	 \*	  or until a timeout occurs.  No signals.
	 \*
	 \* If we can't rely on receiving a signal to get us out of a hung
	 \* session, then we have to use a timer.  In this case, we set a timer
	 \* to check for progress in sending the output data -- all that we ask
	 \* (at each interval) is that there's been some progress made.  Since
	 \* the interrupt routine grabs buffers from the write queue, we can't
	 \* trust changes in async_ocnt.  Instead, we use a progress flag.
	 \*
	 \* Note that loss of carrier will cause the output queue to be flushed,
	 \* and we'll wake up again and finish normally.
	 \*/
	if (!ddi_can_receive_sig() && asy_drain_check != 0) {
		async->async_flags &= ~ASYNC_PROGRESS;
		async->async_timer = timeout(async_progress_check, async,
		    drv_usectohz(asy_drain_check));
	}
	while (async->async_ocnt > 0 ||
	    async->async_ttycommon.t_writeq->q_first != NULL ||
	    (async->async_flags & (ASYNC_BUSY|ASYNC_BREAK|ASYNC_DELAY))) {
		if (cv_wait_sig(&async->async_flags_cv, &asy->asy_excl) == 0)
			break;
	}
	if (async->async_timer != 0) {
		(void) untimeout(async->async_timer);
		async->async_timer = 0;
	}

nodrain:

If we can't receive signals, then we start a timer to check every 15 seconds to see if we're making any progress. If not, we flush everything and let the close complete. The detection logic for enhanced drivers is in ldtermopen, where we send MC_POSIXQUERY, and ldtermrput, where we handle the MC_HAS_POSIX reply. Old drivers will just discard MC_POSIXQUERY and we'll keep the preallocated TCSBRK drain message. New ones send up MC_HAS_POSIX in reply, and ldterm then discards the preallocated TCSBRK message so it won't be used during close. (The "ldterm_drain_limit" flag below is an undocumented /etc/system flag that allows us to disable this logic just in case it runs into trouble in the field, rather than having to recompile the module and deliver a patch.)

ldtermopen(queue_t \*q, dev_t \*devp, int oflag, int sflag, cred_t \*crp)
[...]
	/\*
	 \* Find out if the underlying driver supports proper POSIX close
	 \* semantics.  If not, we'll have to approximate it using TCSBRK.  If
	 \* it does, it will respond with MC_HAS_POSIX, and we'll catch that in
	 \* the ldtermrput routine.
	 \*
	 \* When the ldterm_drain_limit tunable is set to zero, we behave the
	 \* same as old ldterm: don't send this new message, and always use
	 \* TCSBRK during close.
	 \*/
	if (ldterm_drain_limit != 0) {
		if ((qryp = open_ioctl(q, MC_POSIXQUERY)) == NULL)
			goto open_abort;
		qryp->b_datap->db_type = M_CTL;
		putnext(wq, qryp);
	}
[...]
ldtermrput(queue_t \*q, mblk_t \*mp)
[...]
		case MC_HAS_POSIX:
			/\* no longer any reason to drain from ldterm \*/
			if (ldterm_drain_limit != 0) {
				freemsg(tp->t_drainmsg);
				tp->t_drainmsg = NULL;
			}
			break;

Compared to negotiating the shark-infested design waters, the fix itself looks simple.

Finishing the job

The next big step was to test this fix (during which I found around a dozen other lurking problems) and then chase down follow-on problems either uncovered or caused by this fix, and update documentation and third-party developers about the problem and solution. In fact, the testing alone for this fix took many times the amount of effort than finding, fixing, and documenting the problem, and running through all the development process steps and reviews. It took many weeks on-and-off running both manual and automated test suites. But that story is a (lengthy) blog for another day.


Technorati Tag:
Technorati Tag:
Technorati Tag:
About

carlson

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today
News
Blogroll