Bug Wars: The Phantom Pool Bug Menace

Most systems programmers like to swap tales about tackling tricky or annoying bugs. Now, after a month of pulling my hair out, I can share my first "bug war" story as a systems programmer.

A somewhat long time ago, in a Sun Microsystems office not too far away, I occasionally encountered system panics of the following form while running my bug fixes through the standard zones test suite in snv_96:

assertion failed: pool->pool_ref == 0, file: ../../common/os/pool.c, line: 454

src.opensolaris.org located the assertion in pool_pool_destroy():

428 /*
429  * Destroy specified pool, and rebind all processes in it
430  * to the default pool.
431  */
432 static int
433 pool_pool_destroy(poolid_t poolid)
434 {
435 	pool_t *pool;
436 	int ret;
438 	ASSERT(pool_lock_held());
440 	if (poolid == POOL_DEFAULT)
441 		return (EINVAL);
442 	if ((pool = pool_lookup_pool_by_id(poolid)) == NULL)
443 		return (ESRCH);
444 	ret = pool_do_bind(pool_default, P_POOLID, poolid, POOL_BIND_ALL);
445 	if (ret == 0) {
446 		struct destroy_zone_arg dzarg;
448 		dzarg.old = pool;
449 		dzarg.new = pool_default;
450 		mutex_enter(&cpu_lock);
451 		ret = zone_walk(pool_destroy_zone_cb, &dzarg);
452 		mutex_exit(&cpu_lock);
453 		ASSERT(ret == 0);
454 		ASSERT(pool->pool_ref == 0);
455 		(void) nvlist_free(pool->pool_props);
456 		id_free(pool_ids, pool->pool_id);
457 		pool->pool_pset->pset_npools--;
458 		list_remove(&pool_list, pool);
459 		pool_count--;
460 		pool_pool_mod = gethrtime();
461 		kmem_free(pool, sizeof (pool_t));
462 	}
463 	return (ret);
464 }

Line 454 caused the panic, so something still held a reference to the dying pool. Reading the source further, I discerned that pool_do_bind() was supposed to rebind all processes within the pool specified by poolid to the pool to which the first function argument referred (in this case, the default pool). The zone callback invoked on line 451 only set a zone's pool and processor set associations; it did not rebind processes. pool_do_bind() returned zero after completing successfully, so the problem was that pool_do_bind() sometimes claimed to have rebound every process from the dying pool to the default pool when in fact it had not.

I dug around the source tree and determined that a process' pool association (as indicated by the proc_t structure's p_pool field) only changed when a new system process spawned (that is, a process without a parent spawned; see newfork() in uts/common/os/fork.c), a process forked (cfork() in uts/common/os/fork.c), a process exited (proc_exit() in uts/common/os/exit.c), or a process was bound to a pool via pool_do_bind().

A gentle introduction to pool rebinding is necessary before I proceed further. When a process forks or exits, it enters the pool barrier, which encloses operations that are sensitive to changes in the process' pool binding. (In other words, a process' pool binding should not change while the process is within the pool barrier.) The pool barrier is sandwiched between invocations of pool_barrier_enter() and pool_barrier_exit() (see uts/common/os/fork.c:211-224,229-236,299-309,525-527,668-672 and uts/common/os/exit.c:489-493,590-605).

When a pool (call it P) is destroyed, pool_do_bind() (uts/common/os/pool.c:1239-1647) is invoked to rebind all processes within P to the default pool. pool_do_bind() creates an array of proc_t pointers called procs that can hold twice the number of active processes; procs will hold pointers to all processes that will be rebound to the default pool. Once procs is allocated, pool_do_bind() grabs pidlock and enters what I will call the first loop, which adds all active processes bound to P to procs (see pool.c:1359-1432). These processes are also marked with the PBWAIT flag (pool.c:1408), which causes them to block in pool_barrier_enter() and pool_barrier_exit(), effectively stopping them from entering or exiting the pool barrier. Once the first loop is done, pool_do_bind() releases pidlock and waits for all processes in procs that were within the pool barrier when marked with PBWAIT to block at pool_barrier_exit(). This guarantees that pool rebinding won't occur while the targeted processes are sensitive to pool rebinding.

Once the thread in pool_do_bind() resumes execution, it enters what I will call the second loop, which checks if the children of the processes in procs should be added to procs. This loop catches any processes that were spawned via cfork() while the thread in pool_do_bind() waited for marked processes to block at pool_barrier_exit(). (Note that a newly-spawned process' LWPs are not started until the parent process exits the pool barrier.) Once the second loop completes, pool_do_bind() rebinds the processes in procs to the default pool, adjusts P's reference count, and wakes all processes in procs that are blocked within pool_barrier_enter() and pool_barrier_exit(). (Note that cfork() and proc_exit() also adjust pool reference counts when processes fork or exit.)

Now, back to my story. I turned to MDB to give me some clues as to what was going wrong:

> ::status
debugging crash dump vmcore.4 (64-bit) from balaclava
operating system: 5.11 onnv-bugfix (i86pc)
panic message: assertion failed: pool->pool_ref == 0, file:
../../common/os/pool.c, line: 454
dump content: all kernel and user pages
> ::panicinfo
             cpu                3
          thread ffffff03afa72400
         message assertion failed: pool->pool_ref == 0, file:
../../common/os/pool.c, line: 454
             rdi fffffffffbf31690
             rsi ffffff0008017988
             rdx fffffffffbf311b0
             rcx              1c6
              r8 ffffff00080179c0
              r9               20
             rax                0
             rbx              1c6
             rbp ffffff00080179b0
             r10 ffffff00080178d0
             r11 ffffff01ce469680
             r12 fffffffffbf311b0
             r13 fffffffffbf31018
             r14 fffffffffbc5b4d8
             r15                3
          fsbase                0
          gsbase ffffff01d243b580
              ds               4b
              es               4b
              fs                0
              gs              1c3
          trapno                0
             err                0
             rip fffffffffb84be90
              cs               30
          rflags              246
             rsp ffffff00080178c8
              ss               38
          gdt_hi                0
          gdt_lo         f00001ef
          idt_hi                0
          idt_lo         10000fff
             ldt                0
            task               70     
             cr0         8005003b
             cr2         feda43a8
             cr3         e49a3000
             cr4              6f8

The faulting thread had an address of ffffff03afa72400 and was on CPU 3.

> ffffff03afa72400::findstack -v
stack pointer for thread ffffff03afa72400: ffffff00080178c0
  ffffff00080179b0 panic+0x9c()
  ffffff0008017a00 assfail+0x7e(fffffffffbf31018, fffffffffbf311b0, 1c6)
  ffffff0008017a50 pool_pool_destroy+0x16b(47)
  ffffff0008017aa0 pool_destroy+0x40(2, 8067ce8, 47)
  ffffff0008017ca0 pool_ioctl+0xa32(a300000000, 3, 8064ca0, 102003,
      ffffff05de08fc48, ffffff0008017e8c)
  ffffff0008017ce0 cdev_ioctl+0x48(a300000000, 3, 8064ca0, 102003,
      ffffff05de08fc48, ffffff0008017e8c)
  ffffff0008017d20 spec_ioctl+0x86(ffffff03ac78f700, 3, 8064ca0, 102003,
      ffffff05de08fc48, ffffff0008017e8c, 0)
  ffffff0008017da0 fop_ioctl+0x7b(ffffff03ac78f700, 3, 8064ca0, 102003,
      ffffff05de08fc48, ffffff0008017e8c, 0)
  ffffff0008017eb0 ioctl+0x174(3, 3, 8064ca0)
  ffffff0008017f00 sys_syscall32+0x1fc()

As expected, the failed assertion occurred in pool_pool_destroy() after pool_do_bind() was called. Pool 0x47 (a non-default pool) was being destroyed.

> ::cpuinfo
  0 fffffffffbc38fb0  1f    1    0  10  yes    no t-0    ffffff03ad328700 ppdmgr
  1 ffffff01d23a1580  1f    0    0  60   no    no t-0    ffffff01e9f98000 ppdmgr
  2 ffffff01d23ce580  1f    0    0  60   no    no t-0    ffffff01d2a01560
  3 fffffffffbc40600  1b    1    0  41   no    no t-0    ffffff03afa72400

pooladm was responsible for the panic.

> pool_list::walk list | ::print -a pool_t      
    ffffff01ce887140 pool_id = 0
    ffffff01ce887144 pool_ref = 0x5f
    ffffff01ce887148 pool_link = {
        ffffff01ce887148 list_next = 0xffffff01d0fb0b48
        ffffff01ce887150 list_prev = pool_list+0x10
    ffffff01ce887158 pool_props = 0xffffff01d28d36d0
    ffffff01ce887160 pool_pset = 0xffffff01cf500508
    ffffff01d0fb0b40 pool_id = 0x47
    ffffff01d0fb0b44 pool_ref = 0x1
    ffffff01d0fb0b48 pool_link = {
        ffffff01d0fb0b48 list_next = pool_list+0x10
        ffffff01d0fb0b50 list_prev = 0xffffff01ce887148
    ffffff01d0fb0b58 pool_props = 0xffffff01d4d759c8
    ffffff01d0fb0b60 pool_pset = 0xffffff01cf500508

There were two pools. The first was the default pool, which appeared consistent when the system panicked. The second was the pool being destroyed; its reference count was one when the assertion failed, though everything else about it appeared consistent.

> ::walk proc | ::print -a proc_t p_pool ! grep ffffff01d0fb0b40
ffffff07e7586fa0 p_pool = 0xffffff01d0fb0b40
> ::offsetof proc_t p_pool
offsetof (proc_t, p_pool) = 0xc00
> ffffff07e7586fa0-0xc00=X

There was exactly one process that referred to the guilty pool: ffffff07e75863a0.

> ffffff07e75863a0::print proc_t p_zone
p_zone = 0xffffff01edf1bf00
> ::walk zone
> 0xffffff01edf1bf00::print zone_t zone_name
zone_name = 0xffffff01e6eec200 "jj1"
> 0xffffff01edf1bf00::print zone_t zone_pool
zone_pool = 0xffffff01ce887140

The non-global zone jj1 contained the guilty process. The global zone was the only other zone in the system. Notice that jj1 was associated with the default pool, not the pool that was being destroyed, when the system panicked. So jj1 itself, and all but one of the processes within it, had been rebound to the default pool.

> ffffff07e75863a0::ptree
fffffffffbc36f70  sched
     ffffff01d25c33a0  init
          ffffff07e3d2a3a0  svc.startd
               ffffff07e5b563a0  ppd-cache-update
                    ffffff07e5c4f3a0  ppdmgr
                         ffffff07e60aa3a0  ppdmgr
                              ffffff07e70323a0  ppdmgr
                                   ffffff07e731c3a0  ppdmgr
                                        ffffff07e75863a0  ppdmgr

cfork() forked the guilty process. It was not created via newproc(), for if it were, then it would not have had a parent process (ffffff07e731c3a0).

> ffffff07e731c3a0::print proc_t p_pool
p_pool = 0xffffff01ce887140
> ffffff07e731c3a0::print proc_t p_zone
p_zone = 0xffffff01edf1bf00
> ffffff07e70323a0::print proc_t p_pool
p_pool = 0xffffff01ce887140
> ffffff07e70323a0::print proc_t p_zone
p_zone = 0xffffff01edf1bf00

Both the parent and grandparent of the guilty process were within zone jj1 and both referred to the default pool when the system panicked.

I started to look for interleavings of code from cfork(), proc_exit(), and pool_do_bind() that would lead to inconsistent states, going so far as to create diagrams illustrating which locks were held (and the order in which they were acquired) at various points in the aforementioned functions, but found nothing that suggested a race condition. I struggled to understand the fork and exit code (a nontrivial task) to see if any of the invoked subroutines were generating race conditions, but did not find anything. A fellow engineer suggested three or four possible sources of the bug, including a three-way race between the aforementioned functions, but a little investigation and a few counterexamples put his theories to rest. I made no progress for at least two weeks.

My frustrations were about to drive me insane when I stumbled upon what I thought was the source of the bug. The problem was in the second loop, in pool.c:1491-1500:

1491 		mutex_enter(&p->p_lock);
1492 		/*
1493 		 * Skip processes in local zones if we're not binding
1494 		 * zones to pools (P_ZONEID).  Skip kernel processes also.
1495 		 */
1496 		if ((!INGLOBALZONE(p) && idtype != P_ZONEID) ||
1497 		    p->p_flag & SSYS) {
1498 			mutex_exit(&p->p_lock);
1499 			continue;
1500 		}

The problem was on line 1496. The first disjunct of this conditional made pool_do_bind() skip child processes that were not in the global zone. (idtype == P_POOLID [idtype is one of pool_do_bind()'s parameters] when a pool is being destroyed.) Therefore, if a process in a non-global zone was forking (but had not yet created its child's proc_t structure via getproc() [fork.c:907-1177, esp. 1055-1067]) when pool_do_bind() went through the first loop, then the second loop would never have added the process' child to procs. Thus the child process would have remained bound to pool P, resulting in the failed assertion.

Here is a sample execution that illustrates this bug (thread A is the thread that is destroying P while thread B is executing cfork()):

  1. A enters pool_do_bind().
  2. B enters cfork().
  3. B enters the pool barrier.
  4. B enters getproc().
  5. B allocates and zeroes the child proc's proc_t structure.
  6. A acquires pidlock, adds B's proc to procs, and releases pidlock (i.e., A goes through the first loop).
  7. B adds the child proc to the process tree and the active process list (both of which require B to grab pidlock).
  8. B attempts to exit the pool barrier via pool_barrier_exit(), but PBWAIT is set in its proc_t's p_poolflag field, so it wakes A and blocks, waiting for A to signal it.
  9. A grabs pidlock and examines all processes in procs (i.e., A goes through the second loop).
  10. While examining procs, A looks at B's child process, sees that it is not in the global zone and idtype is not P_ZONEID (it is P_POOLID), and consequently does not add B's child process to procs.
  11. A rebinds all processes in procs and decrements the old pool's (P's) reference count accordingly.
  12. A signals (wakes) all processes in procs.
  13. B wakes up.
  14. B turns its child process's LWPs loose.

The solution was simple: extend the first disjunct of the above conditional statement (pool.c:1496) with another conjunct, idtype != P_POOLID, so that the first disjunct reads "!INGLOBALZONE(p) && idtype != P_ZONEID && idtype != P_POOLID". That way emptying pools of processes (e.g., during pool destruction) will not skip new processes in non-global zones.

I thought, "At last, I nailed it!" but my success was short-lived. The same assertion failed after a few more runs through the zones test suite. I jumped back into MDB and examined the new dump, which was the same as the old dump (see above) with two exceptions: First, the guilty process had descendants in the new dump. That meant that the guilty process was not being spawned when pool_do_bind() executed. If it were being spawned when pool_do_bind() started the first loop, then its parent process would have blocked at pool_barrier_exit() and the child's LWPs would not have started until pool_do_bind() finished executing, which would have given the child no opportunity to spawn descendants.

Furthermore, if the child was being spawned when pool_do_bind() started the first loop and the child started spawning descendants between the time when the thread executing pool_do_bind() returned to pool_pool_destroy() and when the thread encountered the failed assertion, the child's descendants would have been bound to the child's pool, making the pool's reference count greater than one. But the pool's reference count was one, so the child was not being spawned. (One might claim that the descendants could have rebound themselves to other pools before the assertion was made, but that was impossible because the pool lock, which prohibited concurrent pool operations, was held while pool_pool_destroy() and pool_do_bind() were executed.)

Second, the guilty process was executing a subroutine called by relvm() (which was inside the pool barrier) within proc_exit(). That fact led me to think that some interaction between proc_exit() and pool_do_bind() was responsible for the bug.

Further source code analysis did not reveal anything, so I scattered over twenty static DTrace probes throughout cfork(), proc_exit(), and pool_do_bind() in a desperate effort to acquire more useful information. After taking a few more dumps, adjusting the probes, and parsing the DTrace buffers stored in the dumps, I acquired a vital clue: a process that was exiting (via proc_exit()) and had entered (but not exited) the pool barrier was not being caught by the first loop in pool_do_bind(). Curious, I looked closely at the code surrounding pool_barrier_enter() in proc_exit() and at the first loop in pool_do_bind(). I noticed nothing out of the ordinary, so I thought, "Great, I might as well reexamine functions called by proc_exit() and pool_do_bind() that I thought were correct." So I reexamined procinset() (which pool_do_bind() used in both the first and second loops to determine whether a given process was bound to the pool being destroyed) and saw the following (uts/common/os/procset.c):

270 /*
271  * procinset returns 1 if the process pointed to by pp is in the process
272  * set specified by psp, otherwise 0 is returned. A process that is
273  * exiting, by which we mean that its p_tlist is NULL, cannot belong
274  * to any set; pp's p_lock must be held across the call to this function.
275  * The caller should ensure that the process does not belong to the SYS
276  * scheduling class.
277  *
278  * This function expects to be called with a valid procset_t.
279  * The set should be checked using checkprocset() before calling
280  * this function.
281  */
282 int
283 procinset(proc_t *pp, procset_t *psp)
284 {
285 	int		loperand = 0;
286 	int		roperand = 0;
287 	int		lwplinproc = 0;
288 	int		lwprinproc = 0;
289 	kthread_t	*tp = proctot(pp);
291 	ASSERT(MUTEX_HELD(&pp->p_lock));
293 	if (tp == NULL)
294 		return (0);
296 	switch (psp->p_lidtype) {

Notice lines 293-294. If a process' thread list was NULL, then procinset() indicated failure (the process was not in the process set). Now look at the code surrounding pool_barrier_enter() in proc_exit():

470 	mutex_enter(&p->p_lock);
472 	/*
473 	 * Clean up any DTrace probes associated with this process.
474 	 */
475 	if (p->p_dtrace_probes) {
476 		ASSERT(dtrace_fasttrap_exit_ptr != NULL);
477 		dtrace_fasttrap_exit_ptr(p);
478 	}
480 	while ((tmp_id = p->p_itimerid) != 0) {
481 		p->p_itimerid = 0;
482 		mutex_exit(&p->p_lock);
483 		(void) untimeout(tmp_id);
484 		mutex_enter(&p->p_lock);
485 	}
487 	lwp_cleanup();
489 	/*
490 	 * We are about to exit; prevent our resource associations from
491 	 * being changed.
492 	 */
493 	pool_barrier_enter();
495 	/*
496 	 * Block the process against /proc now that we have really
497 	 * acquired p->p_lock (to manipulate p_tlist at least).
498 	 */
499 	prbarrier(p);
501 #ifdef	SUN_SRC_COMPAT
502 	if (code == CLD_KILLED)
503 		u.u_acflag |= AXSIG;
504 #endif
505 	sigfillset(&p->p_ignore);
506 	sigemptyset(&p->p_siginfo);
507 	sigemptyset(&p->p_sig);
508 	sigemptyset(&p->p_extsig);
509 	sigemptyset(&t->t_sig);
510 	sigemptyset(&t->t_extsig);
511 	sigemptyset(&p->p_sigmask);
512 	sigdelq(p, t, 0);
513 	lwp->lwp_cursig = 0;
514 	lwp->lwp_extsig = 0;
515 	p->p_flag &= ~(SKILLED | SEXTKILLED);
516 	if (lwp->lwp_curinfo) {
517 		siginfofree(lwp->lwp_curinfo);
518 		lwp->lwp_curinfo = NULL;
519 	}
521 	t->t_proc_flag |= TP_LWPEXIT;
522 	ASSERT(p->p_lwpcnt == 1 && p->p_zombcnt == 0);
523 	prlwpexit(t);		/* notify /proc */
524 	lwp_hash_out(p, t->t_tid);
525 	prexit(p);
527 	p->p_lwpcnt = 0;
528 	p->p_tlist = NULL;
529 	sigqfree(p);
530 	term_mstate(t);
531 	p->p_mterm = gethrtime();
533 	exec_vp = p->p_exec;
534 	execdir_vp = p->p_execdir;
535 	p->p_exec = NULLVP;
536 	p->p_execdir = NULLVP;
537 	mutex_exit(&p->p_lock);

Notice anything fishy?

proc_exit() set the exiting process' p_tlist field to NULL after entering the pool barrier but before releasing the process' p_lock (exit.c:528), which pool_do_bind() grabbed during the first loop before invoking procinset() (pool.c:1367-1378). So if a process had entered the pool barrier but not yet exited it when another process attempted to destroy the pool, procinset() would have told the latter process that the former was not bound to the pool being destroyed. Thus the thread executing pool_do_bind() would have skipped the exiting process, which would have remained bound to the dying pool. Hence the failed assertion.

(It is funny that I did not notice the comment in procset.c:272-274 when I first examined procinset(). It would have saved me much grief.)

The following sample execution will illustrate my point. Suppose that thread A belongs to a process that is bound to a non-default pool P. Suppose further that A is in the middle of proc_exit() and that some other thread B (in a different process) is destroying P and is in the middle of pool_do_bind(). Then the following might happen:

  1. B constructs procs and grabs pidlock. (pool.c:1333-1357)
  2. B begins checking each process in the active process list (i.e., it starts going through the first loop). (pool.c:1359-1366)
  3. B is context-switched with A.
  4. A grabs its process' p_lock and enters the pool barrier. (exit.c:470-493)
  5. A sets its process' p_tlist field to NULL. (exit.c:528)
  6. A releases its process' p_lock. (exit.c:537)
  7. A is context-switched with B.
  8. B grabs A's process' p_lock. (pool.c:1367)
  9. B calls procinset() and sees a return value of zero. (pool.c:1373)
  10. B skips A's process and does not add it to procs. (pool.c:1376-1377)
  11. B finishes pool_do_bind() successfully and returns to pool_pool_destroy().
  12. B asserts that the targeted pool's reference count is zero and fails. (pool.c:454)

Thus A's process would not be rebound to the default pool and the assertion would fail.

The second loop in pool_do_bind() did not examine the missed process (even if its parent was added to procs during the first loop) because the second loop also used procinset() to determine whether child processes were bound to the targeted pool. So pool_do_bind() was incapable of catching an exiting process as described above.

Further examination of procinset() revealed that a process' p_tlist field was needed only when the idtype argument was P_CID. Thus the most straightforward fix was to perform the p_tlist NULL check only when idtype == P_CID. I took this approach, which (in addition to a few minor changes elsewhere) worked beautifully. The bug never appeared again, even when I ran my own test that created several pools, bound one of them to a running zone, and destroyed the pools in a tight loop for days.

Thus I found the two causes of the bug in roughly a month. You can imagine what a sigh of relief I gave when I verified my fix!

This episode has one moral: RTFC (Read The F*%!@#& Comments)!



I am a kernel developer at Sun Microsystems, Inc., working on zones and resource pools. This blog logs some of my thoughts regarding my work and the [mis]adventures that I have while working on Solaris.
