On Locklessness

On Locklessness

Unlike many other Unix kernels, the Solaris kernel is heavily multithreaded, including our networking stack. The conventional way to ensure consistency between multiple threads is to use synchronization primitives such as mutexes and readers-writer locks -- and indeed, these are widely used throughout the Solaris kernel. However, there are times when conventional locks are inappropriate, and more creative approaches to ensuring consistency must be devised. In this blog entry, I discuss a recent issue I encountered while working on the new Nemo[1] network driver framework, and describe an elegant but rarely-used approach.

The Problem

Among its other responsibilities, the Nemo framework provides the glue between the network protocol layer (e.g., IP) and the underlying network device driver. In particular, when a packet needs to be sent to an underlying device, the protocol stack will call the dls_tx() function. The dls_tx() function takes two arguments: the data-link to send the packet on, and the the packet to send. If any part of the packet could not be sent, the remainder is returned. In the original Nemo implementation, dls_tx() was implemented as:

	mblk_t \*
	dls_tx(dls_channel_t dc, mblk_t \*mp);
	        dls_impl_t      \*dip = (dls_impl_t \*)dc;
	        return (dip->di_tx(dip->di_tx_arg, mp));
Note that dc and dip refer to the same object (the data-link), but the first is opaque so that the caller cannot grovel around in the implementation details[2].

As is apparent, dls_tx() turns around and calls a previously-specified per-datalink transmit function pointer to send packets on that data-link. The transmit function is passed a previously-specified callback argument, and the packet to transmit. The transmit function and callback argument usually specify the underlying network driver's transmit function and a driver-instance data structure associated with that link.

However, when promiscuous-mode[3] is enabled on the underlying device, the Nemo framework interposes its own transmit function and callback argument between dls_tx() and the underlying driver's transmit routine. This interposed function loops back each transmitted packet so that a system will see its own transmitted packets through tools such as snoop(1M)[4].

Unfortunately, this presents a problem: as can be seen in dls_tx(), no locks are held as part of dereferencing di_tx and di_tx_arg. Thus, if one thread is calling dls_tx() while another is enabling or disabling promiscuous mode, it is possible for us to end up invoking the callback function with a mismatched callback argument.

As an example, suppose that the normal di_tx and di_tx_arg are tx and txarg, and the promiscuous di_tx and di_tx_arg are looptx and looptxarg. Then the following could happen:

          thread 1 (in dls_tx())      |  thread 2 (enabling promiscuous)
  time    load di_tx = tx             |
   |                                  |  store di_tx = looptx
   |                                  |  store di_tx_arg = looptxarg
   V      load di_tx_arg = looptxarg  |

Clearly, passing the wrong argument to di_tx could trigger panic or other misbehavior and is thus unacceptable. As such, we need to ensure that from the perspective of dls_tx(), di_tx and di_tx_arg are always either tx/txarg, or looptx/looptxarg.

So ... why not lock?

The simplest and most obvious way to address this problem would be to introduce locks, changing dls_tx() to be:

	mblk_t \*
	dls_tx(dls_channel_t dc, mblk_t \*mp)
	        dls_impl_t      \*dip = (dls_impl_t \*)dc;
		mblk_t		\*mp;

	   	mp = dip->di_tx(dip->di_tx_arg, mp);
		return (mp);

Of course, the same lock would be acquired while assigning di_tx and di_tx_arg, making the assignment of both di_tx and di_tx_arg appear atomic from the perspective of dls_tx(). While this approach will work (and, moreover, is almost always the right way to solve this problem), in this particular case it has some drawbacks. To wit:

  1. All calls to the underlying di_tx function on the given channel are now serialized. That is, if multiple threads call dls_tx() for the same dc, only one thread will be able to call di_tx at a time.
  2. The dls_tx() function is called for every transmitted packet. Thus acquiring and releasing a lock -- even supposing that it is rarely contended -- introduces needless overhead [5].
  3. The dls_tx() function is no longer eligible for compiler tail-call optimization, impacting performance since the stack frame of dls_tx() can no longer be directly reused by di_tx()[6].

Lockless Atomicity

While the first problem could be fixed by switching to a readers-writer lock, this does nothing for the other two problems. Indeed, there appears to be no way to overpower this problem with complexity, so instead we return our requirement:

We need to ensure that from the perspective of dls_tx(), di_tx and di_tx_arg are always either tx/txarg, or looptx/looptxarg.
Now, if all we needed to do was guarantee that only di_tx or di_tx_arg assignments appeared atomic, we'd need no locks at all: all processors that Solaris supports guarantee that assignments to pointer-sized quantities will be atomic [7].

However, this of course means that assignment of structure pointers is also atomic. Thus, we can simply bundle these two fields into a structure:

	typedef struct mac_txinfo_s {
	        mac_tx_t                mt_fn;
	        void                    \*mt_arg;
	} mac_txinfo_t;

... and fill in two such structures for each driver: one that contains the tx/txarg pair, and one that contains the looptx/looptxarg pair. Here's the actual code that does this in i_mac_create():

         \* Set up the two possible transmit routines.
        mip->mi_txinfo.mt_fn = mp->m_tx;
        mip->mi_txinfo.mt_arg = mp->m_driver;
        mip->mi_txloopinfo.mt_fn = mac_txloop;
        mip->mi_txloopinfo.mt_arg = mip;

We then replace di_tx and di_tx_arg with a single di_txinfo field:

        const mac_txinfo_t              \*di_txinfo;

Again, since di_txinfo is a single pointer, it can be atomically set to point to either mi_txinfo or mi_txloopinfo. Coming back around to dls_tx(), we can now remove the need for locks, and also restore the compiler's ability to tail-call optimize dls_tx():

	mblk_t \*
	dls_tx(dls_channel_t dc, mblk_t \*mp)
	        const mac_txinfo_t \*mtp = ((dls_impl_t \*)dc)->di_txinfo;
	        return (mtp->mt_fn(mtp->mt_arg, mp));

Yes, that's it -- it's really that elegant. However, it's also a bit subtle: the code is careful to dereference di_txinfo only once. A more mechanical transformation of the original dls_tx() function would reintroduce a variant of the original bug, since di_txinfo might change between the two dereferences:

	mblk_t \*
	dls_tx(dls_channel_t dc, mblk_t \*mp)
	        dls_impl_t      \*dip = (dls_impl_t \*)dc;

		/\* WRONG \*/	
	        return (dip->di_txinfo->mt_fn(dip->di_txinfo->mt_arg, mp));

It's also worth noting that even the correct lockless dls_tx() has subtlely different semantics from the version which used mutexes: since there is no explicit synchronization, it's possible a thread in dls_tx() may not see recent changes to the di_txinfo pointer. In this particular case, this is acceptable since:

  • The lifetime of either value is identical, and it is not possible for either value to become invalid while in dls_tx(), due to other invariants in the Nemo framework.
  • The effect of using the wrong function for a few packets is unobservable: either the kernel will attempt to loop back a few packets unnecessarily, or the first few packets after promiscuous-mode is enabled will not be looped back. The latter is unobservable because the application which enabled promiscuous-mode cannot observe the exact instant that it became possible to loop-back packets it transmits.

In many cases, these restrictions may limit the applicability of this approach -- and moreover, may suggest that a more conventional locking approach is needed. Nonetheless, as this discussion has hopefully made clear, there are times when locks are a poor match to the problem at hand, and "going lockless" both improves performance and code elegance.


[1] Nemo is also known as GLDv3, and is our brand new network driver framework that bge and xge currently use. All future Sun-provided network drivers on Solaris will use GLDv3. Once the interfaces stabilize, they will be published for third-party network driver developers.
[2] Throughout the kernel, we prefer this to void \* since this enables the compiler to still type-check, thus maintaining type-safety. The actual definition of dls_channel_t is typedef struct dls_t \*dls_channel_t, where dls_t is never defined.
[3] Enabling promiscuous-mode allows a network device to receive and send up packets on the network which are not destined for its hardware unicast, multicast, or broadcast addresses. In the old days, this would allow all traffic on the subnet to be observed -- but in these days of switched networks, enabling promiscuous-mode often yields only a few additional packets.
[4] Almost all network devices are deaf while transmitting, so transmitted packets which a host must also receive must be looped back in software.
[5] For instance, on SPARC, a mutex_exit() will perform a membar #LoadStore|#StoreStore, which forces the processor to ensure that all outstanding loads and stores complete before any future store complete.
[6] Those not familiar with the basics of tail-call optimization should read this wiki entry. On SPARC, this optimization can be important because it can avoid needless register spill/fill traps (among other things).
[7] To my knowledge, we have never officially stated this as a guarantee, but tons of kernel code relies on this invariant.

Technorati Tag:
Technorati Tag:


Hi Meem, great read. I love to read those neat little tricks. Keep it coming.

Posted by Tiffany on October 04, 2006 at 09:03 PM EDT #

Post a Comment:
  • HTML Syntax: NOT allowed



« April 2014

No bookmarks in folder


No bookmarks in folder