Beware of the Performance of RW Locks

In my naive little mind a rw lock would represents a performant scalable construct inasmuch as WRITERS do not hold the lock for a significant amount of time. One figures that the lock would be held for short WRITERS times followed by concurrent execution of RW_READERS.

What I recently found out is quite probably well known to seasoned kernel engineer but this was new to me. So I figured it could be of interest to others.


So Reader/Writer locks (RW) can be used in kernel and user level code to allow multiple READERS of, for instance, a data structure, to access the structure while allowing only a single WRITER at a time within the bounds of the rwlock().

A RW locks (rwlock(9F), rwlock(3THR)) is more complex that a simple mutex. So acquiring such locks will be more expensive. This means that if the expected hold times of a lock is quite small (say to update or read 1 or 2 fields of a structure) then regular mutexes can usually do that job very well. A common programming mistake is to expect faster execution of RW locks for those cases.

However when READ hold times need to be fairly long; then RW locks represent an alternative construct. With those locks we expect to have multiple READERS executing concurrently thus leading to performant code that scales to large numbers of threads. As I said, if WRITERS are just quick updates to the structure, we can naively believe that our code will scale well.

Not So

Let's see how it goes. A WRITER lock cannot get in the protected code while READERS are executing protected code. The WRITER must then wait at the door until READERS releases their hold. If the implementation of RW locks didn't pay attention, there would be cases in which at least one READER is always present within the protected code and WRITERS would get starved of access. To prevent such starvation, RW lock must block READERS as soon as a WRITER has requested access. But no matter, our WRITERS will quickly update the structure and we will get concurrent execution most of the time. Isn't it ?

Well not quite. As just stated, a RW locks will block readers as soon as a WRITER has hit the door. This means that the construct does not allow parallel execution at that point. Moreover the WRITER will stay at the door while READERS are executing. So the construct stays fully serializing from the time a WRITER hits until all current READERS are done followed by the WRITERS time.

For Instance:

	- a RW_READER gets in and will keep a long time. ---|
	- a RW_WRITER hits the lock; is put on hold.        |
	- other RW_READERS now also block.                  |
	.... time passes			            |
	- the long RW_READER releases	   <----------------|
	- the RW_WRITER gets the lock; work; releases
	- other RW_READER now work concurrently.

Pretty obvious once you think about it. So to assess the capacity of a RW lock to allow parallel execution, one must consider the average hold time as a READER but also the frequency of access as a WRITER. The construct becomes efficient and scalable to N threads if and only if:

(avg interval between writers) >> (N \* avg read hold times).


In the end, from a performance point of view, RW locks should be used only when the average hold times is significant in order to justify the use of this more complex type of lock: for instance, calling a function of unknown latency or issuing an I/O while holding the lock represent good candidates. But the construct will be scalable to N threads, if and only if WRITERS are very infrequent.



Thanks for writing this, Roch. The application I work on uses rwlocks for inter-process synchronisation, and what you've said tallies with what we've seen - you've just put it better than I could.

In terms of read scalability - we've found the price of admission relatively high for rwlocks. The results we see show that something that takes as much as 1000 cycles for a "read" operation does not parallelize well if a rwlock is used - in this case, two readers result in less throughput than a single reader on most machines. As you've mentioned, the effects of frequent write are fairly catastrophic too - although ignoring scalability you could justify the use of rwlocks by their queuing semantics (writers preferred over readers, which is very much what we want, for example).

It's also worth mentioning that preemption control (i.e. the schedctl APIs) can help avoid an effective priority inversion when using relatively short userland locks like this.

Going back to the original point, you said "RW locks should be used only when the average hold times is significant in order to justify the use of this more complex type of lock". It's very subjective, but what do you consider to be significant in this case?

Posted by Philip Beevers on décembre 15, 2005 at 12:21 PM MET #

rwlocks are mostly horrid. Their very nature encourages developers to hold locks for extended periods, contrary to received wisdom. The more readers there are, and the longer they run holding the lock, the greater the chance that one (or more) will be preempted before releasing the lock. The net effect is that readers end up using stale data, while fresh data are waiting.

If reads and updates are short, a mutex should suffice. Mutexes imply less in terms of fairness, and are therefore generally much cheaper to implement.

If read processing is long but updates are rare, it is generally better for readers to take a constistent read up front, do their processing without lock protection, and then to restart if anything has changed.

I'm sure there are cases where rwlocks make sense, but I generally work hard to avoid them. Their appearance in Solaris owes more to standards compliance than hearty approval.

Posted by Phil Harman on décembre 15, 2005 at 02:52 PM MET #

The behaviour of blocking readers when a writer is waiting which you describe is still better than the behaviour of a mutex, right? A mutex would block new readers too, and not just when a writer is waiting.

Posted by Luke on décembre 15, 2005 at 03:59 PM MET #

[In response to Luke] : On paper, yes, the rwlock behaviour is better - but the additional overhead of the rwlock means the mutex wins for short lock times. Remember that it's likely that the rwlock is implemented using a mutex to protect it anyway; every time you lock an rwlock, you're locking and unlocking a mutex, and similarly when you unlock, so it's inherently quite a bit more expensive than a mutex.

Posted by Philip Beevers on décembre 15, 2005 at 05:13 PM MET #

[In response to Philip's repsonse to Luke] : Indeed. Plus rwlocks imply a prioritised sleep queue (as least, writers have priority over readers), and this adds to the implementation overhead. But more than that, in our implementation all waiters are also queued in prioritised FIFO order (a POSIX requirement for realtime etc). The only saving grace is that my process shared implementation does a direct handoff to waiters (so they don't have to reacquire a mutex when they are awoken).

Posted by Phil Harman on décembre 16, 2005 at 08:07 AM MET #

"hold times should be significant" that should be in relationship with the time to acquire the locks. This would depend if it's kernel or user land. I have not measured them but I'd put the bar at tens of usec (for kernel). By even below 10usec hold times; if you have frequent access and 32 cpus, a mutex may not be the solution. Ok, writing scalable code is hard. "So a mutex would block readers too" yes but in kernel they would also potentially cause readers to spin on CPU and as pointed our does not order other consumers like RW locks. Obviously each construct will have their benefit and sometimes RW locks is the right semantic. What I wanted to point out is really something about the performance characteristics. Just saying that a RW lock does quick updates (WRITER) does not make the lock scalable. Frequency of Writers combined with Readers hold times contributes.

Posted by Roch Bourbonnais on décembre 16, 2005 at 09:08 AM MET #

Post a Comment:
Comments are closed for this entry.



« juillet 2016

No bookmarks in folder