ZFS is designed as a highly scalable storage pool kernel module.
Behind that simple idea are a lot of subsystems, internal to ZFS, which are cleverly designed to deliver high performance for the most demanding environments. But as computer systems grow in size and as demand for performance follows that growth, we are bound to hit scalability limits (at some point) that we had not anticipated at first.
ZFS easily scales in capacity by aggregating 100s of hard disks into a single administration domain. From that single pool, 100s or even 1000s of filesystems can be trivially created for a variety of purposes. But then people got crazy (rightly so) and we started to see performance tests running on a single filesystem
. That scenario raised an interesting scalability limit for us...something had to be done.
Filesystems are kernel objects that get mounted once at some point (often at boot). Then, they are used over and over again, millions even billions of times. To simplify each read/write system call uses the filesystem object for a few milliseconds. And then, days or weeks later, a system administrator wants this filesystem unmounted and that's that. Filesystem modules, ZFS or other, need to manage this dance in which the kernel object representing a mount point is in-use for the duration of a system call and so must prevent that object from disappearing. Only when there are no more system calls using a mountpoint, can a request to unmount be processed. This is implemented simply using a basic reader/writer lock, rwlock(3C)
: A read or write system call acquires a read
lock on the filesystem object and holds it for the duration of the call, while a umount(2) acquires a write
lock on the object.
For many years, individual filesystems from a ZFS pool were protected by a standard Solaris rwlock. And while this could handle 100s of thousands or read/write calls per second through a single filesystem eventually people wanted more.
Rather than depart from the basic kernel rwlock, the Solaris team decided to tackle the scalability of the rwlock code itself. By taking advantage of visibility into a system's architecture, Solaris is able to use multiple counters in a way that scales with the system's size. A small system can use a simple counter to track readers while a large system can use multiple counters each stored on separate cache lines for better scaling. As a bonus they were able to deliver this feature without changing the rwlock function signature. For ZFS code, just simple rwlock initialization change was needed to open up the benefit of this scalable rwlock.
We also found that, in addition to protecting the filesystem object itself, another structure called a ZAP object used to manage directories was also hitting the rwlock scalability limit and that was changed too.
Since the new locks have been put into action, they have delivered scalable performance into single filesystems that is absolutely superb. While the French explorer Jean-Louis Etienne
claims that "On se repousse pas ses limites, on les decouvre:" From the comfort of my air-conditioned office, I conclude that we are
pushing the limits out of harm's way.