ZFS from a RAS point of view: context of data
By relling on Nov 16, 2005
I've been working on RAS analysis of ZFS for a while and I haven't been this excited about a new product launch for a very long time. I'll be blogging about it more over the next few months as it is a very deep, interesting, and detailed analysis.
Let's begin with a look at some of the history of data storage in computers. Long, long ago, persistent data storage was very costly (price/bit) and slow. This fact of life led file system designers to be efficient with space while also trying to optimize storage placement on the disk for performance. Take the venerable UFS, for example. When UFS was designed, disks were quite simple and mostly dumb devices. If you examine the newfs(1m) man page, you'll see all sorts of options for setting block sizes (which are fixed), cylinder groups, rotational delay, rotational speed, number of tracks per cylinder, et al. UFS wanted to know the details of the hardware so that it could optimize its utilization. Of course, over time the hardware changed rather dramatically: SCSI (which hides disk geometry from the host) became ubiquitous, new disk interfaces were developed, processors became much faster than the rotational delay, new storage devices were invented, and so on. A problem with this design was that, in practice, the hardware dictated the data storage design. As the hardware changed dramatically in the years since its invention, the effects of the early design philosophy became more apparent and required many modifications. Change is generally a good thing, and UFS's ability to survive more than 25 years is a testament to its good design at inception.
From a RAS perspective, having the hardware drive the file system design leads to some limitations. The most glaring is that UFS trusts the underlying hardware to deliver correct data. In hindsight, this is rather risky, as the hardware was often unreliable with little error detection or correction in the data path from memory to media. But given that CPUs were so slow and expensive, using CPU cycles to provide enhanced error detection wasn't feasible. In other words, the file system does not handle data corruption caused by the hardware very well. If you have ever had to run fsck(1m) manually, you know what I mean.
This approach had another effect on system design. The computer industry has spent enormous effort designing some very cool and complex devices to look like disk drives. Think about it. RAID arrays are perhaps the most glorious example of this taken to the extreme: you'll see all sorts of disk virtualization, replication, backup, and other tricks taking place behind a thin veneer disguised as a disk. The problem is that by emulating a rather simple device, any context of the data is lost. From a data context perspective, a RAID array is basically dumbed down to the level of a simple rotating platter with a moving head. While many people seem to be happy with this state of affairs, it is really very limiting. For example, a RAID-5 volume achieves its best write performance when full-stripe writes occur. But the RAID-5 volume is really using disks to emulate a disk. If you have a 4+1 RAID-5 volume, then the minimum stripe size for best performance is a multiple of 2 kBytes (N × 4 × 512 bytes); more likely it will be much larger. Regardless of the optimal stripe width, UFS doesn't know anything about it, and neither do applications. So UFS goes happily on its way, dutifully placing application data in the file system. To their credit, performance experts have spent many hours manually trying to match application, file system, and storage data alignment to reach peak performance. I'd rather be surfing...
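To make the stripe arithmetic concrete, here is a small sketch. The 4+1 geometry and 512-byte sectors come from the example above; the function name and the 128-sector column are my own illustrative choices, not any array vendor's terminology:

```python
# Full-stripe write size for a RAID-5 volume with D data disks and one
# parity disk, N sectors per column, 512-byte sectors. Parity is excluded
# because it carries no application data.

SECTOR = 512  # bytes

def full_stripe_bytes(data_disks, sectors_per_column):
    """Bytes of application data in one full stripe (parity excluded)."""
    return data_disks * sectors_per_column * SECTOR

# Minimum full stripe for a 4+1 volume: one sector per column.
assert full_stripe_bytes(4, 1) == 2048  # 2 kBytes, as in the text

# Real arrays use much larger columns; e.g. 64 kByte (128-sector) columns:
print(full_stripe_bytes(4, 128))  # 262144 bytes, i.e. 256 kBytes
```

Any write smaller than (or misaligned with) this full stripe forces the array into a read-modify-write of the parity, which is exactly the penalty the manual tuning described above tries to avoid.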
Suppose we could do it all over again; this chance may not arise for another 25 years. Rather than having a hardware design dictate the file system design, let's make the file system design fit our data requirements. Further, let's not trust the hardware to provide correct data. Let's call it ZFS. Suddenly we are liberated! When an application writes data, ZFS knows how big the request is and can allocate an appropriately sized block. CPU cycles are now very inexpensive (more so now that cores are almost free), so we can use the CPU to detect data corruption anywhere in the data path. We don't have to rely on a parity-protected SCSI bus, or on bug-free disk firmware (I've got the scars), to ensure that what is on persistent storage is what we get in memory. For that matter, we really don't care whether the storage is a disk at all; we'll just treat it as a random-access list of blocks. In any case, by distrusting everything in the storage data path, we build reliability and redundancy into the file system itself. We'd really like applications to do this too, but that is like boiling the ocean, and I digress. The key here is that ZFS knows the data, knows when it is bad, and knows how to add redundancy to make it reliable. This knowledge of the context of the file system data is very powerful. I'll be exploring the meaning of this in detail in later blogs. For now, my advice is to get on the OpenSolaris bandwagon and try it out; you will be pleasantly surprised.
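The "distrust the hardware" idea can be sketched in a few lines. This is a toy illustration of the principle only, not ZFS's actual on-disk format: ZFS stores its checksums (fletcher or SHA-256) in the parent block pointer rather than alongside the data, and its self-healing spans mirrors and RAID-Z. Here SHA-256 from the standard library stands in, and all names are mine:

```python
# Toy end-to-end checksum with self-healing from a mirrored copy.
# The checksums live "above" the storage, so a disk cannot corrupt
# both the data and the record of what the data should be.

import hashlib

def checksum(data):
    return hashlib.sha256(data).digest()

class MirroredStore:
    """Two untrusted 'disks', each holding a copy of every block."""
    def __init__(self):
        self.sides = [{}, {}]   # block_id -> bytes, one dict per disk
        self.sums = {}          # block_id -> expected checksum

    def write(self, block_id, data):
        self.sums[block_id] = checksum(data)
        for side in self.sides:
            side[block_id] = data

    def read(self, block_id):
        expected = self.sums[block_id]
        for i, side in enumerate(self.sides):
            data = side[block_id]
            if checksum(data) == expected:
                # Good copy found; heal the other side if it went bad.
                other = self.sides[1 - i]
                if checksum(other[block_id]) != expected:
                    other[block_id] = data
                return data
        raise IOError("both copies failed checksum: %r" % block_id)

store = MirroredStore()
store.write("blk0", b"important data")
store.sides[0]["blk0"] = b"bit-rotted junk"   # silent corruption on disk 0
print(store.read("blk0"))                      # b'important data' (repaired)
assert store.sides[0]["blk0"] == b"important data"
```

Note that the read path never has to trust either disk: a wrong answer is detected, the good copy is returned, and the bad copy is rewritten, all without the application noticing.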