Looking at ZFS
By relling on Aug 29, 2007
A few months ago, I blogged about why I wasn't at JavaOne and mentioned that I was looking at some JOGL code. Now I'm ready to show you some cool pictures which provide a view into how ZFS uses disks.
The examples here show a mirrored disk pair. I created a mirrored zpool and used the default ZFS settings. I then did a recursive copy of /usr/share into the ZFS file system. This is a write-mostly workload.
There are several problems with trying to visualize this sort of data:
- There is a huge number of data points. A 500 GByte disk has about a billion blocks. Mirror that and you are trying to visualize two billion data points. My workstation screen size is only 1.92 million pixels (1600x1200) so there is no way that I could see this much data.
- If I look at an ASCII table of this data, then it may be hundreds of pages long. Just for fun, try looking at the output of zdb -dddddd to get an idea of how the data might look in ASCII. But I'll warn you in advance: try this only on a small zpool located on a non-production system.
- One dimensional views of the data are possible. Actually, this is what zdb will show for you. There is some logic to this, because a device is accessed as a single set of blocks using an offset and size for read or write operations. But this doesn't scale well, especially to a billion data points.
- Two dimensional views are also possible, where we basically make a two dimensional array of the one dimensional data. This does hide some behaviour, because a disk is not really a flat grid: it is a stack of circular tracks of different sizes. These physical details are cleverly hidden by the device and subject to change on a per-case basis. So, perhaps we can see some info in two dimensions that would help us understand what is happening.
- Three dimensional views can show even more data. This is where JOGL comes in: it is a 3-D library for Java.
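To put rough numbers on the list above, here is a small Java sketch of the arithmetic and of folding a one dimensional block offset into a two dimensional grid. It is not the code from my tool: the class and method names are made up for illustration, and it assumes 512-byte blocks and a "500 GByte" disk of 500 x 10^9 bytes.

```java
// Illustrative sketch only. BlockGrid, SECTOR_BYTES and toCell are hypothetical
// names, not taken from the actual visualization code.
final class BlockGrid {
    // Assumption: 512-byte blocks, which gives "about a billion" for 500 GBytes.
    static final long SECTOR_BYTES = 512L;

    // Number of addressable blocks on a device of the given size.
    static long totalBlocks(long deviceBytes) {
        return deviceBytes / SECTOR_BYTES;
    }

    // How many blocks would have to share a single pixel if every block
    // were drawn on the screen at once (rounded up).
    static long blocksPerPixel(long blocks, int screenWidth, int screenHeight) {
        long pixels = (long) screenWidth * screenHeight;
        return (blocks + pixels - 1) / pixels; // ceiling division
    }

    // Fold a one dimensional block offset into {row, column} of a grid
    // that is `columns` blocks wide.
    static long[] toCell(long blockOffset, long columns) {
        return new long[] { blockOffset / columns, blockOffset % columns };
    }
}
```

With these assumptions, BlockGrid.totalBlocks(500_000_000_000L) is 976,562,500 blocks per disk, and the mirrored pair would need roughly a thousand blocks per pixel on a 1600x1200 screen. The grid width passed to toCell is arbitrary; any width works, which is exactly why the 2-D view says nothing about the physical disk geometry.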
It is clear that some sort of 3-D visualization system could help provide some insight into this massive amount of data. So I did it.
Where is the data going?
This is a view of the two devices in the mirror after they have been filled by the recursive copy. Yellow blocks indicate write operations, green blocks are read operations. Since this was a copy into the file system, there aren't very many reads. I would presume that your browser window is not of sufficient resolution to show the few, small reads anyway, so you'll just have to trust me.
What you should be able to see, even at a relatively low resolution, is that we're looking at a 2-D representation of each device from a 3-D viewpoint. Zooming, panning, and moving the viewpoint allows me to observe more or less detail.
To gather this data, I used TNF tracing. I could also write a DTrace script to do the same thing. But I decided to use TNF data because it has been available since Solaris 8 (7-8 years or so) and I have an archive of old TNF traces that I might want to take a look at some day. So what you see here are the I/O operations for each disk during the experiment.
How long did it take? (Or, what is the latency?)
The TNF data also contains latency information. The latency is measured as the difference in time between the start of the I/O and its completion. Using the 3rd dimension, I put the latency in the Z-axis.
Ahhh... this view tells me something interesting. The latency is shown as a line rising from the starting offset of the block being written. You can see some regularity over the space as ZFS will coalesce writes into 128 kByte I/Os. The pattern is more clearly visible on the device on the right.
But wait! What about all of the red? I color the latency line green when the latency is less than 10ms, yellow from 10ms up to 25ms, and red for latency > 25ms. The height of the line is a multiple of its actual latency. Wow! The device on the left has a lot of red. It sure looks slow. And it is. On the other hand, the device on the right sure looks fast. And it is. But this view is still hard to read, even when you can fly around and look at it from different angles. So, I added some icons...
I put icons at the top of the line. If I hover the mouse over an icon, it will show a tooltip which contains more information about that data point. In this case, the tooltip says, "Write, block=202688, size=64, flags=3080101, time=87.85". The size is in blocks, the flags are defined in a header file somewhere, and the time is latency in milliseconds. So we wrote 32 kBytes at block 202,688 in 87.85 ms. This is becoming useful! By cruising around, it becomes apparent that for this slow device, small writes are faster than large writes, which is pretty much what you would expect.
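For the curious, the coloring rule and the tooltip arithmetic boil down to something like the following Java sketch. Again, this is not lifted from my code: the names are invented for illustration, the 512-byte block size is an assumption, and the height scale factor is whatever makes the picture readable.

```java
// Illustrative sketch only. LatencyView, Band and the method names are
// hypothetical, not taken from the actual visualization code.
final class LatencyView {
    enum Band { GREEN, YELLOW, RED }

    // Thresholds from the text: green below 10 ms, yellow up to 25 ms,
    // red above 25 ms.
    static Band band(double latencyMs) {
        if (latencyMs < 10.0) {
            return Band.GREEN;
        }
        if (latencyMs <= 25.0) {
            return Band.YELLOW;
        }
        return Band.RED;
    }

    // Height of the latency line: a multiple of the measured latency.
    // The scale factor is a free parameter chosen for readability.
    static double lineHeight(double latencyMs, double scale) {
        return latencyMs * scale;
    }

    // Tooltip sizes are in blocks. Assumption: 512-byte blocks.
    static long sizeInBytes(long sizeInBlocks) {
        return sizeInBlocks * 512L;
    }
}
```

For the tooltip above, LatencyView.sizeInBytes(64) gives 32,768 bytes (the 32 kBytes mentioned), and LatencyView.band(87.85) lands in RED.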
Finding a place in the world
Now for the kicker. I implemented this as an add-on to NASA's Worldwind.
I floated my devices at 10,000 m above the ocean off the west coast of San Diego! By leveraging the Worldwind for Java SDK, I was able to implement my visualization by writing approximately 2,000 lines of code. This is a pretty efficient way of extending a GIS tool into non-GIS use, while leveraging the fact that GIS tools are inherently designed to look at billions of data points in 3-D.
More details of the experiment
The two devices are intentionally very different from a performance perspective. The device on the left is an old, slow, relatively small IDE disk. The device on the right is a ramdisk.
I believe that this technique can lead to a better view of how systems work under the covers, even beyond disk devices. I've got some cool ideas, but not enough days in the hour to explore them all. Drop me a line if you've got a cool idea.
The astute observer will notice another view of the data just to the north of the devices. This is the ZFS space map allocation of one of the mirror vdevs. More on that later... I've got a movie to put together...