Wednesday Aug 29, 2007

ZFS I/Os in motion

I'm walking a line - I'm thinking about I/O in motion
I'm walking a line - Just barely enough to be living
Get outta the way - No time to begin
This isn't the time - So nothing was biodone
Not talking about - Not many at all
I'm turning around - No trouble at all
You notice there's nothing around you, around you
I'm walking a line - Divide and dissolve.

[theme song for this post is Houses in Motion by the Talking Heads]

Previously, I mentioned a movie. OK, so perhaps it isn't a movie, but an animated GIF.


This is a time-lapse animation of some of the data shown in my previous blog on ZFS usage of mirrors. Here we're looking, at one-second intervals, at the I/O to the slow disk of a two-disk mirrored ZFS file system. The workload is a recursive copy of the /usr/share directory into this file system.

The yellow areas on the device field are write I/O operations. For each time interval, the new I/O operations are shown with their latency elevators. Shorter elevators mean lower latency. Green elevators mean the latency is 10 ms or less, yellow means up to 25 ms, and red means beyond 25 ms. This provides some insight into the way the slab allocator works for ZFS. If you look closely, you can also see the redundant uberblock updates along the lower-right edge. If you can't see that in the small GIF, click on it for a larger version which is easier to see.
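The color thresholds are simple to state. As a sketch (the class and method names here are mine, not from the actual tool), the mapping might look like:

```java
// Sketch of the latency-to-color mapping described above.
// Class and method names are hypothetical, not from the actual tool.
public class LatencyColor {

    /** Map an I/O latency in milliseconds to an elevator color. */
    public static String colorFor(double latencyMs) {
        if (latencyMs <= 10.0) return "green";   // fast: 10 ms or less
        if (latencyMs <= 25.0) return "yellow";  // moderate: up to 25 ms
        return "red";                            // slow: beyond 25 ms
    }

    public static void main(String[] args) {
        System.out.println(colorFor(5.0));   // green
        System.out.println(colorFor(87.85)); // red
    }
}
```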

ZFS makes redundant copies of the metadata. By preference, these are placed in a different slab. You can see this in the animation: occasionally there are writes further out than the bulk of the data writes. As the disk begins to fill, the gaps become filled. Interestingly, the writes to the next slab (metadata) do not have much latency - they are in the green zone. This is a simple IDE disk, so these writes do require a seek. This should help allay one of the fears about ZFS - that its tendency to spread data out will be a performance problem. I see no clear evidence of that here.

I have implemented this as a series of Worldwind layers. This isn't really what Worldwind was designed to do, so there are some inefficiencies in the implementation, or it may be that there is still some trick I have yet to learn.  But it is functional in that you can see I/Os in motion.

Looking at ZFS

A few months ago, I blogged about why I wasn't at JavaOne and mentioned that I was looking at some JOGL code. Now I'm ready to show you some cool pictures which provide a view into how ZFS uses disks.

The examples here show a mirrored disk pair. I created a mirrored zpool using the default ZFS settings, then did a recursive copy of /usr/share into the ZFS file system. This is a write-mostly workload.

There are several problems with trying to visualize this sort of data:

  1. There is a huge number of data points.  A 500 GByte disk has about a billion blocks.  Mirror that and you are trying to visualize two billion data points. My workstation screen size is only 1.92 million pixels (1600x1200) so there is no way that I could see this much data.
  2. If I look at an ASCII table of this data, then it may be hundreds of pages long.  Just for fun, try looking at the output of zdb -dddddd to get an idea of how the data might look in ASCII, but I'll warn you in advance, try this only on a small zpool located on a non-production system.
  3. One dimensional views of the data are possible.  Actually, this is what zdb will show you.  There is some logic to this, since a device is accessed as a single linear set of blocks, using an offset and size for each read or write operation. But this doesn't scale well, especially to a billion data points.
  4. Two dimensional views are also possible, where we basically fold the one dimensional data into a two dimensional array.  This does hide some behaviour: disks are not really two dimensional, they are stacks of circles of different sizes, and those physical details are cleverly hidden and subject to change on a case-by-case basis.  Still, perhaps we can see some info in two dimensions that would help us understand what is happening.
  5. Three dimensional views can show even more data.  This is where JOGL comes in: it is a 3-D library for Java.
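To make point 1 concrete, here is the back-of-the-envelope arithmetic (a sketch; the names are mine, and it assumes 512-byte blocks and decimal gigabytes):

```java
// How many data points land on each screen pixel for a mirrored
// pair of 500 GByte disks on a 1600x1200 display?
public class DataPointScale {

    public static long pointsPerPixel() {
        long diskBytes = 500L * 1000 * 1000 * 1000; // 500 GByte disk
        long blockSize = 512;                       // bytes per block (assumed)
        long blocksPerDisk = diskBytes / blockSize; // ~976 million blocks
        long mirroredPoints = 2 * blocksPerDisk;    // ~1.95 billion data points
        long screenPixels = 1600L * 1200;           // 1.92 million pixels
        return mirroredPoints / screenPixels;       // ~1000 points per pixel
    }

    public static void main(String[] args) {
        System.out.println(pointsPerPixel()); // roughly a thousand
    }
}
```

So even with one data point per pixel, the screen is short by three orders of magnitude, which is why zooming and panning matter.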

It is clear that some sort of 3-D visualization system could help provide some insight into this massive amount of data.  So I did it.

Where is the data going? 

mirrored write iops

This is a view of the two devices in the mirror after they have been filled by the recursive copy. Yellow blocks indicate write operations, green blocks are read operations.  Since this was a copy into the file system, there aren't very many reads. I presume your browser window doesn't have sufficient resolution to show the few, small reads anyway, so you'll just have to trust me.

What you should be able to see, even at a relatively low resolution, is that we're looking at a 2-D representation of each device from a 3-D viewpoint. Zooming, panning, and moving the viewpoint allows me to observe more or less detail.

To gather this data, I used TNF tracing.  I could also have written a DTrace script to do the same thing, but I decided to use TNF data because it has been available since Solaris 8 (7-8 years or so) and I have an archive of old TNF traces that I might want to take a look at some day. So what you see here are the I/O operations for each disk during the experiment.

How long did it take?  (Or, what is the latency?)

The TNF data also contains latency information.  The latency is measured as the difference in time between the start of an I/O and its completion. Using the third dimension, I put the latency on the Z-axis.


Ahhh... this view tells me something interesting. The latency is shown as a line rising from the starting offset of the block being written. You can see some regularity across the space because ZFS coalesces writes into 128 kByte I/Os. The pattern is more clearly visible on the device on the right.

But wait! What about all of the red?  I color the latency line green when the latency is 10 ms or less, yellow up to 25 ms, and red for latency > 25 ms.  The height of the line is a multiple of its actual latency.  Wow!  The device on the left has a lot of red; it sure looks slow.  And it is.  On the other hand, the device on the right sure looks fast.  And it is. But this view is still hard to read, even when you can fly around and look at it from different angles. So, I added some icons...

I put icons at the top of each line. If I hover the mouse over an icon, it shows a tooltip containing more information about that data point. In this case, the tooltip says, "Write, block=202688, size=64, flags=3080101, time=87.85".  The size is in blocks, the flags are defined in a header file somewhere, and the time is the latency in milliseconds.  So we wrote 32 kBytes at block 202,688 in 87.85 ms.  This is becoming useful!  By cruising around, it becomes apparent that for this slow device, small writes are faster than large writes, which is pretty much what you would expect.
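Decoding those tooltip numbers takes a line or two of arithmetic. Here's a sketch (the class and method names are hypothetical, and it assumes 512-byte blocks):

```java
// Decode the tooltip fields: size is in 512-byte blocks,
// time is the I/O latency in milliseconds.
public class IoTooltip {

    static final int BLOCK_BYTES = 512; // assumed disk block size

    /** Convert a size in blocks to kBytes. */
    public static int sizeKBytes(int sizeBlocks) {
        return sizeBlocks * BLOCK_BYTES / 1024;
    }

    /** Effective throughput of a single I/O, in kBytes/s. */
    public static double throughputKBps(int sizeBlocks, double latencyMs) {
        return sizeKBytes(sizeBlocks) / (latencyMs / 1000.0);
    }

    public static void main(String[] args) {
        // The example from the post: size=64, time=87.85
        System.out.println(sizeKBytes(64) + " kBytes");    // 32 kBytes
        System.out.printf("%.0f kBytes/s%n", throughputKBps(64, 87.85));
    }
}
```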

Finding a place in the world

Now for the kicker.  I implemented this as an add-on to NASA's Worldwind.



I floated my devices at 10,000 m above the ocean off the west coast of San Diego! By leveraging the Worldwind for Java SDK, I was able to implement my visualization in approximately 2,000 lines of code. This is a pretty efficient way of extending a GIS tool into non-GIS use, leveraging the fact that GIS tools are inherently designed to look at billions of data points in 3-D.

More details of the experiment

The two devices are intentionally very different from a performance perspective. The device on the left is an old, slow, relatively small IDE disk. The device on the right is a ramdisk. 

I believe that this technique can lead to a better view of how systems work under the covers, even beyond disk devices.  I've got some cool ideas, but not enough days in the hour to explore them all.  Drop me a line if you've got a cool idea. 

The astute observer will notice another view of the data just to the north of the devices. This is the ZFS space map allocation of one of the mirror vdevs. More on that later... I've got a movie to put together...


Tuesday May 08, 2007

Who's not at JavaONE?

I'm not going to JavaONE this year.  And I'm a little sad about that, but I'll make it one day.  I was part of the very first JavaONE prelude.  "Prelude" is the operative term because at that time Sun employees were actively discouraged from attending.  This was a dain bramaged policy which has since been fixed, but at the time the idea was to fill the event with customers and developers.  Now it is just full of every sort of person.  Y'all have fun up there!

So to celebrate this year's JavaONE, I'm learning JOGL.   Why?  Well, I've got some data that I'd like to visualize and I've not found a reasonable tool for doing it in an open, shareable manner.  Stay tuned...



