Wednesday Oct 01, 2008

Hey! Where did my snapshots go? Ahh, new feature...

I've been running Solaris NV b99 for a week or so.  I've also been experimenting with the new automatic snapshot tool, which should arrive in b100 soon. To see what snapshots are taken, you can use zfs's list subcommand.

# zfs list
NAME                     USED    AVAIL    REFER   MOUNTPOINT
rpool                   3.63G    12.0G    57.5K   /rpool
rpool@install             17K        -    57.5K   -


This is typical output and shows that my rpool (root pool) file system has a snapshot which was taken at install time: rpool@install

But in NV b99, suddenly, the snapshots are no longer listed by default. In general, this is a good thing because there may be thousands of snapshots and the long listing is too long for humans to understand. But what if you really do want to see the snapshots?  A new flag has been added to the zfs list subcommand which will show the snapshots.

# zfs list -t snapshot
NAME                     USED    AVAIL    REFER   MOUNTPOINT
rpool@install             17K        -    57.5K   -


This should clear things up a bit and make it easier to manage large numbers of snapshots when using the CLI. If you want to see more details on this change, see the head's up notice for PSARC 2008/469.

Tuesday Sep 02, 2008

Sample RAIDoptimizer output

We often get asked, "what is the best configuration for lots of disks" on the ZFS-discuss forum. There is no one answer to this question because you are really trading-off performance, RAS, and space.  For a handful of disks, the answer is usually easy to figure out in your head.  For a large number of disks, like the 48 disks found on a Sun Fire X4540 server, there are too many permutations to keep straight.  If you review a number of my blogs on this subject, you will see that we can model the various aspects of these design trade-offs and compare.

A few years ago, I wrote a tool called RAIDoptimizer, which will do the math for you for all of the possible permutations. I used the output of this tool to build many of the graphs you see in my blogs.

Today, I'm making available a spreadsheet with a sample run of the permutations of a 48-disk system using reasonable modeling defaults.  In this run, there are 339 possible permutations for ZFS.  The models described in my previous blogs are used to calculate the values.  The default values used are not representative of a specific disk, and merely represent ballpark, default values.  The exact numbers are not as important as the relationships exposed for when you look at different configurations.  Obviously, the tool allows us to change the disk parameters, which are usually available from disk data sheets.  But this will get you into the ballpark, and is a suitable starting point for making some trade-off decisions. 

For your convenience, I turned on the data filters for the columns so that you can easily filter the results. Many people also sort on the various columns.  StarOffice or OpenOffice will let you manipulate the data until the cows come home.  Enjoy.

Wednesday Apr 09, 2008

RAS in the T5140 and T5240

Today, Sun introduced two new CMT servers, the Sun SPARC Enterprise T5140 and T5240 servers.

I'm really excited about this next stage of server development. Not only have we effectively doubled the performance capacity of the system, we did so without significantly decreasing the reliability. When we try to predict reliability of products which are being designed, we make those predictions based on previous generation systems. At Sun, we make these predictions at the component level. Over the years we have collected detailed failure rate data for a large variety of electronic components as used in the environments often found at our customer sites. We use these component failure rates to determine the failure rate of collections of components. For example, a motherboard may have more than 2,000 components: capacitors, resistors, integrated circuits, etc. The key to improving motherboard reliability is, quite simply, to reduce the number of components. There is some practical limit, though, because we could remove many of the capacitors, but that would compromise signal integrity and performance -- not a good trade-off. The big difference in the open source UltraSPARC T2 and UltraSPARC T2plus processors is the high level of integration onto the chip. They really are systems on a chip, which means that we need very few additional components to complete a server design. Fewer components means better reliability, a win-win situation. On average, the T5140 and T5240 only add about 12% more components over the T5120 and T5220 designs. But considering that you get two or four times as many disks, twice as many DIMM slots, and twice the computing power, this is a very reasonable trade-off.

Let's take a look at the system block diagram to see where all of the major components live.

You will notice that the two PCI-e switches are peers and not cascaded. This allows good flexibility and fault isolation. Compared to the cascaded switches in the T5120 and T5220 servers, this is a simpler design. Simple is good for RAS.

You will also notice that we use the same LSI1068E SAS/SATA controller with onboard RAID. The T5140 is limited to 4 disk bays, but the T5240 can accommodate 16 disk bays. This gives plenty of disk targets for implementing a number of different RAID schemes. I recommend at least some redundancy, dual parity if possible.

Some people have commented that the Neptune Ethernet chip, which provides dual-10Gb Ethernet or quad-1Gb Ethernet interfaces is a single point of failure. There is also one quad GbE PHY chip. The reason the Neptune is there to begin with is because when we implemented the coherency links in the UltraSPARC T2plus processor we had to sacrifice the builtin Neptune interface which is available in the UltraSPARC T2 processor. Moore's Law assures us that this is a somewhat temporary condition and soon we'll be able to cram even more transistors onto a chip. This is a case where high integration is apparent in the packaging. Even though all four GbE ports connect to a single package, the electronics inside the package are still isolated. In other words, we don't consider the PHY to be a single point of failure because the failure modes do not cross the isolation boundaries. Of course, if your Ethernet gets struck by lightning, there may be a lot of damage to the server, so there is always the possibility that a single event will create massive damage. But for the more common cabling problems, the system offers suitable isolation. If you are really paranoid about this, then you can purchase a PCI-e card version of the Neptune and put it in PCI-e slot 1, 2, or 3 to ensure that it uses the other PCI-e switch.

The ILOM service processor is the same as we use in most of our other small servers and has been a very reliable part of our systems. It is connected to the rest of the system through a FPGA which manages all of the service bus connections. This allows the service processor to be the serviceability interface for the entire server.

The server also uses ECC FB-DIMMs with Extended ECC, which is another common theme in Sun servers. We have recently been studying the affects of Solaris Fault Management Architecture and Extended ECC on systems in the field and I am happy to report that this combination provides much better system resiliency than possible through the individual features. In RAS, the whole can be much better than the sum of the parts.

For more information on the RAS features of the new T5140 and T5240 servers, see the white paper, Maximizing IT Service Uptime by Utilizing Dependable Sun SPARC Enterprise T5140 and T5240 Servers. The whitepaper has results of our RAS benchmarks as well as some performability calculations.

Tuesday Apr 08, 2008

more on holey files

My colleague Christine asked me some questions about my holey files posts. These are really good questions, and I'm just a little surprised that more people didn't ask them... hey, that is what the comments section is for!  So, I thought I would reply publically, helping to stimulation some conversations.

Q1. How could you have a degraded pool and data corruption w/o a repair?  I assume this pool must be raidz or mirror.

A1. No, this was a simple pool, not protected at the pool level. I used the ZFS copies parameter to set the number of redundant data copies to 2. For more information on how copies works, see my post with pictures.

There is another, hidden question here.  How did I install Indiana such that it uses copies=2? By opening a shell and becoming root prior to beginning the install, I was able to set the copies=2 property just after the storage pool was created. By default, it gets inherited by any subsequent file system creation.  Simple as that.  OK, so it isn't that simple.  I've also experimented with better ways to intercept the zpool create, but am not really happy with my hacks thus far.  A better solution is for the installer to pick up a set of properties, but it doesn't, at least for now.

Q2.  Can a striped pool be in a degraded state?  Wouldn't a device faulting in that pool renders it unusable and therefore faulted?

A2. Yes, a striped storage pool can be in a degraded state. To understand this, you need to know the definitions of DEGRADED and FAULTED.  Fortunately, they are right there in the zpool manual page.



One or more top-level vdevs is in the degraded state because one or more component devices are offline. Sufficient replicas exist to continue functioning.



One or more top-level vdevs is in the faulted state because one or more component devices are offline. Insufficient replicas exist to continue functioning.


By default, there are multiple replicas, so for a striped volume it is possible to be in a DEGRADED state. However, I expect that the more common case will be a FAULTED state. In other words, I do tend to recommend a more redundant storage pool: mirror, raidz, raidz2. 

Q3. What does filling the corrupted part with zero do for me?  It doesn't fix it, those bits weren't zero to begin with.

A3. Filling with zeros will just make sure that the size of the "recovered" file is the same as the original. Some applications get to data in a file via a seek to an offset (random access), so this is how you would want to recover the file.  For applications which process files sequentially, it might not matter.

Thursday Mar 13, 2008

dd tricks for holey files

Bob Netherton took a look at my last post on corrupted file recovery (?) and asked whether I had considered using the noerror option to dd. Yes, I did experiment with dd and the noerror option.

The noerror option is described in dd(1) as:

    noerror Does not stop processing on an input error.
            When an input error occurs, a diagnostic mes-
            sage is written on standard error, followed
            by the current input and output block counts
            in the same format as used at completion. If
            the sync conversion is specified, the missing
            input is replaced with null bytes and pro-
            cessed normally. Otherwise, the input block
            will be omitted from the output.

This looks like the perfect solution, rather than my dd and iseek script. But I didn't post this because, quite simply, I don't really understand what I get out of it.

Recall that I had a corrupted file which is 2.9 MBytes in size. Somewhere around 1.1 MBytes into the file, the data is corrupted and fails the ZFS checksum test.

# zpool scrub zpl_slim
# zpool status -v zpl_slim
  pool: zpl_slim
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.

action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.

scrub: scrub completed after 0h2m with 1 errors on Tue Mar 11 13:12:42 2008

       zpl_slim    DEGRADED     0     0     9
         c2t0d0s0  DEGRADED     0     0     9

errors: Permanent errors have been detected in the following files:
# ls -ls /mnt/root/lib/amd64/
4667 -rwxr-xr-x 1 root bin 2984368 Oct 31 18:04 /mnt/root/lib/amd64/

I attempted to use dd with the noerror flag using several different block sizes to see what I could come up with. Here are those results:

# for i in 1k 8k 16k 32k 128k 256k 512k
> do
dd of=/tmp/whii.$i bs=$i conv=noerror
> done
read: I/O error
1152+0 records in
1152+0 records out
ls -ls /tmp/whii\*
3584 -rw-r--r-- 1 root root 1835008 Mar 13 11:27 /tmp/whii.128k
2464 -rw-r--r-- 1 root root 1261568 Mar 13 11:27 /tmp/whii.16k
2320 -rw-r--r-- 1 root root 1184768 Mar 13 11:27 /tmp/whii.1k
4608 -rw-r--r-- 1 root root 2359296 Mar 13 11:27 /tmp/whii.256k
2624 -rw-r--r-- 1 root root 1343488 Mar 13 11:27 /tmp/whii.32k
7168 -rw-r--r-- 1 root root 3670016 Mar 13 11:27 /tmp/whii.512k
2384 -rw-r--r-- 1 root root 1220608 Mar 13 11:27 /tmp/whii.8k

hmmm... all of these files are of different sizes, so I'm really unsure what I've ended up with. None of them are the same size as the original file, which is a bit unexpected.

# dd of=/tmp/whaa.1k bs=1k conv=noerror
read: I/O error
1152+0 records in
1152+0 records out
read: I/O error
1153+0 records in
1153+0 records out
read: I/O error
1154+0 records in
1154+0 records out
read: I/O error
1155+0 records in
1155+0 records out
read: I/O error
1156+0 records in
1156+0 records out
read: I/O error
1157+0 records in
1157+0 records out
# ls -ls /tmp/whaa.1k
2320 -rw-r--r-- 1 root root 1184768 Mar 13 11:12 /tmp/whaa.1k

hmmm... well, dd did copy some of the file, but seemed to give up after around 5 attempts and I only seemed to get the first 1.1 MBytes of the file. What is going on here? A quick look at the dd source (open source is a good thing) shows that there is a definition of BADLIMIT which is how many times dd will try before giving up. The default compilation sets BADLIMIT to 5. Aha! A quick download of the dd code and I set BADLIMIT to be really huge and tried again.

# bigbaddd of=/tmp/whbb.1k bs=1k conv=noerror
read: I/O error
1152+0 records in
1152+0 records out
read: I/O error
3458+0 records in
3458+0 records out
\^C I give up
# ls -ls /tmp/whbb.1k
6920 -rw-r--r-- 1 root root 3543040 Mar 13 11:47 /tmp/whbb.1k

As dd processes the input file, it doesn't really do a seek, so it can't really get past the corruption. It is getting something, because od shows that the end of the whbb.1k file is not full of nulls. But I really don't believe this is the data in a form which could be useful. And I really can't explain why the new file is much larger than the original. I suspect that dd gets stuck at the corrupted area and does not seek beyond it. In any case, it appears that letting dd do the dirty work by itself will not acheive the desired results. This is, of course, yet another opportunity...

Wednesday Mar 12, 2008

Holy smokes! A holey file!

I was RASing around with ZFS the other day, and managed to find a file which was corrupted.

# zpool scrub zpl_slim
# zpool status -v zpl_slim
  pool: zpl_slim
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.

action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
 scrub: scrub completed after 0h2m with 1 errors on Tue Mar 11 13:12:42 2008
        NAME        STATE     READ WRITE CKSUM
        zpl_slim    DEGRADED     0     0     9
          c2t0d0s0  DEGRADED     0     0     9

errors: Permanent errors have been detected in the following files:

# ls -ls /mnt/root/lib/amd64/
4667 -rwxr-xr-x 1 root bin 2984368 Oct 31 18:04 /mnt/root/lib/amd64/

argv! Of course, this particular file is easily extracted from the original media, it does't contain anything unique. For those who might be concerned that it is the C runtime library, and thus very critical to running Solaris, the machine in use is only 32-bit, so the 64-bit (amd64) version of this file is never used. But suppose this were an important file for me and I wanted to recover something from it? This is a more interesting challenge...

First, let's review a little bit about how ZFS works. By default, when ZFS writes anything, it generates a checksum which is recorded someplace else, presumably safe. Actually, the checksum is recorded at least twice, just to be doubly sure it is correct. And that record is also checksummed. Back to the story, the checksum is computed on a block, not for the whole file. This is an important distinction which will come into play later. If we perform a storage pool scrub, ZFS will find the broken file and report it to you (see above), which is a good thing -- much better than simply ignoring it, like many other file systems will do.

OK, so we know that somewhere in the midst of this 2.8 MByte file, we have some corruption. But can we at least recover the bits that aren't corrupted? The answer is yes. But if you try a copy, then it bails with an error.

# cp /mnt/root/lib/amd64/ /tmp
/mnt/root/lib/amd64/ I/O error

Since the copy was not successful, there is no destination file, not even a partial file. It turns out that cp uses mmap(2) to map the input file and copies it to the output file with a big write(2). Since the write doesn't complete correctly, it complains and removes the output file. What we need is something less clever, dd.

# dd if=/mnt/root/lib/amd64/ of=/tmp/whee
read: I/O error
2304+0 records in
2304+0 records out
# ls -ls /tmp/whee
2304 -rw-r--r-- 1 root root 1179648 Mar 12 18:53 /tmp/whee

OK, from this experiment we know that we can get about 1.2 MBytes by directly copying with dd. But this isn't all, or even half of the file. We can get a little more clever than that. To make it simpler, I wrote a little ksh script:

integer i=0
while ((i < 23))
    typeset -RZ2 j=$i
    dd if=$1 of=$2.$j bs=128k iseek=$i count=1

This script will write each of the first 23 128kByte blocks from the first argument (a file) to a unique filename as a number appended to the second argument. dd is really dumb and doesn't offer much error handling which is why I hardwired the count into the script. An enterprising soul with a little bit of C programming skill could do something more complex which handles the more general case. Ok, that was difficult to understand, and I wrote it. To demonstrate, I first appologize for the redundant verbosity:

# ./getaround.ksh /tmp/zz
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
read: I/O error
0+0 records in
0+0 records out

1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
0+1 records in
0+1 records out
# ls -ls /tmp/zz.\*
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.00
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.01
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.02
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.03
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.04
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.05
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.06
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.07
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.08
   0 -rw-r--r-- 1 root root      0 Mar 12 19:00 /tmp/zz.09
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.10
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.11
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.12
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.13
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.14
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.15
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.16
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.17
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.18
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.19
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.20
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.21
 200 -rw-r--r-- 1 root root 100784 Mar 12 19:00 /tmp/zz.22

So we can clearly see that the 10th (128kByte) block is corrupted, but the rest of the blocks are ok. We can now reassemble the file with a zero-filled block.

# dd if=/dev/zero of=/tmp/zz.09 bs=128k count=1
1+0 records in
1+0 records out
# cat /tmp/zz.\* > /tmp/zz
# ls -ls /tmp/zz
5832 -rw-r--r-- 1 root root 2984368 Mar 12 19:03 /tmp/zz

Now I have recreated the file with a zero-filled hole where the data corruption was. Just for grins, if you try to compare with the previous file, you should get what you expect.

# cmp /tmp/zz+
cmp: EOF on

How is this useful?

Personally, I'm not sure this will be very useful for many corruption cases. As a RAS guy, I advocate many verified copies of important data placed on diverse systems and media. But most folks aren't so inclined. Everytime we talk about this on the zfs-discuss alias, somebody will say that they don't care about corruption in the middle of their mp3 files. I'm no audiophile, but I prefer my mp3s to be hole-less. So I did this little exercise to show how you can regain full access to the non-corrupted bits of a corrupted file in a more-or-less easy way. Consider this a proof of concept. There are many possible variations, such as filling with spaces instead of nulls when you are missing parts of a text file -- opportunities abound.

Tuesday Sep 18, 2007

Space Maps from Space

Jeff Bonwick recently blogged about why ZFS uses space maps for keeping track of allocations. In my recent blog on looking at ZFS I teased you with a comment about the space map floating near the Channel Islands. Now that Jeff has explained how they work, I'll show you what they look like as viewed from space.

 Space map

 This is a view of a space map for a ZFS file system which was created as a recursive copy of the /usr directory followed by a recursive remove of the /usr/share directory. This allows you to see how some space is allocated and some space is free.

I wrote an add-on to NASA's Worldwind to parse zdb output looking for the space map information. Each allocation appears as a green rectangle with a starting offset and length mapped onto a square field floating above the earth. The allocations are green and the frees are yellow.  The frees are also floating 100m above the allocations, though it is not easy to see from this view. Each map entry also has an optional user-facing icon which shows up as a shadowed green or yellow square. I snagged these from the StarOffice bullets images. If you hover the mouse over an icon, then a tool tip will appear showing the information about the space.  In this example, the tooltip says "Free, txg=611, pass=1, offset=53fe000, size=800"

I can think of about a half dozen cool extensions to make for this, such as showing metaslab boundaries.  I also need to trim the shadow field to fit; it extends too far on the right.  So much to do, so little time...


Wednesday Aug 29, 2007

ZFS I/Os in motion

I'm walking a line - I'm thinking about I/O in motion
I'm walking a line - Just barely enough to be living
Get outta the way - No time to begin
This isn't the time - So nothing was biodone
Not talking about - Not many at all
I'm turning around - No trouble at all
You notice there's nothing around you, around you
I'm walking a line - Divide and dissolve.

[theme song for this post is Houses in Motion by the Talking Heads]

Previously, I mentioned a movie. OK, so perhaps it isn't a movie, but an animated GIF.


This is a time-lapse animation of some of the data shown in my previous blog on ZFS usage of mirrors. Here we're looking at one second intervals and the I/O to the slow disk of a two-disk mirrored ZFS file system. The workload is a recursive copy of the /usr/share directory into this file system.

The yellow areas on the device field are write I/O operations. For each time interval, the new I/O operations are shown with their latency elevators. Shorter elevators mean lower latency. Green elevators mean the latency is 10ms or less, yellow until 25ms, and red beyond 25ms. This provides some insight into the way the slab allocator works for ZFS. If you look closely, you can also see the redundant uberblock updates along the lower-right side near the edge. If you can't see that in the small GIF, click on the GIF for a larger version which is easier to see.

ZFS makes redundant copies of the metadata. By preference, these will be placed in a different slab. You can see this in the animation as there are occasionally writes further out than the bulk of the data writes. As the disk begins to fill, the gaps become filled.  Interestingly, the writes to the next slab (metadata) do not have much latency - they are in the green zone. This is a simple IDE disk, so there is a seek required by these writes. This should help allay one of the fears of ZFS, that the tendency to have data spread out will be a performance problem - I see no clear evidence of that here. 

I have implemented this as a series of Worldwind layers. This isn't really what Worldwind was designed to do, so there are some inefficiencies in the implementation, or it may be that there is still some trick I have yet to learn.  But it is functional in that you can see I/Os in motion.

Looking at ZFS

A few months ago, I blogged about why I wasn't at JavaOne and mentioned that I was looking at some JOGL code. Now I'm ready to show you some cool pictures which provide a view into how ZFS uses disks.

The examples here show a mirrored disk pair. I created a mirrored zpool and use the default ZFS settings. I then did a recursive copy of /usr/share into the ZFS file system. This is a write-mostly workload.

There are several problems with trying to visualize this sort of data:

  1. There is a huge number of data points.  A 500 GByte disk has about a billion blocks.  Mirror that and you are trying to visualize two billion data points. My workstation screen size is only 1.92 million pixels (1600x1200) so there is no way that I could see this much data.
  2. If I look at an ASCII table of this data, then it may be hundreds of pages long.  Just for fun, try looking at the output of zdb -dddddd to get an idea of how the data might look in ASCII, but I'll warn you in advance, try this only on a small zpool located on a non-production system.
  3. One dimensional views of the data are possible.  Actually, this is what zdb will show for you.  There is some reasoning here because a device is accessed as a single set of blocks using an offset and size for read or write operations. But this doesn't scale well, especially to a billion data points.
  4. Two dimensional views are also possible, where we basically make a two dimensional array of the one dimensional data.  This does hide some behaviour, as disks are two dimensional, but they are stacks of circles of different sizes.  These physical details are cleverly hidden and subject to change on a per-case basis.  So, perhaps we can see some info in two dimensions that would help us understand what is happening.
  5. Three dimensional views can show even more data.  This is where JOGL comes in, it is a 3-D libary for JAVA.

It is clear that some sort of 3-D visualization system could help provide some insight into this massive amount of data.  So I did it.

Where is the data going? 

mirrored write iops

This is a view of the two devices in the mirror after they have been filled by the recursive copy. Yellow blocks indicate write operations, green blocks are read operations.  Since this was a copy into the file system, there aren't very many reads. I would presume that your browser window is not of sufficient resolution to show the few, small reads anyway, so you'll just have to trust me.

What you should be able to see, even at a relatively low resolution, is that we're looking at a 2-D representation of each device from a 3-D viewpoint. Zooming, panning, and moving the viewpoint allows me to observe more or less detail.

To gather this data, I used TNF tracing.  I could also write a dtrace script to do the same thing. But I decided to use TNF data because it has been available since Solaris 8 (7-8 years or so) and I have an archive of old TNF traces that I might want to take a look at some day. So what you see here are the I/O operations for each disk during the experiment.

How long did it take?  (Or, what is the latency?)

The TNF data also contains latency information.  The latency is measured as the difference in time between the start of the I/O and its completion. Using the 3rd dimension, I put the latency in the Z-axis.


Ahhh... this view tells me something interesting. The latency is shown as a line emitting from the starting offset of the block being written. You can see some regularity over the space as ZFS will coalesce writes into 128 kByte I/Os. The pattern is more clearly visible on the device on the right.

 But wait! What about all of the red?  I color the latency line green when the latency is less than 10ms, yellow until 25ms, and red for latency > 25ms.  The height of the line is a multiple of its actual latency.  Wow!  The device on the left has a lot of red, it sure looks slow.  And it is.  On the other hand, the device on the right sure looks fast.  And it is. But this view is still hard to see, even when you can fly around and look at it from different angles. So, I added some icons...

I put icons at the top of the line. If I hover the mouse over an icon, it will show a tooltip which contains more information about that data point. In this case, the tooltip says, "Write, block=202688, size=64, flags=3080101, time=87.85"  The size is in blocks, the flags are defined in a header file somewhere, and the time is latency in milliseconds.  So we wrote 32 kBytes at block 202,688 in 87.85 ms.  This is becoming useful!  By cruising around, it becomes apparent that for this slow device, small writes are faster than large writes, which is pretty much what you would expect.

Finding a place in the world

Now for the kicker.  I implemented this as an add-on to NASA's Worldwind.



 I floated my devices at 10,000 m above the ocean off the west coast of San Diego! By leveraging the Worldwind for Java SDK, I was able to implement my visualization by writing approximately 2,000 lines of code. This is a pretty efficient way of extending a GIS tool into non-GIS use, while leveraging the fact that GIS tools are inherently designed to look at billions of data points in 3-D.

More details of the experiment

The two devices are intentionally very different from a performance perspective. The device on the left is an old, slow, relatively small IDE disk. The device on the right is a ramdisk. 

I believe that this technique can lead to a better view of how systems work under the covers, even beyond disk devices.  I've got some cool ideas, but not enough days in the hour to explore them all.  Drop me a line if you've got a cool idea. 

The astute observer will notice another view of the data just to the north of the devices. This is the ZFS space map allocation of one of the mirror vdevs. More on that later... I've got a movie to put together...


Monday Jul 30, 2007

Solaris Cluster Express is now available

As you have probably already heard, we have begun to release Solaris Cluster source at the OpenSolaris website. Now we are also releasing a binary version for use with Solaris Express. You can download the bits from the download center.

Share and enjoy!

Monday Jul 16, 2007

San Diego OpenSolaris User's Group meeting this week

Meeting Time: Wednesday, July 18 at 6:00pm
Sun Microsystems
9515 Towne Centre Drive
San Diego, CA 92121
Building SAN10 - 2nd floor - Gas Lamp Conference Room
Map to Sun San Diego

On July 18, Ryan Scott will be presenting "Building and Deploying OpenSolaris." Ryan will demonstrate how to download the OpenSolaris source code, how to make source code changes, and how to build and install these changes.

Ryan is a kernel engineer in the Solaris Core Technology Group, working on implementing Solaris on the Xen Hypervisor. In previous work at Sun, he worked on Predictive Self Healing, for which he received a Chairman's Award for Innovation. He has also worked on error recovery and SPARC platform sustaining. Ryan joined Sun in 2001 after receiving a BSE in Computer Engineering from Purdue University.

Information about past meetings is here.

Monday Jun 25, 2007

Brain surgery: /usr/ccs tumor removed

Sometimes it just takes way too long to remove brain damage. Way back when Solaris 2.0 was forming, someone had the bright idea to move the  C compilation system from /usr to /usr/ccs. I suppose the idea was that since you no longer had to compile the kernel, the C compiler no longer needed to be in the default user environment.  I think the same gremlin also removed /usr/games entirely, another long-time staple.  This move also coincided with the "planetization of Sun" idea, so the compilers were split off to become their own profit and loss center.  IMHO, this is the single biggest reason gcc ever got any traction beyond VAXen.  But I digress...

No matter what the real reasons were, and who was responsible (this was long before I started working at Sun), I am pleased to see that /usr/ccs is being removed. I've long been an advocate of the thought that useful applications should be in the default user environment. We should never expose our company organization structure in products, especially since we're apt to reorganize far more often than products change. IMHO, the /usr/ccs fiasco was exposing our customers to pain because of our organizational structure.  Brain damage. Cancer. A bad thing.

I performed a study of field installed software a few years ago. It seems that Sun makes all sorts of software which nobody knows about because it is not installed by default, or is not installed into the default user environment. I'm very happy to see all of the positive activity in the OpenSolaris community to rectify this situation and make Solaris a better out of the box experience.  We still have more work to do, but removing the cancerous brain damage that was /usr/ccs is a very good sign that we are moving in the right direction.

Friday May 04, 2007

ZFS, copies, and data protection

OpenSolaris build 61 (or later) is now available for download. ZFS has added a new feature that will improve data protection: redundant copies for data (aka ditto blocks for data). Previously, ZFS stored redundant copies of metadata. Now this feature is available for data, too.

This represents a new feature which is unique to ZFS: you can set the data protection policy on a per-file system basis, beyond that offered by the underlying device or volume. For single-device systems, like my laptop with its single disk drive, this is very powerful. I can have a different data protection policy for the files that I really care about (my personal files) than the files that I really don't care about or that can be easily reloaded from the OS installation DVD. For systems with multiple disks assembled in a RAID configuration, the data protection is not quite so obvious. Let's explore this feature, look under the hood, and then analyze some possible configurations.

Using Copies

To change the numbers of data copies, set the copies property. For example, suppose I have a zpool named "zwimming." The default number of data copies is 1. But you can change that to 2 quite easily.

# zfs set copies=2 zwimming

The copies property works for all new writes, so I recommend that you set that policy when you create the file system or immediately after you create a zpool.

You can verify the copies setting by looking at the properties.

# zfs get copies zwimming
zwimming  copies    2         local

ZFS will account for the space used. For example, suppose I create three new file systems and copy some data to them. You can then see that the space used reflects the number of copies. If you use quotas, then the copies will be charged against the quotas, too.

# zfs create -o copies=1 zwimming/single
# zfs create -o copies=2 zwimming/dual
# zfs create -o copies=3 zwimming/triple
# cp -rp /usr/share/man1 /zwimming/single
# cp -rp /usr/share/man1 /zwimming/dual
# cp -rp /usr/share/man1 /zwimming/triple
# zfs list -r zwimming                                                       
zwimming 48.2M 310M 33.5K /zwimming
zwimming/dual 16.0M 310M 16.0M /zwimming/dual
zwimming/single 8.09M 310M 8.09M /zwimming/single
zwimming/triple 23.8M 310M 23.8M /zwimming/triple

This makes sense. Each file system has one, two, or three copies of the data and will use correspondingly one, two, or three times as much space to store the data.

Under the Covers

ZFS will spread the ditto blocks across the vdev or vdevs to provide spatial diversity. Bill Moore has previously blogged about this, or you can see it in the code for yourself. From a RAS perspective, this is a good thing. We want to reduce the possibility that a single failure, such as a drive head impact with media, could disturb both copies of our data. If we have multiple disks, ZFS will try to spread the copies across multiple disks. This is different than mirroring, in subtle ways. The actual placement is ultimately based upon available space. Let's look at some simplified examples. First, for the default file system configuration settings on a single disk.

Default, simple config

Note that there are two copies of the metadata, by default. If we have two or more copies of the data, the number of metadata copies is three.

ZFS, 2 copies 

Suppose you have a 2-disk stripe. In that case, ZFS will try to spread the copies across the disks.

ZFS, 2 copies, 2 disks

Since the copies are created above the zpool, a mirrored zpool will faithfully mirror the copies.


ZFS, copies=2, mirrored

Since the copies policy is set at the file system level, not the zpool level, a single zpool may contain multiple file systems, each with different policies. In other words, you could have data which is not copied allocated along with data that is copied.


ZFS, mixed copies

Using different policies for different file systems allows you to have different data protection policies, allows you to improve data protection, and offers many more permutations of configurations for you to weigh in your designs.

RAS Modeling

It is obvious that increasing the number of data copies will effectively reduce the amount of available space accordingly. But how will this affect reliability? To answer that question we use the MTTDL[2] model I previously described, with the following changes:

First, we calculate the probability of unsuccessful reconstruction due to a UER for N disks of a given size (unit conversion omitted). The number of copies decreases this probability. This makes sense as we could use another copy of the data for reconstruction and to completely fail, we'd need to lose all copies:
Precon_fail = ((N-1) \* size / UER)copies
For single-disk failure protection:
MTTDL[2] = MTBF / (N \* Precon_fail)
For double-disk failure protection:
MTTDL[2] = MTBF2/ (N \* (N-1) \* MTTR \* Precon_fail)

Note that as the number of copies increases, Precon_fail approaches zero quickly. This will increase the MTTDL. We want higher MTTDL, so this is a good thing.

OK, now that we can calculate available space and MTTDL, let's look at some configurations for 46 disks available on a Sun Fire X4500 (aka Thumper). We'll look at single parity schemes, to reduce the clutter, but double parity schemes will show the same, relative improvements.

ZFS, X4500 single parity schemes with copies

bigger view 

You can see that we are trading off space for MTTDL. You can also see that for raidz zpools, having more disks in the sets reduces the MTTDL. It gets more interesting to see that the 2-way mirror with copies=2 is very similar in space and MTTDL to the 5-disk raidz with copies=3. Hmm. Also, the 2-way mirror with copies=1 is similar in MTTDL to the 7-disk raidz with copies=2, though the mirror configurations allow more space. This information may be useful as you make trade-offs. Since the copies parameter is set per file system, you can still set the data protection policy for important data separately from unimportant data. This might be a good idea for some situations where you might have permanent originals (eg. CDs, DVDs) and want to apply a different data protection policy.

In the future, once we have a better feel for the real performance considerations, we'll be able to add a performance component into the analysis.

Single Device Revisited

Now that we see how data protection is improved, let's revisit the single device case. I use the term device here because there is a significant change occurring in storage as we replace disk drives with solid state, non-volatile memory devices (eg. flash disks and future MRAM or PRAM devices). A large number of enterprise customers demand dual disk drives for mirroring root file systems in servers. However, there is also a growing demand for solid state boot devices, and we have some Sun servers with this option. Some believe that by 2009, the majority of laptops will also have solid state devices instead of disk drives. In the interim, there are also hybrid disk drives.

What affect will these devices have on data retention? We know that if the entire device completely fails, then the data is most likely unrecoverable. In real life, these devices can suffer many failures which result in data loss, but which are not complete device failures. For disks, we see the most common failure is an unrecoverable read where data is lost from one or more sector (bar 1 in the graph below). For flash memories, there is an endurance issue where repeated writes to a cell may reduce the probability of reading the data correctly. If you only have one copy of the data, then the data is lost, never to be read correctly again.

We captured disk error codes returned from a number of disk drives in the field. The Pareto chart below shows the relationship between the error codes. Bar 1 is the unrecoverable read which accounts for about 24% of the errors recorded. The violet bars show recoverable errors which did succeed. Examples of successfully recovered errors are: write error - recovered with block reallocation, read error - recovered by ECC using normal retries, etc. The recovered errors do not (immediately) indicate a data loss event, so they are largely transparent to applications. We worry more about the unrecoverable errors.


Disk error Pareto chart

Approximately 1/3 of the errors were unrecoverable. If such an error occurs in ZFS metadata, then ZFS will try to read alternate metadata copy and repair the metadata. If the data has multiple copies, then it is likely that we will not lose any data. This is a more detailed view of the storage device because we are not treating all failures as a full device failure.

Both real and anecdotal evidence suggests that unrecoverable errors can occur while the device is still largely operational. ZFS has the ability to survive such errors without data loss. Very cool. Murphy's Law will ultimately catch up with you, though. In the case where ZFS cannot recover the data, ZFS will tell you which file is corrupted. You can then decide whether or not you should recover it from backups or source media.

Another Single Device

Now that I've got you to think of the single device as a single device, I'd like to extend the thought to RAID arrays. There is much confusion amongst people about whether ZFS should or should not be used with RAID arrays. If you search, you'll find comments and recommendations both for and against using hardware RAID for ZFS. The main argument is centered around the ability of ZFS to correct errors. If you have a single device backed by a RAID array with some sort of data protection, then previous versions of ZFS could not recover data which was lost. Hold it right there, fella! Do I mean that RAID arrays and the channel from the array to main memory can have errors? Yes, of course! We have seen cases where errors were introduced somewhere along the path between disk media to main memory where data was lost or corrupted. Prior to ZFS, these were silent errors and blissfully ignored. With ZFS, the checksum now detects these errors and tries to recover. If you don't believe me, then watch the ZFS forum on where we get reports like this about once a month or so. With ZFS copies, you can now recover from such errors without changing the RAID array configuration.

If ZFS can correct a data error, it will attempt to do so. You now have a the option to improve your data protection even when using a single RAID LUN. And this is the same mechanism we can use for a single disk or flash drive: data copies. You can implement the copies on a per-file system basis and thus have different data protection policies even though the data is physically stored on a RAID LUN in a hardware RAID array. I really hope we can put to rest the "ZFS prefers JBOD" argument and just concentrate our efforts on implementing the best data protection policies for the requirements.

ZFS with data copies is another tool in your toolbelt to improve your life, and the life of your data.

Monday Apr 23, 2007

Mainframe inspired RAS features in new SPARC Enterprise Servers

My colleague, Gary Combs, put together a podcast describing the new RAS features found in the Sun SPARC Enterprise Servers. The M4000, M5000, M8000, and M9000 servers have very advanced RAS features, which put them head and shoulders above the competition. Here is my list of favorites, in no particular order:

  1. Memory mirroring. This is like RAID-1 for main memory. As I've said many times, there are 4 types of components which tend to break most often: disks, DIMMs (memory), fans, and power supplies. Memory mirroring brings the fully redundant reliability techniques often used for disks, fans, and power supplies to DIMMs.
  2. Extended ECC for main memory.  Full chip failures on a DIMM can be tolerated.
  3. Instruction retry. The processor can detect faulty operation and retry instructions. This feature has been available on mainframes, and is now available for the general purpose computing markets.
  4. Improved data path protection. Many improvements here, along the entire data path.  ECC protection is provided for all of the on-processor memory.
  5. Reduced part count from the older generation Sun Fire E25K.  Better integration allows us to do more with fewer parts while simultaneously improving the error detection and correction capabilities of the subsystems.
  6. Open-source Solaris Fault Management Architecture (FMA) integration. This allows systems administrators to see what faults the system has detected and the system will automatically heal itself.
  7. Enhanced dynamic reconfiguration.  Dynamic reconfiguration can be done at the processor, DIMM (bank), and PCI-E (pairs) level of grainularity.
  8. Solaris Cluster support.  Of course Solaris Cluster is supported including clustering between Solaris containers, dynamic system domains, or chassis.
  9. Comprehensive service processor. The service processor monitors the health of the system and controls system operation and reconfiguration. This is the most advanced service processor we've developed. Another welcome feature is the ability to delegate responsibilities to different system administrators with restrictions so that they cannot control the entire chassis.  This will be greatly appreciated in large organizations where multiple groups need computing resources.
  10. Dual power grid. You can connect the power supplies to two different power grids. Many people do not have the luxury of access to two different power grids, but those who have been bitten by a grid outage will really appreciate this feature.  Think of this as RAID-1 for your power source.

I don't think you'll see anything revolutionary in my favorites list. This is due to the continuous improvements in the RAS technologies.  The older Sun Fire servers were already very reliable, and it is hard to create a revolutionary change for mature technologies.  We have goals to make every generation better, and we've made many advances with this new generation.  If the RAS guys do their job right, you won't notice it - things will just keep working.

Wednesday Apr 18, 2007

Is eWeek living on internet time?

eWeek has published a nice article describing Sun's new, low-cost RAID array: the Sun StorageTek ST2500 Low Cost Array.  This is an interesting new product that has broad appeal and will be a heck of a good box to run under ZFS.

But I'm worried about eWeek.  It seems that they've lost track of time.  Many of us have been running on internet time for most of our lives.  This quote makes me wonder if eWeek forgot to update their timezone data:

"The ZFS [the speedy Zeta[sic] file system, recently released to the open-source community by Sun] is very interesting, and people are looking at it," [Henry] Baltazar told eWeek.

I'm pretty sure Henry Baltazar is running on internet time, and provided a very nice quote.  But whoever added the editorial clarification at eWeek spelled Zettabyte wrong and is running on island timeZFS was released to the open-source community on June 14, 2005 - nearly 2 years ago in real time.  Even in real time, 2 years can hardly be considered "recent."  Sigh.





« July 2016