Wednesday Nov 05, 2008

Roch star and Really Cool Stuff

Our very own Roch star appears in a new video released as part of today's MySQL releases. In the video, "A Look Inside Sun's MySQL Optimization Lab," Roch gives a short tour, and at around 3:00 you get a glimpse of some Really Cool Stuff which might be of interest to ZFS folks.

Wednesday Oct 01, 2008

Hey! Where did my snapshots go? Ahh, new feature...

I've been running Solaris NV b99 for a week or so.  I've also been experimenting with the new automatic snapshot tool, which should arrive in b100 soon. To see what snapshots are taken, you can use zfs's list subcommand.

# zfs list
NAME                     USED    AVAIL    REFER   MOUNTPOINT
rpool                   3.63G    12.0G    57.5K   /rpool
rpool@install             17K        -    57.5K   -


This is typical output and shows that my rpool (root pool) file system has a snapshot which was taken at install time: rpool@install

But in NV b99, suddenly, the snapshots are no longer listed by default. In general, this is a good thing because there may be thousands of snapshots, and a listing that long is too much for a human to scan. But what if you really do want to see the snapshots? A new flag has been added to the zfs list subcommand which will show the snapshots.

# zfs list -t snapshot
NAME                     USED    AVAIL    REFER   MOUNTPOINT
rpool@install             17K        -    57.5K   -


This should clear things up a bit and make it easier to manage large numbers of snapshots when using the CLI. If you want to see more details on this change, see the heads-up notice for PSARC 2008/469.
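For scripts that need to behave the same way before and after this change, one option is to ask for snapshots explicitly, or to filter on the '@' that every snapshot name contains. The sketch below runs such a filter against a copy of the sample listing above. The function names are made up for illustration, and on a live system the input would come from zfs list itself.

```shell
#!/bin/sh
# Hypothetical helper names; the sample text is copied from the listing above.
sample_zfs_list() {
    cat <<'EOF'
NAME                     USED    AVAIL    REFER   MOUNTPOINT
rpool                   3.63G    12.0G    57.5K   /rpool
rpool@install             17K        -    57.5K   -
EOF
}

# Print only the names of snapshots: skip the header line and keep
# rows whose first column contains '@'.
snapshots_only() {
    awk 'NR > 1 && $1 ~ /@/ { print $1 }'
}

sample_zfs_list | snapshots_only

# On a live system the explicit form would be something like:
#   zfs list -H -o name -t snapshot
```

Filtering on the name keeps working whether or not the default listing includes snapshots, which is handy when the same script runs on several builds.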

Tuesday Sep 02, 2008

Sample RAIDoptimizer output

On the ZFS-discuss forum we often get asked, "What is the best configuration for lots of disks?" There is no one answer to this question because you are really trading off performance, RAS, and space.  For a handful of disks, the answer is usually easy to figure out in your head.  For a large number of disks, like the 48 disks found on a Sun Fire X4540 server, there are too many permutations to keep straight.  If you review a number of my blogs on this subject, you will see that we can model the various aspects of these design trade-offs and compare.

A few years ago, I wrote a tool called RAIDoptimizer, which will do the math for you for all of the possible permutations. I used the output of this tool to build many of the graphs you see in my blogs.

Today, I'm making available a spreadsheet with a sample run of the permutations of a 48-disk system using reasonable modeling defaults.  In this run, there are 339 possible permutations for ZFS.  The models described in my previous blogs are used to calculate the values.  The default values are not representative of a specific disk; they are merely ballpark figures.  The exact numbers are not as important as the relationships exposed when you compare different configurations.  Obviously, the tool allows us to change the disk parameters, which are usually available from disk data sheets.  But this will get you into the ballpark, and is a suitable starting point for making some trade-off decisions.

For your convenience, I turned on the data filters for the columns so that you can easily filter the results. Many people also sort on the various columns.  StarOffice or OpenOffice will let you manipulate the data until the cows come home.  Enjoy.

Tuesday Aug 05, 2008

More Enterprise-class SSDs Coming Soon

Sun has been talking more and more about enterprise-class solid-state disks (SSDs) lately. Even Jonathan blogged about it. Now we are starting to see some interesting articles hitting the press as various companies prepare to release interesting products for this market.

Today, CNET posted an interesting article by Brooke Crothers that offers some insight into how the consumer and enterprise class devices are diverging in their designs.  My favorite quote is, "One of the things that SSD manufacturers have been slow to learn (is that) you can't just take a compact flash controller, throw some NAND on there and call it an SSD," said Dean Klein, vice president of memory system development at Micron. Yes, absolutely correct.  Though Sun makes several products which offer compact flash (CF) for storage, the future of enterprise class SSDs is not re-badged CFs.  There are many more clever tricks that can be used to provide highly reliable, fast, and reasonably priced SSDs.

Monday Jul 14, 2008

ZFS Workshop at LISA'08

We have organized a ZFS Workshop for the USENIX Large Installation Systems Administration (LISA'08) conference in San Diego this November. I hope you can attend.

The call for papers describes workshops as:
One-day workshops are hands-on, participatory, interactive sessions where small groups of system administrators have an opportunity to discuss a topic of common interest. Workshops are not intended as tutorials, and participants normally have significant experience in the appropriate area, enabling discussions at a peer level. However, attendees with less experience often find workshops useful and are encouraged to discuss attendance with the workshop organizer.

There is an opportunity to seed the discussions, so be sure to let me know if there is an interesting topic to be explored. 

The LISA conference is always one of the more interesting conferences for people who must deal with large sites as their day job. Many of the more difficult scalability problems are discussed in the sessions and hallways. If you are directly involved with the design or management of a large computer site, then it is an excellent conference to attend.

My first LISA was LISA-VI in 1992 where I presented a paper that Matt Long and I wrote, User-setup: A System for Custom Configuration of User Environments, or Helping Users Help Themselves, now hanging out on SourceForge. The original source was published on usenet -- which is how we did such things at the time. I suppose I could search around and find it archived somewhere...

Much has changed from the environments we had in 1992, but the problem of managing complex application environments continues to live on. I think that the more modern approach to this problem, as clearly demonstrated by connected devices like the iPhone, is to leverage the internet and browser-like interfaces to hide much of the complexity behind the scenes.  In a sense, this is the approach ZFS takes to managing disks -- hide some of the mundane trivia and provide a view of storage that is more intuitive to the users of storage. The more things change, the more the problems stay the same.

Please attend LISA'08 and join the ZFS workshop.

Thursday Jun 26, 2008

Awesome disk AFR! Or, is it...

I was hanging out in the ZFS chat room when someone said they were using a new Seagate Barracuda 7200.11 SATA 3Gb/s 1-TB Hard Drive. A quick glance at the technical specs revealed a reliability claim of 0.34% Annualized Failure Rate (AFR).  Holy smokes!  This is well beyond what we typically expect from disks.  Doubling the reliability would really make my day. My feet started doing a happy dance.

So I downloaded the product manual to get all of the gritty details. It looks a lot like most of the other large, 3.5" SATA drive specs out there; so far so good. I get to the Reliability Section (section 2.11, page 18) to look for more nuggets.

Immediately, the following raised red flags with me and my happy feet stubbed a toe.

The product shall achieve an Annualized Failure Rate (AFR) of 0.34% (MTBF of 0.7 million hours) when operated in an environment of ambient air temperatures of 25°C. Operation at temperatures outside the specifications in Section 2.8 may increase the product AFR (decrease MTBF). AFR and MTBF are population statistics that are not relevant to individual units.

AFR and MTBF specifications are based on the following assumptions for desktop personal computer environments:
• 2400 power-on-hours per year.

Argv! OK, here's what happened. When we design enterprise systems, we use AFR with a 24x7x365 hour year (8760 operation hours/year). A 0.34% AFR using an 8760 hour year is equivalent to an MTBF of 2.5 million hours (really good for a disk). But the disk is spec'ed at 0.7 million hours, which, in my mind, is an AFR of 1.25%, or about half as reliable as enterprise disks. The way they get to the notion that an AFR of 0.34% equates to an MTBF of 0.7 million hours is by changing the definition of operation to 2,400 hours per year (300 8-hour days). The math looks like this:

    24x7x365 operation = 8760 hours/year (also called power-on-hours, POH)

    AFR = 100% * (POH / MTBF)

For an MTBF of 700,000 hours,

    AFR = 100% * (8760 / 700,000) = 1.25%

or, as Seagate specifies for this disk:

    AFR = 100% * (2400 / 700,000) = 0.34%
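If you want to run the same numbers for another data sheet, the formula is easy to script. This is a minimal sketch using awk for the floating-point math; the function names afr_percent and mtbf_hours are my own, not part of any tool.

```shell
#!/bin/sh
# afr_percent POH MTBF
#   AFR = 100% * (POH / MTBF), printed with two decimals.
afr_percent() {
    awk -v poh="$1" -v mtbf="$2" 'BEGIN { printf "%.2f\n", 100 * poh / mtbf }'
}

# mtbf_hours POH AFR_PERCENT
#   The same formula solved for MTBF, rounded to whole hours.
mtbf_hours() {
    awk -v poh="$1" -v afr="$2" 'BEGIN { printf "%.0f\n", 100 * poh / afr }'
}

echo "700,000 h MTBF at 8760 POH/year: $(afr_percent 8760 700000)% AFR"
echo "700,000 h MTBF at 2400 POH/year: $(afr_percent 2400 700000)% AFR"
echo "0.34% AFR at 8760 POH/year: $(mtbf_hours 8760 0.34) h MTBF"
```

The point of keeping POH as an explicit argument is that it forces you to state which definition of a "year" a given AFR claim is using.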

The RAS community has better luck explaining failure rates using AFR rather than MTBF. With AFR you can expect the failures to be a percentage of the population per year. The math is simple and intuitive.  MTBF is not very intuitive and causes all sorts of misconceptions. The lesson here is that AFR can mean different things to different people and can be part of the marketing games people play. For a desktop environment, a large population might see 0.34% AFR with this product (and be happy).  You just need to know the details when you try to compare with the enterprise environments.

Unrecoverable Error on Read (UER) rate is 1e-14 errors/bits read, which is a bit of a disappointment, but consistent with consumer disks.  Enterprise disks usually claim 1e-15 errors/bits read, by comparison. This worries me as the disks are getting bigger because of what it implies.  The product manual says that there is guaranteed to be at least 1,953,525,168 512 byte sectors available.

    Total bits = 1,953,525,168 sectors/disk * 512 bytes/sector * 8 bits/byte = 8e12 bits/disk

If the UER is 1e-14 errors/bits read then you can expect an unrecoverable read error once every 12.5 times you read the entire disk. Not a very pleasant thought, even if you are using a file system which can detect such errors, like ZFS.  Fortunately, field failure data tends to show a better UER than the manufacturers claim.  If you are worried about this sort of thing, I recommend using ZFS.
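The same arithmetic can be scripted to compare the consumer and enterprise UER claims for a given disk size. This is a sketch under the assumptions above (512-byte sectors, UER taken at face value from the data sheet); the function names are hypothetical.

```shell
#!/bin/sh
# bits_on_disk SECTORS [BYTES_PER_SECTOR]
#   Total bits on the disk; the sector size defaults to 512 bytes.
bits_on_disk() {
    awk -v s="$1" -v b="${2:-512}" 'BEGIN { printf "%.0f\n", s * b * 8 }'
}

# full_reads_per_error BITS UER
#   Expected number of whole-disk reads per unrecoverable read error.
full_reads_per_error() {
    awk -v bits="$1" -v uer="$2" 'BEGIN { printf "%.1f\n", 1 / (uer * bits) }'
}

bits=$(bits_on_disk 1953525168)
echo "bits per disk: $bits"
echo "consumer   (1e-14): $(full_reads_per_error "$bits" 1e-14) full reads per error"
echo "enterprise (1e-15): $(full_reads_per_error "$bits" 1e-15) full reads per error"
```

As disks grow, the bits-per-disk term grows with them, so the number of full reads per expected error shrinks unless the UER improves.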

All-in-all, this looks like a nice disk for desktop use. But you should know that in enterprise environments we expect much better reliability specifications.

Friday Jun 06, 2008

on /var/mail and quotas

About every other month or so, someone comes onto the ZFS forum, complains about quotas, and holds up the shared /var/mail directory as an example of where UFS quotas are superior to ZFS quotas. This is becoming very irritating as it makes an assumption about /var/mail which we proved decades ago doesn't scale.  Rather than trying to respond explaining this again and again, I'm blogging about it.  Enjoy.

When we started building large e-mail servers using sendmail in the late 1980s, we ran right into the problem of scaling the mail delivery directory.  Recall that back then relatively few people were on the internet or using e-mail, a 40MHz processor was leading edge, a 200 MByte hard disk was just becoming affordable, RAID was mostly a white paper, and e-mail attachments were very uncommon. It is often limited resources which cause scaling problems, and putting thousands of users into a single /var/mail quickly exposes issues.

Many sites implemented quotas during that era, largely because of the high cost and relative size of hard disks.  The computing models were derived from the timeshare systems (eg UNIX) and that model was being stretched as network computing was evolving (qv rquotad).  A common practice for Sun sites was to mount /var/mail on the NFS clients so that the mail clients didn't have to know anything about the network.

As we scaled, the first, obvious change was to centralize the /var/mail directory. This allowed you to implement a site-wide mail delivery where you could send mail to instead of   This is a cool idea and worked very well for many years. But it wasn't the best solution.

As we scaled some more, and the "administration" demanded quotas, we found that the very nature of distributed systems didn't match the quota model very well. For example, the "administration's" view was that a user may be given a quota of Q for the site. But the site now had many different file systems and a quota only really works on a single file system. We had already centralized everyone onto a single mail store and you needed some quota for the home directory and another subset of Q for the mail store. You also had to try and limit the quota on other home directories because the clever users would discover where the quotas weren't and use all of the space. Back at the mail store, it became increasingly more difficult to manage the space because, as everybody knows, the managers never delete e-mail and they complain loudly when they run out of space. So, quotas in a large, shared directory don't work very well.

The next move was to deliver mail into the user's home directory. This is trivially easy to set up in sendmail (now). In this model, the quota only needs to be set by the user in their home directory and when they run out, you can do work. This solution bought another few years of scalability, but still has its limitations. A particularly annoying limitation is that sending mail to someone who is over quota is not handled very well. And if the sys-admins use mail to tell people they are near quota, then it might not be deliverable (recall, managers don't delete e-mail :-)

There is also a potential problem with mail bombs.  In the sendmail model, each message was copied to each user's mailbox. In the old days, you could implement a policy where sendmail would reject mail messages of a large size. You can still do that today, but before attachments you could put the limit at something small, say 100 kBytes.  There is no way you can do that today. So a mischievous user could send a large mail message to everyone and blow out the /var/mail directory or the quotas.

A better model is to have only one copy of an e-mail message and just use pointers for each of the recipients. But while this model can save large amounts of disk space, it is not compatible with quotas because there is no good way to assign the space to a given user.

The next problem to be solved was the clients. Using an NFS mounted /var/mail worked great for UNIX users, but didn't work very well for PCs (which were now becoming network citizens). The POP and IMAP protocols fixed this problem.

Today mail systems can scale to millions of users, but not by using a shared file system or file system quotas. In most cases, there is a database which contains info on the user and their messages. The messages themselves are placed in a database of sorts and there is usually only one copy of the message. Mail quotas can be easily implemented and the mailer can reply to a sender explaining that the recipient is over mail quota, or whatever.  Automation sends a user a near-quota warning message.  But this is not implemented via file system quotas.

So, please, if you want to describe shared space and file system quotas, find some other example than mail. If you can't find an example, then perhaps we can drop the whole quota argument altogether.

If your "administration" demands that you implement quotas, then you have my sympathy.  Just remind them that you probably have more space in your pocket than quota on the system...

Tuesday Apr 08, 2008

more on holey files

My colleague Christine asked me some questions about my holey files posts. These are really good questions, and I'm just a little surprised that more people didn't ask them... hey, that is what the comments section is for!  So, I thought I would reply publicly, to help stimulate some conversation.

Q1. How could you have a degraded pool and data corruption w/o a repair?  I assume this pool must be raidz or mirror.

A1. No, this was a simple pool, not protected at the pool level. I used the ZFS copies parameter to set the number of redundant data copies to 2. For more information on how copies works, see my post with pictures.

There is another, hidden question here.  How did I install Indiana such that it uses copies=2? By opening a shell and becoming root prior to beginning the install, I was able to set the copies=2 property just after the storage pool was created. By default, it gets inherited by any subsequent file system creation.  Simple as that.  OK, so it isn't that simple.  I've also experimented with better ways to intercept the zpool create, but am not really happy with my hacks thus far.  A better solution is for the installer to pick up a set of properties, but it doesn't, at least for now.

Q2.  Can a striped pool be in a degraded state?  Wouldn't a device faulting in that pool render it unusable and therefore faulted?

A2. Yes, a striped storage pool can be in a degraded state. To understand this, you need to know the definitions of DEGRADED and FAULTED.  Fortunately, they are right there in the zpool manual page.



DEGRADED

One or more top-level vdevs is in the degraded state because one or more component devices are offline. Sufficient replicas exist to continue functioning.

FAULTED

One or more top-level vdevs is in the faulted state because one or more component devices are offline. Insufficient replicas exist to continue functioning.


By default, there are multiple replicas, so for a striped volume it is possible to be in a DEGRADED state. However, I expect that the more common case will be a FAULTED state. In other words, I do tend to recommend a more redundant storage pool: mirror, raidz, raidz2. 

Q3. What does filling the corrupted part with zero do for me?  It doesn't fix it, those bits weren't zero to begin with.

A3. Filling with zeros will just make sure that the size of the "recovered" file is the same as the original. Some applications get to data in a file via a seek to an offset (random access), so this is how you would want to recover the file.  For applications which process files sequentially, it might not matter.

Thursday Mar 13, 2008

dd tricks for holey files

Bob Netherton took a look at my last post on corrupted file recovery (?) and asked whether I had considered using the noerror option to dd. Yes, I did experiment with dd and the noerror option.

The noerror option is described in dd(1) as:

    noerror Does not stop processing on an input error.
            When an input error occurs, a diagnostic mes-
            sage is written on standard error, followed
            by the current input and output block counts
            in the same format as used at completion. If
            the sync conversion is specified, the missing
            input is replaced with null bytes and pro-
            cessed normally. Otherwise, the input block
            will be omitted from the output.

This looks like the perfect solution, rather than my dd and iseek script. But I didn't post this because, quite simply, I don't really understand what I get out of it.

Recall that I had a corrupted file which is 2.9 MBytes in size. Somewhere around 1.1 MBytes into the file, the data is corrupted and fails the ZFS checksum test.

# zpool scrub zpl_slim
# zpool status -v zpl_slim
  pool: zpl_slim
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.

action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.

 scrub: scrub completed after 0h2m with 1 errors on Tue Mar 11 13:12:42 2008
        NAME        STATE     READ WRITE CKSUM
        zpl_slim    DEGRADED     0     0     9
          c2t0d0s0  DEGRADED     0     0     9

errors: Permanent errors have been detected in the following files:
# ls -ls /mnt/root/lib/amd64/
4667 -rwxr-xr-x 1 root bin 2984368 Oct 31 18:04 /mnt/root/lib/amd64/

I attempted to use dd with the noerror flag using several different block sizes to see what I could come up with. Here are those results:

# for i in 1k 8k 16k 32k 128k 256k 512k
> do
dd of=/tmp/whii.$i bs=$i conv=noerror
> done
read: I/O error
1152+0 records in
1152+0 records out
# ls -ls /tmp/whii*
3584 -rw-r--r-- 1 root root 1835008 Mar 13 11:27 /tmp/whii.128k
2464 -rw-r--r-- 1 root root 1261568 Mar 13 11:27 /tmp/whii.16k
2320 -rw-r--r-- 1 root root 1184768 Mar 13 11:27 /tmp/whii.1k
4608 -rw-r--r-- 1 root root 2359296 Mar 13 11:27 /tmp/whii.256k
2624 -rw-r--r-- 1 root root 1343488 Mar 13 11:27 /tmp/whii.32k
7168 -rw-r--r-- 1 root root 3670016 Mar 13 11:27 /tmp/whii.512k
2384 -rw-r--r-- 1 root root 1220608 Mar 13 11:27 /tmp/whii.8k

hmmm... all of these files are of different sizes, so I'm really unsure what I've ended up with. None of them are the same size as the original file, which is a bit unexpected.

# dd of=/tmp/whaa.1k bs=1k conv=noerror
read: I/O error
1152+0 records in
1152+0 records out
read: I/O error
1153+0 records in
1153+0 records out
read: I/O error
1154+0 records in
1154+0 records out
read: I/O error
1155+0 records in
1155+0 records out
read: I/O error
1156+0 records in
1156+0 records out
read: I/O error
1157+0 records in
1157+0 records out
# ls -ls /tmp/whaa.1k
2320 -rw-r--r-- 1 root root 1184768 Mar 13 11:12 /tmp/whaa.1k

hmmm... well, dd did copy some of the file, but seemed to give up after around 5 attempts and I only seemed to get the first 1.1 MBytes of the file. What is going on here? A quick look at the dd source (open source is a good thing) shows that there is a definition of BADLIMIT which is how many times dd will try before giving up. The default compilation sets BADLIMIT to 5. Aha! A quick download of the dd code and I set BADLIMIT to be really huge and tried again.

# bigbaddd of=/tmp/whbb.1k bs=1k conv=noerror
read: I/O error
1152+0 records in
1152+0 records out
read: I/O error
3458+0 records in
3458+0 records out
^C I give up
# ls -ls /tmp/whbb.1k
6920 -rw-r--r-- 1 root root 3543040 Mar 13 11:47 /tmp/whbb.1k

As dd processes the input file, it doesn't really do a seek, so it can't really get past the corruption. It is getting something, because od shows that the end of the whbb.1k file is not full of nulls. But I really don't believe this is the data in a form which could be useful. And I really can't explain why the new file is much larger than the original. I suspect that dd gets stuck at the corrupted area and does not seek beyond it. In any case, it appears that letting dd do the dirty work by itself will not achieve the desired results. This is, of course, yet another opportunity...
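One pattern in the listings above is worth noting, offered as an observation rather than an explanation. Plain dd in the previous post stopped after 1179648 bytes, which suggests that is where the bad block starts. Each whii.* size equals the number of whole blocks before that offset, plus 5 (the default BADLIMIT), times the block size. That would be consistent with dd writing one output block for each failed read before giving up, and with the patched dd continuing to grow the output until interrupted. The short check below reproduces the sizes; the function name is made up for illustration.

```shell
#!/bin/sh
# Offset where plain dd stopped in the previous post (2304 records of 512 bytes).
bad_offset=1179648
# Default BADLIMIT reported in the dd source.
badlimit=5

# predicted_size BLOCKSIZE_BYTES
#   (whole blocks before the bad offset + BADLIMIT) * block size
predicted_size() {
    echo $(( ($bad_offset / $1 + $badlimit) * $1 ))
}

for bs in 1024 8192 16384 32768 131072 262144 524288
do
    echo "bs=$bs predicted=$(predicted_size "$bs")"
done
```

Comparing these predictions with the ls output above, all seven sizes line up, which at least narrows down where the extra bytes come from.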

Wednesday Mar 12, 2008

Holy smokes! A holey file!

I was RASing around with ZFS the other day, and managed to find a file which was corrupted.

# zpool scrub zpl_slim
# zpool status -v zpl_slim
  pool: zpl_slim
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.

action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
 scrub: scrub completed after 0h2m with 1 errors on Tue Mar 11 13:12:42 2008
        NAME        STATE     READ WRITE CKSUM
        zpl_slim    DEGRADED     0     0     9
          c2t0d0s0  DEGRADED     0     0     9

errors: Permanent errors have been detected in the following files:

# ls -ls /mnt/root/lib/amd64/
4667 -rwxr-xr-x 1 root bin 2984368 Oct 31 18:04 /mnt/root/lib/amd64/

argv! Of course, this particular file is easily extracted from the original media; it doesn't contain anything unique. For those who might be concerned that it is the C runtime library, and thus very critical to running Solaris, the machine in use is only 32-bit, so the 64-bit (amd64) version of this file is never used. But suppose this were an important file for me and I wanted to recover something from it? This is a more interesting challenge...

First, let's review a little bit about how ZFS works. By default, when ZFS writes anything, it generates a checksum which is recorded someplace else, presumably safe. Actually, the checksum is recorded at least twice, just to be doubly sure it is correct. And that record is also checksummed. Back to the story, the checksum is computed on a block, not for the whole file. This is an important distinction which will come into play later. If we perform a storage pool scrub, ZFS will find the broken file and report it to you (see above), which is a good thing -- much better than simply ignoring it, like many other file systems will do.

OK, so we know that somewhere in the midst of this 2.8 MByte file, we have some corruption. But can we at least recover the bits that aren't corrupted? The answer is yes. But if you try a copy, then it bails with an error.

# cp /mnt/root/lib/amd64/ /tmp
/mnt/root/lib/amd64/ I/O error

Since the copy was not successful, there is no destination file, not even a partial file. It turns out that cp uses mmap(2) to map the input file and copies it to the output file with a big write(2). Since the write doesn't complete correctly, it complains and removes the output file. What we need is something less clever, dd.

# dd if=/mnt/root/lib/amd64/ of=/tmp/whee
read: I/O error
2304+0 records in
2304+0 records out
# ls -ls /tmp/whee
2304 -rw-r--r-- 1 root root 1179648 Mar 12 18:53 /tmp/whee

OK, from this experiment we know that we can get about 1.2 MBytes by directly copying with dd. But this isn't all, or even half of the file. We can get a little more clever than that. To make it simpler, I wrote a little ksh script:

integer i=0
while ((i < 23))
do
    # zero-padded, two-digit suffix so the pieces sort in block order
    typeset -RZ2 j=$i
    dd if=$1 of=$2.$j bs=128k iseek=$i count=1
    ((i += 1))
done

This script writes each of the first 23 128-kByte blocks of the file named by the first argument to its own output file, named by appending the block number to the second argument. dd is really dumb and doesn't offer much error handling, which is why I hardwired the count into the script. An enterprising soul with a little bit of C programming skill could do something more complex which handles the more general case. OK, that was difficult to understand, and I wrote it. To demonstrate, and I apologize in advance for the redundant verbosity:

# ./getaround.ksh /tmp/zz
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
read: I/O error
0+0 records in
0+0 records out

1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
0+1 records in
0+1 records out
# ls -ls /tmp/zz.*
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.00
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.01
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.02
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.03
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.04
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.05
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.06
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.07
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.08
   0 -rw-r--r-- 1 root root      0 Mar 12 19:00 /tmp/zz.09
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.10
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.11
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.12
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.13
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.14
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.15
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.16
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.17
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.18
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.19
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.20
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.21
 200 -rw-r--r-- 1 root root 100784 Mar 12 19:00 /tmp/zz.22

So we can clearly see that the 10th (128kByte) block is corrupted, but the rest of the blocks are ok. We can now reassemble the file with a zero-filled block.

# dd if=/dev/zero of=/tmp/zz.09 bs=128k count=1
1+0 records in
1+0 records out
# cat /tmp/zz.* > /tmp/zz
# ls -ls /tmp/zz
5832 -rw-r--r-- 1 root root 2984368 Mar 12 19:03 /tmp/zz

Now I have recreated the file with a zero-filled hole where the data corruption was. Just for grins, if you try to compare with the previous file, you should get what you expect.

# cmp /tmp/zz+
cmp: EOF on
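The split, zero-fill, and reassemble steps above can be folded into one pass that works out the block count from the file size. This is a sketch, not a proven recovery tool: the salvage function name is made up, it uses the POSIX skip= operand where the script above uses the Solaris iseek=, and it assumes that ls -ln can report the size of the damaged file without reading its data.

```shell
#!/bin/sh
# salvage INPUT OUTPUT [BLOCKSIZE]
#   Copy INPUT to OUTPUT one block at a time. A block that cannot be
#   read in full is replaced by the same number of zero bytes, so every
#   readable byte keeps its original offset.
salvage() {
    in=$1
    out=$2
    bs=${3:-131072}                     # 128 kBytes, as in the script above
    piece=$out.piece                    # scratch file for one block

    # Size from the directory entry (fifth column of ls -ln).
    size=$(ls -ln "$in" | awk '{ print $5 }')

    : > "$out"
    i=0
    while [ $((i * bs)) -lt "$size" ]
    do
        # The final block may be shorter than bs.
        len=$((size - i * bs))
        if [ "$len" -gt "$bs" ]; then len=$bs; fi

        if dd if="$in" of="$piece" bs="$bs" skip="$i" count=1 2>/dev/null &&
           [ "$(($(wc -c < "$piece")))" -eq "$len" ]
        then
            cat "$piece" >> "$out"
        else
            echo "block $i unreadable, zero-filling $len bytes" >&2
            dd if=/dev/zero bs="$len" count=1 2>/dev/null >> "$out"
        fi
        i=$((i + 1))
    done
    rm -f "$piece"
}

# Example (paths are placeholders):
#   salvage /path/to/damaged/file /tmp/zz
```

The block size should match the file system's record size (128 kBytes here), because the checksum is per block: a smaller size would only zero-fill the same region in more steps, and a larger one would throw away readable data along with the bad block.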

How is this useful?

Personally, I'm not sure this will be very useful for many corruption cases. As a RAS guy, I advocate many verified copies of important data placed on diverse systems and media. But most folks aren't so inclined. Every time we talk about this on the zfs-discuss alias, somebody will say that they don't care about corruption in the middle of their mp3 files. I'm no audiophile, but I prefer my mp3s to be hole-less. So I did this little exercise to show how you can regain full access to the non-corrupted bits of a corrupted file in a more-or-less easy way. Consider this a proof of concept. There are many possible variations, such as filling with spaces instead of nulls when you are missing parts of a text file -- opportunities abound.

Wednesday Oct 03, 2007

Adaptec webinar on disks and error handling

Adaptec has put together a nice webinar called Nearline Data Drives and Error Handling. If you work with disks or are contemplating building your own home data server, I recommend that you take 22 minutes to review the webinar. As a systems vendor, we are often asked why we made certain design decisions to favor data over costs, and I think this webinar does a good job of showing how some of the complexity of systems design covers a large number of decision points.  Here in the RAS Engineering group we tend to gravitate towards the best reliability and availability of systems, which still requires a staggering number of design trade-offs.  Rest assured that we do our best to make these decisions with your data in mind.

For the ZFSers in the world, this webinar also provides some insight into how RAID systems like ZFS are designed, and why end-to-end data protection is vitally important.

Enjoy!  And if you don't want your Starbucks gift card, send it to me :-)

Tuesday Sep 18, 2007

Space Maps from Space

Jeff Bonwick recently blogged about why ZFS uses space maps for keeping track of allocations. In my recent blog on looking at ZFS I teased you with a comment about the space map floating near the Channel Islands. Now that Jeff has explained how they work, I'll show you what they look like as viewed from space.

 Space map

 This is a view of a space map for a ZFS file system which was created as a recursive copy of the /usr directory followed by a recursive remove of the /usr/share directory. This allows you to see how some space is allocated and some space is free.

I wrote an add-on to NASA's Worldwind to parse zdb output looking for the space map information. Each space map entry appears as a rectangle with a starting offset and length mapped onto a square field floating above the earth. The allocations are green and the frees are yellow.  The frees are also floating 100m above the allocations, though it is not easy to see from this view. Each map entry also has an optional user-facing icon which shows up as a shadowed green or yellow square. I snagged these from the StarOffice bullets images. If you hover the mouse over an icon, then a tool tip will appear showing the information about the space.  In this example, the tooltip says "Free, txg=611, pass=1, offset=53fe000, size=800"
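To make the tooltip a little more concrete, here is a small sketch that pulls the fields out of that text and converts them to decimal. It assumes that size is hexadecimal like offset (the offset clearly is, since it contains the digits f and e); the helper names are made up and are not part of the add-on.

```shell
#!/bin/sh
# Tooltip text quoted above.
tip="Free, txg=611, pass=1, offset=53fe000, size=800"

# field NAME TEXT: print the value that follows NAME= in TEXT.
field() {
    printf '%s\n' "$2" | sed -n "s/.*$1=\([0-9a-f]*\).*/\1/p"
}

# hex_to_dec HEXDIGITS: convert to decimal using shell arithmetic.
hex_to_dec() {
    echo $((0x$1))
}

offset=$(hex_to_dec "$(field offset "$tip")")
size=$(hex_to_dec "$(field size "$tip")")
echo "free range: $offset to $((offset + size)) ($size bytes)"
```

Under that assumption, this entry describes a 2048-byte free range starting about 84 MBytes into the device.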

I can think of about a half dozen cool extensions to make for this, such as showing metaslab boundaries.  I also need to trim the shadow field to fit; it extends too far on the right.  So much to do, so little time...


Wednesday Aug 29, 2007

ZFS I/Os in motion

I'm walking a line - I'm thinking about I/O in motion
I'm walking a line - Just barely enough to be living
Get outta the way - No time to begin
This isn't the time - So nothing was biodone
Not talking about - Not many at all
I'm turning around - No trouble at all
You notice there's nothing around you, around you
I'm walking a line - Divide and dissolve.

[theme song for this post is Houses in Motion by the Talking Heads]

Previously, I mentioned a movie. OK, so perhaps it isn't a movie, but an animated GIF.


This is a time-lapse animation of some of the data shown in my previous blog on ZFS usage of mirrors. Here we're looking at one second intervals and the I/O to the slow disk of a two-disk mirrored ZFS file system. The workload is a recursive copy of the /usr/share directory into this file system.

The yellow areas on the device field are write I/O operations. For each time interval, the new I/O operations are shown with their latency elevators. Shorter elevators mean lower latency. Green elevators mean the latency is 10ms or less, yellow until 25ms, and red beyond 25ms. This provides some insight into the way the slab allocator works for ZFS. If you look closely, you can also see the redundant uberblock updates along the lower-right side near the edge. If you can't see that in the small GIF, click on the GIF for a larger version which is easier to see.
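The color rule for the elevators is simple enough to write down. This is a small Python sketch of the thresholds described above, not the code from the Worldwind layers; the function name `latency_color` is hypothetical.

```python
def latency_color(latency_ms):
    """Pick the elevator color for one I/O from its latency in milliseconds."""
    if latency_ms <= 10:
        return "green"   # 10ms or less
    if latency_ms <= 25:
        return "yellow"  # more than 10ms, up to 25ms
    return "red"         # beyond 25ms


for sample in (2.5, 18.0, 87.85):
    print(sample, latency_color(sample))
```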

ZFS makes redundant copies of the metadata. By preference, these will be placed in a different slab. You can see this in the animation as there are occasionally writes further out than the bulk of the data writes. As the disk begins to fill, the gaps become filled.  Interestingly, the writes to the next slab (metadata) do not have much latency - they are in the green zone. This is a simple IDE disk, so there is a seek required by these writes. This should help allay one of the fears about ZFS, that the tendency to have data spread out will be a performance problem - I see no clear evidence of that here. 

I have implemented this as a series of Worldwind layers. This isn't really what Worldwind was designed to do, so there are some inefficiencies in the implementation, or it may be that there is still some trick I have yet to learn.  But it is functional in that you can see I/Os in motion.

Looking at ZFS

A few months ago, I blogged about why I wasn't at JavaOne and mentioned that I was looking at some JOGL code. Now I'm ready to show you some cool pictures which provide a view into how ZFS uses disks.

The examples here show a mirrored disk pair. I created a mirrored zpool and used the default ZFS settings. I then did a recursive copy of /usr/share into the ZFS file system. This is a write-mostly workload.

There are several problems with trying to visualize this sort of data:

  1. There is a huge number of data points.  A 500 GByte disk has about a billion blocks.  Mirror that and you are trying to visualize two billion data points. My workstation screen size is only 1.92 million pixels (1600x1200) so there is no way that I could see this much data.
  2. If I look at an ASCII table of this data, then it may be hundreds of pages long.  Just for fun, try looking at the output of zdb -dddddd to get an idea of how the data might look in ASCII. But I'll warn you in advance: try this only on a small zpool located on a non-production system.
  3. One dimensional views of the data are possible.  Actually, this is what zdb will show for you.  There is some reasoning here because a device is accessed as a single set of blocks using an offset and size for read or write operations. But this doesn't scale well, especially to a billion data points.
  4. Two dimensional views are also possible, where we basically make a two dimensional array of the one dimensional data.  This does hide some behaviour: disks are two dimensional, but they are stacks of circles of different sizes rather than a flat rectangular array.  These physical details are cleverly hidden and subject to change on a per-case basis.  So, perhaps we can see some info in two dimensions that would help us understand what is happening.
  5. Three dimensional views can show even more data.  This is where JOGL comes in: it is a 3-D library for Java.
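The arithmetic in the first item, and the one-dimension-to-two-dimension fold in the fourth, can be sketched in a few lines of Python. The 512-byte block size, the helper names, and the choice of a square grid are assumptions for illustration; this is not how the visualization itself lays out the data.

```python
import math

BLOCK_SIZE = 512  # assumption: 512-byte disk blocks


def block_count(disk_bytes, block_size=BLOCK_SIZE):
    """Number of whole blocks on a disk of the given size."""
    return disk_bytes // block_size


def fold_to_grid(block_index, width):
    """Map a 1-D block index to (row, column) in a grid `width` blocks wide."""
    return divmod(block_index, width)


blocks = block_count(500 * 10**9)   # a 500 GByte disk
pixels = 1600 * 1200                # the workstation screen
side = math.isqrt(blocks - 1) + 1   # smallest square that holds every block

print(blocks, "blocks per disk")
print(pixels, "pixels on screen")
print(2 * blocks / pixels, "mirrored blocks per pixel")
print(side, "blocks per side of the square")
print(fold_to_grid(202688, side), "row and column of block 202688")
```

With these assumptions, each screen pixel would have to stand in for roughly a thousand blocks of the mirror, which is why zooming and panning matter.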

It is clear that some sort of 3-D visualization system could help provide some insight into this massive amount of data.  So I did it.

Where is the data going? 

mirrored write iops

This is a view of the two devices in the mirror after they have been filled by the recursive copy. Yellow blocks indicate write operations, green blocks are read operations.  Since this was a copy into the file system, there aren't very many reads. I would presume that your browser window is not of sufficient resolution to show the few, small reads anyway, so you'll just have to trust me.

What you should be able to see, even at a relatively low resolution, is that we're looking at a 2-D representation of each device from a 3-D viewpoint. Zooming, panning, and moving the viewpoint allows me to observe more or less detail.

To gather this data, I used TNF tracing.  I could also write a dtrace script to do the same thing. But I decided to use TNF data because it has been available since Solaris 8 (7-8 years or so) and I have an archive of old TNF traces that I might want to take a look at some day. So what you see here are the I/O operations for each disk during the experiment.

How long did it take?  (Or, what is the latency?)

The TNF data also contains latency information.  The latency is measured as the difference in time between the start of the I/O and its completion. Using the 3rd dimension, I put the latency in the Z-axis.


Ahhh... this view tells me something interesting. The latency is shown as a line rising from the starting offset of the block being written. You can see some regularity over the space as ZFS will coalesce writes into 128 kByte I/Os. The pattern is more clearly visible on the device on the right.

 But wait! What about all of the red?  I color the latency line green when the latency is less than 10ms, yellow until 25ms, and red for latency > 25ms.  The height of the line is a multiple of its actual latency.  Wow!  The device on the left has a lot of red, it sure looks slow.  And it is.  On the other hand, the device on the right sure looks fast.  And it is. But this view is still hard to see, even when you can fly around and look at it from different angles. So, I added some icons...

I put icons at the top of the line. If I hover the mouse over an icon, it will show a tooltip which contains more information about that data point. In this case, the tooltip says, "Write, block=202688, size=64, flags=3080101, time=87.85"  The size is in blocks, the flags are defined in a header file somewhere, and the time is latency in milliseconds.  So we wrote 32 kBytes at block 202,688 in 87.85 ms.  This is becoming useful!  By cruising around, it becomes apparent that for this slow device, small writes are faster than large writes, which is pretty much what you would expect.
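The conversion from the tooltip values to "32 kBytes in 87.85 ms" is worth spelling out. Here is a minimal Python sketch, assuming 512-byte blocks (which is what makes 64 blocks come out to 32 kBytes); the function name `describe_write` is hypothetical, and the throughput figure is a derived illustration rather than something the tooltip reports.

```python
def describe_write(size_blocks, latency_ms, block_size=512):
    """Return (bytes written, effective bytes per second) for one I/O.

    Assumption: the tooltip's size is a count of 512-byte blocks.
    """
    size_bytes = size_blocks * block_size
    throughput = size_bytes / (latency_ms / 1000.0)
    return size_bytes, throughput


size_bytes, throughput = describe_write(64, 87.85)
print(size_bytes // 1024, "kBytes written")
print(round(throughput / 1024), "kBytes/s effective for this single write")
```

A few hundred kBytes per second for a single write is consistent with all of the red on the slow device.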

Finding a place in the world

Now for the kicker.  I implemented this as an add-on to NASA's Worldwind.



 I floated my devices at 10,000 m above the ocean off the west coast of San Diego! By leveraging the Worldwind for Java SDK, I was able to implement my visualization by writing approximately 2,000 lines of code. This is a pretty efficient way of extending a GIS tool into non-GIS use, while leveraging the fact that GIS tools are inherently designed to look at billions of data points in 3-D.

More details of the experiment

The two devices are intentionally very different from a performance perspective. The device on the left is an old, slow, relatively small IDE disk. The device on the right is a ramdisk. 

I believe that this technique can lead to a better view of how systems work under the covers, even beyond disk devices.  I've got some cool ideas, but not enough days in the hour to explore them all.  Drop me a line if you've got a cool idea. 

The astute observer will notice another view of the data just to the north of the devices. This is the ZFS space map allocation of one of the mirror vdevs. More on that later... I've got a movie to put together...


Tuesday May 08, 2007

Who's not at JavaONE?

I'm not going to JavaONE this year.  And I'm a little sad about that, but I'll make it one day.  I was part of the very first JavaONE prelude.  "Prelude" is the operative term because at that time Sun employees were actively discouraged from attending.  This was a dain bramaged policy which has since been fixed, but at the time the idea was to fill the event with customers and developers.  Now it is just full of every sort of person.  Y'all have fun up there!

So to celebrate this year's JavaONE, I'm learning JOGL.   Why?  Well, I've got some data that I'd like to visualize and I've not found a reasonable tool for doing it in an open, shareable manner.  Stay tuned...



