Tuesday May 27, 2008

OpenSolaris Home Server: ZFS and USB Disks

[Photo: My home server with a couple of USB disks]

A couple of weeks ago, OpenSolaris 2008.05, project Indiana, saw its first official release. I've been looking forward to this moment so I can upgrade my home server and work laptop and start benefiting from the many cool features. If you're running a server at home, why not use the best server OS on the planet for it?

This is the first in a small series of articles about using OpenSolaris for home server use. I did a similar series some time ago and got a lot of good and encouraging feedback, so this is an update, or a remake, or home server 2.0, if you will.

I'm not much of a PC builder, but Simon has posted his experience with selecting hardware for his home server. I'm sure you'll find good tips there. In my case, I'm still using my trusty old Sun Java W1100z workstation, running in my basement. And for storing data, I like to use USB disks.

USB disk advantages

This is the moment when people start giving me those "Yeah, right" or "Are you serious?" looks. But USB disk storage has some real advantages:

  • It's cheap. About 90 Euros for half a TB of disk from a major brand. Can't complain about that.
  • It's hot-pluggable. What happens if your server breaks and you want to access your data? With USB it's as easy as unplug from broken server, plug into laptop and you're back in business. And there's no need to shut down or open your server if you just want to add a new disk or change disk configuration.
  • It scales. I have 7 disks running in my basement. All I needed to make them work with my server was a cheap 15 EUR 4-port USB card to expand my existing 5 USB ports. I still have 3 PCI slots left, so I could add 12 more disks at full USB 2.0 speed if I wanted.
  • It's fast enough. I measure about 10 MB/s in write performance with a typical USB disk. That's about as fast as you can get over the 100 Mbit/s LAN that most people use at home. As long as the network remains the bottleneck, USB disk performance is not the problem.
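If you want to sanity-check that throughput figure on your own hardware, a timed sequential write with dd is a quick approximation. The target path below is an assumption; point it at a file on the USB disk's filesystem (never at the raw device, which would destroy data):

```shell
# Time a 100 MB sequential write; dividing 100 MB by the elapsed time
# gives a rough write-throughput figure.
# /tmp/ddtest is a placeholder -- substitute a path on the USB disk.
TARGET=/tmp/ddtest
time dd if=/dev/zero of=$TARGET bs=1024k count=100
rm $TARGET
```

Note that filesystem caching can flatter the result; writing a file several times larger than RAM gives a more honest number.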

ZFS and USB: A Great Team

But this is not enough. The beauty of USB disk storage lies in its combination with ZFS. When adding some ZFS magic to the above, you also get:

  • Reliability. USB disks can be mirrored or used in a RAID-Z/Z2 configuration. Each disk may be unreliable (because they're cheap) individually, but thanks to ZFS' data integrity and self-healing properties, the data will be safe and FMA will issue a warning early enough so disks can be replaced before any real harm can happen.
  • Flexibility. Thanks to pooled storage, there's no need to wonder what disks to use for what and how. Just build up a single pool with the disks you have, then assign filesystems to individual users, jobs, applications, etc. on an as-needed basis.
  • Performance. Suppose you upgrade your home network to Gigabit Ethernet. No need to worry: The more disks you add to the pool, the better your performance will be. Even if the disks are cheap.
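As a sketch of how pooled storage plays out in practice, here is how a pool of two mirrored pairs might be created and carved up. The pool, filesystem, and device names are illustrative; use format(1M) or rmformat(1) to find your own device names:

```shell
# Create one pool from two mirrored pairs of USB disks.
zpool create tank mirror c10t0d0 c11t0d0 mirror c12t0d0 c13t0d0

# Hand out filesystems on an as-needed basis; they all draw from
# the pool's shared free space.
zfs create tank/home
zfs create tank/media
zfs set quota=100g tank/home   # optional: cap a filesystem's share
```

Adding another mirrored pair later (`zpool add tank mirror ...`) grows every filesystem's available space at once, which is where the performance point above comes from: more spindles, more parallel I/O.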

Together, USB disks and ZFS make a great team. Not enterprise class, but certainly an interesting option for a home server.

ZFS & USB Tips & Tricks

So here's a list of tips, tricks and hints you may want to consider when daring to use USB disks with OpenSolaris as a home server:

  • Mirroring vs. RAID-Z/Z2: RAID-Z (or its more reliable cousin RAID-Z2) is tempting: You get more space for less money. In fact, my earlier zpools at home were combinations of RAID-Z'ed leftover slices, built to squeeze as much space as possible out of my mixed disk collection at some level of reliability.
    But say you have a 3+1 RAID-Z and want to add some more space. Would you buy 4 disks at once? Isn't that a bit big, granularity-wise?
    That's why I decided to keep it simple and just mirror. USB disks are cheap enough; no need to be even cheaper. My current zpool has a pair of 1 TB USB disks and a pair of 512 GB USB disks and works fine.
    Another advantage of this approach is that you can organically modernize your pool: Wait until one of your disks starts showing some flakiness (FMA and ZFS will warn you as soon as the first broken data block has been repaired). Then replace the disk with a bigger one, then its mirror with one of the same, bigger size. That will give you more space without the complexity of too many disks, and keep them young enough not to be a serious threat to your data. Use the replaced disks for scratch space or less important tasks.
  • Instant replacement disk: A few weeks ago, one of my mirrored disks showed its first write error. It was a pair of 320GB disks, so I ordered a 512GB replacement (with the plan to order the second one later). But now, my mirror may be vulnerable: What if the second disk starts breaking before the replacement has arrived?
    That's why having a few old but functional disks around can be very valuable: In my case, I took a 200GB and a 160GB disk and combined them into their own zpool:
    zpool create temppool c11t0d0 c12t0d0
    Then, I created a new ZVOL sitting on the new pool:
    zfs create -sV 320g temppool/tempvol
    Here's our temporary replacement disk! I then attached it to my vulnerable mirror:
    zpool attach santiago c10t0d0 /dev/zvol/dsk/temppool/tempvol
    And voilà, my precious production pool started resilvering the new virtual disk. After the new disk arrived and was resilvered, the temporary disk could be detached, destroyed, and its space put to some other good use.
    Storage virtualization has never been so easy!
  • Don't forget to scrub: Especially with cheap USB disks, regular scrubbing is important. Scrubbing will check each and every block of your data on disk and make sure it's still valid. If not, it will repair it (since we're mirroring or using RAID-Z/Z2) and tell you what disk had a broken block so you can decide whether it needs to be replaced or not just yet.
    How often you should scrub depends on how much you trust your hardware and how much of your data is being read out anyway (any data that is read is automatically checked, so that particular portion is already "scrubbed", if you will). I find scrubbing once every two weeks a useful cycle; others may prefer once a month or once a week.
    But scrubbing is a process that needs to be initiated by the administrator. It doesn't happen by itself, so it is important that you issue the "zpool scrub" command regularly, or better yet, set up a cron job so it happens automatically.
    As an example, the following line:
    23 01 1,15 * * for i in `zpool list -H -o name`; do zpool scrub $i; done
    in your crontab will start a scrub for each of your zpools twice a month on the 1st and the 15th at 01:23 AM.
  • Snapshot often: Snapshots are cheap, but they can save the world if you accidentally delete that important file. Same rule as with scrubbing: Do it. Often enough. Automatically. Tim Foster did a great job of implementing an automatic ZFS snapshot service, so why don't you just install it now and set up a few snapshot schemes for your favourite ZFS filesystems?
    The home directories on my home server are snapshotted once a month (and all snapshots are kept), once a week (keeping 52 snapshots) and once a day (keeping 31 snapshots). This gives me a time-machine with daily, weekly and monthly granularities depending on how far back in time I want to travel through my snapshots.
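If you'd rather see what such a scheme does under the hood, the daily rotation can be sketched in a few lines of shell. The filesystem name and retention count below are assumptions; the real thing is Tim Foster's auto-snapshot SMF service:

```shell
#!/bin/sh
# Sketch: take a dated daily snapshot of one filesystem and prune old ones.
FS=tank/home   # assumption: adjust to your own filesystem
KEEP=31        # number of daily snapshots to retain

zfs snapshot "$FS@daily-$(date +%Y-%m-%d)"

# List daily snapshots oldest-first and destroy all but the newest $KEEP.
SNAPS=$(zfs list -H -t snapshot -o name -s creation | grep "^$FS@daily-")
COUNT=$(echo "$SNAPS" | wc -l)
if [ "$COUNT" -gt "$KEEP" ]; then
    echo "$SNAPS" | head -n $(expr "$COUNT" - "$KEEP") | \
        while read snap; do zfs destroy "$snap"; done
fi
```

Run it from cron once a day, alongside weekly and monthly variants with their own prefixes and retention counts, and you get the same time-machine effect.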

So, USB disks aren't bad. In fact, thanks to ZFS, USB disks can be very useful building blocks for your own little cost-effective but reliable and integrity-checked data center.

Let me know what experiences you've had using USB storage at home or with ZFS, and what tips and tricks have worked well for you. Just enter a comment below or send me email!

Tuesday Apr 10, 2007

A lesson in patience...

About a week ago, I bought myself two new 320GB external USB disks. They are destined to become my new home storage infrastructure, of course based on OpenSolaris and ZFS.

The first disk was recognized by Solaris with no problem: devfsadm(1M), then rmformat(1) say:

     3. Logical Node: /dev/rdsk/c7t0d0p0
        Physical Node: /pci@0,0/pci1022,7460@6/pci108e,534d@3,2/storage@1/disk@0,0
        Connected Device: WD       3200JB External  0108
        Device Type: Removable
        Bus: USB
        Size: 305.2 GB
        Label: <None>
        Access permissions: Medium is not write protected.

Fine. Then I attached the second disk, but devfsadm would not come back. At least not after a minute or so. Scratching my head, I turned off the drive and then /var/adm/messages logged:

Apr  6 10:33:26 condorito devfsadmd[143]: [ID 937045 daemon.error] failed to
lookup dev name for /pci@0,0/pci1022,7460@6/pci108e,534d@3,2/storage@2/disk@0,0

Hmm. It got worse when I tried rebooting with the drive attached: Solaris wouldn't make it past the initial boot message. Hmm. Is the disk broken? Is there a problem with the Solaris USB or disk drive kernel support? And why just the second disk and not the first? They're identical, aren't they?

This is what the USB subsystem knows about the drives. Drive 1, the good one looks like this:

Apr  9 09:55:04 condorito usba: [ID 349649 kern.info]   Western Digital External
HDD 5743414C3733393036383035

But the second, the bad one looks like this:

Apr  9 09:55:44 condorito usba: [ID 349649 kern.info]   Western Digital External
HDD 57442D5743414D5234313836363532

Has Western Digital changed their serial numbering scheme right in the middle of the same product? And why is that a problem for Solaris?

Both drives worked fine with my PowerBook and with my Ferrari 4000 laptop running Solaris, so why couldn't my home machine (a Sun Java Workstation W1100z) cope with this?

As a last resort, I decided to try devfsadm again, this time waiting for however long it would take. Maybe it would come up with a more useful error message. And voilà, after 14 minutes(!) it did come back, and everything was fine. This time, rmformat correctly recognized the second disk as well:

     1. Logical Node: /dev/rdsk/c4t0d0p0
        Physical Node: /pci@0,0/pci1022,7460@6/pci108e,534d@3,2/storage@2/disk@0,0
        Connected Device: WD       3200JB External  0108
        Device Type: Removable
        Bus: USB
        Size: 305.2 GB
        Label: <None>
        Access permissions: Medium is not write protected.

Running devfsadm again takes just seconds, so something inside the device tree has probably been updated so that everything is fine again. I still don't know what it was, and I'd be glad if someone could explain what happened, but I'm sure glad I can now continue migrating my current ZFS filesystems onto the new mirrored ZFS pool...

Lesson learned: If something takes longer than expected, it might still not be broken. Just be patient...

Friday Mar 09, 2007

CSI:Munich - How to save the world with ZFS and 12 USB Sticks

Here's a fun video that shows how cool the Sun Fire X4500 (codename: Thumper) is and how you can create your own Thumper experience on your laptop using Solaris ZFS and 12 USB sticks:

This is finally the English-dubbed version of a German video that a couple of colleagues and I produced some weeks ago. If you don't mind the German language, you might enjoy the original German version, too. (It turns out that English has a lot less redundancy than German, so please forgive the occasional soundless lip movements.)

If you liked the video(s), let us know, we'll be glad to answer any questions, receive any leftover Oscars or accept any new ideas for future episodes.

Here are a few more details, in case you really want to try this at home:

The first hurdle to overcome is to teach Solaris how to accept more than 10 USB storage devices. On a plain vanilla Solaris 10 system, it turns out that there is a limitation: Connecting more than 10 USB sticks through 3 USB-powered Hubs yields a Connecting device on port n failed error. Thanks to a colleague from engineering, the fix is to set ehci:ehci_qh_pool_size = 120 in /etc/system.

The second issue is briefly explained in the video itself: Not all USB sticks (particularly the cheap ones) are created equal. Small variations in the components create small variations in their storage space. So, when creating a zpool, you need to use -f to tell zpool to ignore differing device sizes.
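For the curious, the pool from the video could be recreated along these lines. The device names and the layout (three RAID-Z groups of four sticks each) are illustrative, not a transcript of what we actually typed:

```shell
# -f overrides zpool's complaint about the sticks' slightly differing sizes.
zpool create -f thumper \
    raidz c2t0d0  c3t0d0  c4t0d0  c5t0d0 \
    raidz c6t0d0  c7t0d0  c8t0d0  c9t0d0 \
    raidz c10t0d0 c11t0d0 c12t0d0 c13t0d0

# Verify the layout and health of the new pool.
zpool status thumper
```

With plain RAID-Z, each group survives the loss of only one stick, which is exactly why pulling a whole hub in the video was risky.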

If you pay close attention to the video, you'll notice around 7:20 that pulling a hub wasn't so harmless after all: "errors: 8 data errors, use '-v' for a list" can be seen at the bottom of the terminal window. In fact, zpool status reports 6 checksum errors on c21t0d0p0. Using cheap USB sticks means that block errors do occur in practice, and once you don't have enough redundancy (like after unplugging a USB hub for show effect) they may hurt you. Fortunately, they didn't hurt our particular demo: On one hand, ZFS' prefetch algorithm had most of the video in memory anyway; on the other hand, zpool scrub fixed any broken blocks after the USB hub was plugged back in. So: the cheaper the storage, the more redundancy you should add. In this case, RAID-Z2 would have been better. Perhaps we can get some more USB sticks and hubs from any sponsors?

Finally, it took us a couple of retries until the remove-sticks-mix-then-replug stunt worked, because it turned out that the laptop's USB implementation wasn't as reliable as we needed it to be. And yes, it does help to wait until the sticks have finished blinking before removing any of them :).

All in all, it was great fun for us producing this video, and thanks to the tireless efforts of Marc, our beloved but invisible video editor, we can now proudly present an English version. Actually, we were quite surprised by this video's success: We published it in early February, and just a day later it got noticed by a couple of Solaris engineering people. Now the German version has more than 9,000 views (counting the Google Video and YouTube editions together) and counting. Hopefully we can cross the 10,000-view barrier with the English version, now that we have increased the potential audience :).

After watching the video, feel free to try out Solaris ZFS for yourself. There's nothing like building your own pool, then watching ZFS take care of your data. At home, ZFS keeps my photos, music and TV videos nice and tidy, including weekly snapshots thanks to Tim Foster's automatic snapshot SMF service. Just this Tuesday, my weekly zpool scrub cron job told me it had fixed a broken block on one of my disks. One I'd never have found out about with any other storage system.

To get started, get OpenSolaris here or download it here. All you need to do is check out the docs, though real system heroes only need two man pages: zpool(1M) and zfs(1M).

P.S.: CSI of course stands for "Computer Systems Integration". Any similarities to the popular TV show are purely coincidental. Really. Hmm, but maybe having a dead body or two in one of the next episodes might spice things up a little...

P.P.S.: The cool rock music at the beginning is from XING, a great rock band in which one of our colleagues plays drums. Go XING!

Update: Here is a much higher quality version, in case you want to show this video around on your laptop.

