Thursday Jun 26, 2008

ZFS and Mac OS X Time Machine: The Perfect Team

A few months ago, I wrote about "X4500 + Solaris ZFS + iSCSI = Perfect Video Editing Storage". And thanks to you, my readers, it became one of my most popular blog entries. Then I wrote about "VirtualBox and ZFS: The Perfect Team", which turned out to be another very popular blog article. Well, I'm glad to introduce you to another perfect team now: Solaris ZFS and Mac OS X Time Machine.

Actually, it began a long time ago: In December '06, Ben Rockwood wrote about the beauty of ZFS and iSCSI integration, and immediately I thought "That's the perfect solution to back up my Mac OS X PowerBook!" No more strings attached, just back up over WLAN to a really good storage device that lives on Solaris ZFS, while still using Mac OS X's native file system. But Apple didn't have an iSCSI initiator yet (they still don't have one now) and the only free iSCSI initiator I could find was buggy, unstable and didn't like Solaris targets at all.

Then, Apple announced their Time Machine technology. Many people thought this was related to them supporting ZFS, and indeed it's easy to believe that Time Machine's travels back in time are powered by ZFS snapshots. But they aren't. In reality, it's just a clever use of hardlinks. And not a very efficient one, either: Whenever a file changes, the whole file gets backed up again, even if you only changed a tiny bit of it.
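If you want to see that hardlink trick in action yourself, rsync's --link-dest option works the same way. A rough sketch with made-up paths (this illustrates the idea, not Apple's actual implementation):

rsync -a --link-dest=/backups/2008-06-25 /Users/constant/ /backups/2008-06-26/

Unchanged files in the new backup tree are just hardlinks into yesterday's tree, while a file that changed by a single byte gets copied again in full. That's exactly the inefficiency described above.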

Last week, a colleague of mine told me that Studio Network Solutions had updated their globalSAN iSCSI Initiator software for Mac OS X and that it now works well with Solaris iSCSI targets. I decided to give it another try. So, here are two ZFS ZVOLs sitting on my OpenSolaris 2008.05 home server:

Sun Microsystems Inc.   SunOS 5.11      snv_86  January 2008
-bash-3.2$ zfs list -rt volume 
NAME                             USED  AVAIL  REFER  MOUNTPOINT
santiago/volumes/aperturevault  6.50G   631G  6.50G  -
santiago/volumes/mbptm           193G   631G   193G  -
-bash-3.2$

They have both been shared as iSCSI targets through a single zfs set shareiscsi=on santiago/volumes command, thanks to ZFS' attribute inheritance:

-bash-3.2$ zfs get shareiscsi santiago/volumes
NAME              PROPERTY    VALUE             SOURCE
santiago/volumes  shareiscsi  on                local
-bash-3.2$ zfs get shareiscsi santiago/volumes/aperturevault
NAME                            PROPERTY    VALUE                           SOURCE
santiago/volumes/aperturevault  shareiscsi  on                              inherited from santiago/volumes
-bash-3.2$ zfs get shareiscsi santiago/volumes/mbptm
NAME                    PROPERTY    VALUE                   SOURCE
santiago/volumes/mbptm  shareiscsi  on                      inherited from santiago/volumes
-bash-3.2$ 
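For the record, creating and sharing such a ZVOL takes just two commands. A sketch (the 200 GB size and the sparse -s flag are my assumptions, adjust to taste):

zfs create -sV 200g santiago/volumes/mbptm
zfs set shareiscsi=on santiago/volumes

Every volume created below santiago/volumes afterwards inherits the shareiscsi property and shows up as a new iSCSI target automatically.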

On the Mac side, they show up in the globalSAN GUI just nicely:

globalSAN iSCSI GUI

And Disk Utility can format them perfectly as if they were real disks:

Disk Utility with 2 iSCSI disks 

Time Machine happily accepted one of the iSCSI disks and synced more than 190 GB to it just fine. As I type these lines, Aperture is busy syncing more than 40 GB of photos to the other iSCSI disk (it wouldn't accept a network share). Sometimes they're even busy working simultaneously :).

Of course, iSCSI performance depends heavily on network performance, so for larger transfers, a wired connection is mandatory. But the occasional Time Machine or Aperture sync in the background runs just fine over WLAN.

So finally, Solaris and Mac fans can have a Time Machine based on ZFS, with real data integrity, redundancy, robustness, two different ways of travelling through time (ZVOLs can be snapshotted just like regular ZFS file systems) and much more.

Many thanks to Christiano for letting me know and to the guys at Studio Network Solutions for making this possible. And of course to the ZFS team for a wonderful piece of open storage software!

Tuesday May 27, 2008

OpenSolaris Home Server: ZFS and USB Disks

My home server with a couple of USB disks

A couple of weeks ago, OpenSolaris 2008.05, project Indiana, saw its first official release. I've been looking forward to this moment so I can upgrade my home server and work laptop and start benefiting from the many cool features. If you're running a server at home, why not use the best server OS on the planet for it?

This is the first in a small series of articles about using OpenSolaris for home server use. I did a similar series some time ago and got a lot of good and encouraging feedback, so this is an update, or a remake, or home server 2.0, if you will.

I'm not much of a PC builder, but Simon has posted his experience with selecting hardware for his home server. I'm sure you'll find good tips there. In my case, I'm still using my trusty old Sun Java W1100z workstation, running in my basement. And for storing data, I like to use USB disks.

USB disk advantages

This is the moment where people start giving me those "Yeah, right" or "Are you serious?" looks. But USB disk storage has some cool advantages:

  • It's cheap. About 90 Euros for half a TB of disk from a major brand. Can't complain about that.
  • It's hot-pluggable. What happens if your server breaks and you want to access your data? With USB it's as easy as unplug from broken server, plug into laptop and you're back in business. And there's no need to shut down or open your server if you just want to add a new disk or change disk configuration.
  • It scales. I have 7 disks running in my basement. All I needed to do to make them work with my server was to buy a cheap 15 EUR 4-port USB card to expand my existing 5 USB ports. I still have 3 PCI slots left, so I could add 12 more disks at full USB 2.0 speed if I wanted.
  • It's fast enough. I measure about 10MB/s in write performance with a typical USB disk. That's about as fast as you can get over a 100 MBit/s LAN network which most people use at home. As long as the network remains the bottleneck, USB disk performance is not the problem.
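By the way, the 10 MB/s figure is easy to verify yourself. A rough sketch (pool and file names are placeholders; write a reasonably big file, since ZFS caching flatters short runs):

ptime dd if=/dev/zero of=/usbpool/testfile bs=1024k count=1000

ptime prints the elapsed time; divide the roughly 1 GB written by it and you have your MB/s. For comparison, 100 MBit/s Ethernet tops out at about 12 MB/s in theory, less in practice.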

ZFS and USB: A Great Team

But this is not enough. The beauty of USB disk storage lies in its combination with ZFS. When adding some ZFS magic to the above, you also get:

  • Reliability. USB disks can be mirrored or used in a RAID-Z/Z2 configuration. Each disk may be unreliable (because they're cheap) individually, but thanks to ZFS' data integrity and self-healing properties, the data will be safe and FMA will issue a warning early enough so disks can be replaced before any real harm can happen.
  • Flexibility. Thanks to pooled storage, there's no need to wonder what disks to use for what and how. Just build up a single pool with the disks you have, then assign filesystems to individual users, jobs, applications, etc. on an as-needed basis.
  • Performance. Suppose you upgrade your home network to Gigabit Ethernet. No need to worry: The more disks you add to the pool, the better your performance will be. Even if the disks are cheap.

Together, USB disks and ZFS make a great team. Not enterprise class, but certainly an interesting option for a home server.
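Getting started is a one-liner (the device names here are made up; format or rmformat will list your real ones):

zpool create usbpool mirror c8t0d0 c9t0d0

And when you buy the next pair of disks, growing the pool is just as short:

zpool add usbpool mirror c10t0d0 c11t0d0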

ZFS & USB Tips & Tricks

So here's a list of tips, tricks and hints you may want to consider when daring to use USB disks with OpenSolaris as a home server:

  • Mirroring vs. RAID-Z/Z2: RAID-Z (or its more reliable cousin RAID-Z2) is tempting: You get more space for less money. In fact, my earlier versions of zpools at home were a combination of RAID-Z'ed leftover slices with the goal to squeeze as much space as possible at some reliability level out of my mixed disk collection.
    But say you have a 3+1 RAID-Z and want to add some more space. Would you buy 4 disks at once? Isn't that a bit big, granularity-wise?
    That's why I decided to keep it simple and just mirror. USB disks are cheap enough, no need to get even cheaper. My current zpool has a pair of 1 TB USB disks and a pair of 512 GB USB disks and works fine.
    Another advantage of this approach is that you can organically modernize your pool: Wait until one of your disks starts showing some flakiness (FMA and ZFS will warn you as soon as the first broken data block has been repaired). Then replace the disk with a bigger one, then its mirror with the same, bigger size. That will give you more space without the complexity of too many disks and keep them young enough not to be a serious threat to your data. Use the replaced disks for scratch space or less important tasks.
  • Instant replacement disk: A few weeks ago, one of my mirrored disks showed its first write error. It was a pair of 320GB disks, so I ordered a 512GB replacement (with the plan to order the second one later). But now, my mirror may be vulnerable: What if the second disk starts breaking before the replacement has arrived?
    That's why having a few old but functional disks around can be very valuable: In my case, took a 200GB and a 160GB disk and combined them into their own zpool:
    zpool create temppool c11t0d0 c12t0d0
    Then, I created a new ZVOL sitting on the new pool:
    zfs create -sV 320g temppool/tempvol
    Here's our temporary replacement disk! I then attached it to my vulnerable mirror:
    zpool attach santiago c10t0d0 /dev/zvol/dsk/temppool/tempvol
    And voilà, my precious production pool started resilvering the new virtual disk. After the new disk arrives and has been resilvered, the temporary disk can be detached, destroyed and its space put to some other good use.
    Storage virtualization has never been so easy!
  • Don't forget to scrub: Especially with cheap USB disks, regular scrubbing is important. Scrubbing will check each and every block of your data on disk and make sure it's still valid. If not, it will repair it (since we're mirroring or using RAID-Z/Z2) and tell you what disk had a broken block so you can decide whether it needs to be replaced or not just yet.
    How often you want to or should scrub depends on how much you trust your hardware and how much your data is being read out anyway (any data that is read is automatically checked, so that particular portion of the data is already "scrubbed", if you will). I find scrubbing once every two weeks a useful cycle; others may prefer once a month or once a week.
    But scrubbing is a process that needs to be initiated by the administrator. It doesn't happen by itself, so it is important that you think of issuing the "zpool scrub" command regularly, or better, set up a cronjob for it to happen automatically.
    As an example, the following line:
    23 01 1,15 * * for i in `zpool list -H -o name`; do zpool scrub $i; done
    in your crontab will start a scrub for each of your zpools twice a month on the 1st and the 15th at 01:23 AM.
  • Snapshot often: Snapshots are cheap, but they can save the world if you accidentally deleted that important file. Same rule as with scrubbing: Do it. Often enough. Automatically. Tim Foster did a great job of implementing an automatic ZFS snapshot service, so why don't you just install it now and set up a few snapshot schemes for your favourite ZFS filesystems?
    The home directories on my home server are snapshotted once a month (and all snapshots are kept), once a week (keeping 52 snapshots) and once a day (keeping 31 snapshots). This gives me a time-machine with daily, weekly and monthly granularities depending on how far back in time I want to travel through my snapshots.
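If you'd rather roll your own than install the service, a crontab entry similar to the scrubbing one above gives you simple time-based snapshots. A minimal sketch (the filesystem name is a placeholder, and cleaning up old snapshots is left out):

23 02 * * * zfs snapshot santiago/export/home@daily-`date +\%Y-\%m-\%d`

This takes a snapshot named after the date at 02:23 every night; the backslashes before the % signs are required because cron treats bare % characters specially.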

So, USB disks aren't bad. In fact, thanks to ZFS, USB disks can be very useful building blocks for your own little cost-effective but reliable and integrity-checked data center.

Let me know what experiences you've had using USB storage at home or with ZFS, and what tips and tricks have worked well for you. Just enter a comment below or send me email!

Tuesday Feb 19, 2008

VirtualBox and ZFS: The Perfect Team

I've never installed Windows in my whole life. My computer history includes systems like the Dragon 32, the Commodore 128, then the Amiga, Apple PowerBook (68k and PPC) etc. plus the occasional Sun system at work. Even the laptop my company provided me with only runs Solaris Nevada, nothing else. Today, this has changed. 

A while ago, Sun announced the acquisition of Innotek, the makers of the open-source virtualization software VirtualBox. After having played with it for a while, I'm convinced that this is one of the coolest innovations I've seen in a long time. And I'm proud to see another innovative German company join the Sun family. Welcome, Innotek!

Here's why this is so cool.

Windows XP running on VirtualBox on Solaris Nevada

After having upgraded my laptop to Nevada build 82, I had VirtualBox up and running in a matter of minutes. OpenSolaris Developer Preview 2 (Project Indiana) runs fine on VirtualBox, so does any recent Linux (I tried Ubuntu). But Windows just makes for a much cooler VirtualBox demo, so I did it:

After 36 years of Windows freedom, I ended up installing it on my laptop, albeit on top of VirtualBox. Safer XP, if you will. Above, you can see my VirtualBox running Windows XP in all its Tele-Tubby-ish glory.

As you can see, this is a plain vanilla install, I just took the liberty of installing a virus scanner on top. Well, you never know...

So far, so good. Now let's do something others can't. First of all, this virtual machine uses a .vdi disk image to provide hard disk space to Windows XP. On my system, the disk image sits on top of a ZFS filesystem:

# zfs list -r poolchen/export/vm/winxp
NAME                                                          USED  AVAIL  REFER  MOUNTPOINT
poolchen/export/vm/winxp                                     1.22G  37.0G    20K  /export/vm/winxp
poolchen/export/vm/winxp/winxp0                              1.22G  37.0G  1.05G  /export/vm/winxp/winxp0
poolchen/export/vm/winxp/winxp0@200802190836_WinXPInstalled   173M      -   909M  -
poolchen/export/vm/winxp/winxp0@200802192038_VirusFree           0      -  1.05G  -

Cool thing #1: You can do snapshots. In fact, I have two snapshots here. The first is from this morning, right after the Windows XP installer went through; the second has been created just now, after installing the virus scanner. Yes, there has been some time between the two snapshots, with lots of testing, day job and the occasional rollback. But hey, that's why snapshots exist in the first place.
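For reference, taking and reverting those snapshots are single commands (the names below are the ones from the listing above; roll back only while the VM is powered off):

zfs snapshot poolchen/export/vm/winxp/winxp0@200802192038_VirusFree
zfs rollback poolchen/export/vm/winxp/winxp0@200802192038_VirusFree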

Cool thing #2: This is a compressed filesystem:

# zfs get all poolchen/export/vm/winxp/winxp0
NAME                             PROPERTY         VALUE                    SOURCE
poolchen/export/vm/winxp/winxp0  type             filesystem               -
poolchen/export/vm/winxp/winxp0  creation         Mon Feb 18 21:31 2008    -
poolchen/export/vm/winxp/winxp0  used             1.22G                    -
poolchen/export/vm/winxp/winxp0  available        37.0G                    -
poolchen/export/vm/winxp/winxp0  referenced       1.05G                    -
poolchen/export/vm/winxp/winxp0  compressratio    1.53x                    -
...
poolchen/export/vm/winxp/winxp0  compression      on                       inherited from poolchen

ZFS has already saved me more than half a gigabyte of precious storage capacity!

Next, we'll try out Cool thing #3: Clones. Let's clone the virus free snapshot and try to create a second instance of Win XP from it:

# zfs clone poolchen/export/vm/winxp/winxp0@200802192038_VirusFree poolchen/export/vm/winxp/winxp1
# ls -al /export/vm/winxp
total 12
drwxr-xr-x   5 constant staff          4 Feb 19 20:42 .
drwxr-xr-x   6 constant staff          5 Feb 19 08:44 ..
drwxr-xr-x   3 constant staff          3 Feb 19 18:47 winxp0
drwxr-xr-x   3 constant staff          3 Feb 19 18:47 winxp1
dr-xr-xr-x   3 root     root           3 Feb 19 08:39 .zfs
# mv /export/vm/winxp/winxp1/WindowsXP_0.vdi /export/vm/winxp/winxp1/WindowsXP_1.vdi

The clone has inherited the mountpoint from the upper level ZFS filesystem (the winxp one) and so we have everything set up for VirtualBox to create a second Win XP instance from. I just renamed the new container file for clarity. But hey, what's this?

VirtualBox Error Message 

Damn! VirtualBox didn't fall for my sneaky little clone trick. Hmm, where is this UUID stored in the first place?

# od -A d -x WindowsXP_1.vdi | more
0000000 3c3c 203c 6e69 6f6e 6574 206b 6956 7472
0000016 6175 426c 786f 4420 7369 206b 6d49 6761
0000032 2065 3e3e 0a3e 0000 0000 0000 0000 0000
0000048 0000 0000 0000 0000 0000 0000 0000 0000
0000064 107f beda 0001 0001 0190 0000 0001 0000
0000080 0000 0000 0000 0000 0000 0000 0000 0000
*
0000336 0000 0000 0200 0000 f200 0000 0000 0000
0000352 0000 0000 0000 0000 0200 0000 0000 0000
0000368 0000 c000 0003 0000 0000 0010 0000 0000
0000384 3c00 0000 0628 0000 06c5 fa07 0248 4eb6
0000400 b2d3 5c84 0e3a 8d1c 8225 aae4 76b5 44f5
0000416 aa8f 6796 283f db93 0000 0000 0000 0000
0000432 0000 0000 0000 0000 0000 0000 0000 0000
0000448 0000 0000 0000 0000 0400 0000 00ff 0000
0000464 003f 0000 0200 0000 0000 0000 0000 0000
0000480 0000 0000 0000 0000 0000 0000 0000 0000
*
0000512 0000 0000 ffff ffff ffff ffff ffff ffff
0000528 ffff ffff ffff ffff ffff ffff ffff ffff
*
0012544 0001 0000 0002 0000 0003 0000 0004 0000

Ahh, it seems to be stored at byte 392, with varying degrees of byte- and word-swapping. Some further research reveals that you'd better leave the first part of the UUID alone (I'll spare you the details...); instead, the last 6 bytes: 845c3a0e1c8d, sitting at bytes 402-407, look like a great candidate for an arbitrary serial number. Let's try changing them (this is a hack for demo purposes only, please don't do this in production):

# dd if=/dev/random of=WindowsXP_1.vdi bs=1 count=6 seek=402 conv=notrunc
6+0 records in
6+0 records out
# od -A d -x WindowsXP_1.vdi | more
0000000 3c3c 203c 6e69 6f6e 6574 206b 6956 7472
0000016 6175 426c 786f 4420 7369 206b 6d49 6761
0000032 2065 3e3e 0a3e 0000 0000 0000 0000 0000
0000048 0000 0000 0000 0000 0000 0000 0000 0000
0000064 107f beda 0001 0001 0190 0000 0001 0000
0000080 0000 0000 0000 0000 0000 0000 0000 0000
*
0000336 0000 0000 0200 0000 f200 0000 0000 0000
0000352 0000 0000 0000 0000 0200 0000 0000 0000
0000368 0000 c000 0003 0000 0000 0010 0000 0000
0000384 3c00 0000 0628 0000 06c5 fa07 0248 4eb6
0000400 b2d3 2666 6fbb c1ca 8225 aae4 76b5 44f5
0000416 aa8f 6796 283f db93 0000 0000 0000 0000
0000432 0000 0000 0000 0000 0000 0000 0000 0000
0000448 0000 0000 0000 0000 0400 0000 00ff 0000
0000464 003f 0000 0200 0000 0000 0000 0000 0000
0000480 0000 0000 0000 0000 0000 0000 0000 0000
*
0000512 0000 0000 ffff ffff ffff ffff ffff ffff
0000528 ffff ffff ffff ffff ffff ffff ffff ffff
*
0012544 0001 0000 0002 0000 0003 0000 0004 0000

Who needs a hex editor if you have good old friends od and dd on board? The trick is in the "conv=notrunc" part. It tells dd to leave the rest of the file as-is and not truncate it after doing its patching job. Let's see if it works:

VirtualBox with two Windows VMs, one ZFS-cloned from the other.

Eureka, it works! Notice that the second instance is running with the freshly patched harddisk image, as shown in the window above.

Windows XP booted without any problem from the ZFS-cloned disk image. There was just the occasional popup message from Windows saying that it found a new harddisk (well observed, buddy!).

Thanks to ZFS clones we can now create new virtual machine clones in just seconds without having to wait a long time for disk images to be copied. Great stuff. Now let's do what everybody should be doing to Windows once a virus scanner is installed: Install Firefox:

Cloned WinXP instance, running Firefox

I must say that the performance of VirtualBox is stunning. It sure feels like the real thing, you just need to make sure to have enough memory in your real computer to support both OSes at once, otherwise you'll run into swapping hell...

BTW: You can also use ZFS volumes (called ZVOLs) to provide storage space to virtual machines. You can snapshot and clone them just like regular file systems, plus you can export them as iSCSI devices, giving you the flexibility of a SAN for all your virtualized storage needs. The reason I chose files over ZVOLs was just so I can swap pre-installed disk images with colleagues. On second thought, you can dump/restore ZVOL snapshots with zfs send/receive just as easily...
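In case you want to try the ZVOL route, here's a sketch of what it might look like (sizes and names are arbitrary):

zfs create -V 10g poolchen/export/vm/winxpvol
zfs snapshot poolchen/export/vm/winxpvol@clean
zfs clone poolchen/export/vm/winxpvol@clean poolchen/export/vm/winxpvol2

The resulting block devices appear under /dev/zvol/dsk/poolchen/export/vm/, and zfs set shareiscsi=on turns any of them into an iSCSI target.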

Anyway, let's see how we're doing storage-wise:

# zfs list -rt filesystem poolchen/export/vm/winxp
NAME                              USED  AVAIL  REFER  MOUNTPOINT
poolchen/export/vm/winxp         1.36G  36.9G    21K  /export/vm/winxp
poolchen/export/vm/winxp/winxp0  1.22G  36.9G  1.05G  /export/vm/winxp/winxp0
poolchen/export/vm/winxp/winxp1   138M  36.9G  1.06G  /export/vm/winxp/winxp1

Watch the "USED" column for the winxp1 clone. That's right: Our second instance of Windows XP only cost us a meager 138 MB on top of the first instance's 1.22 GB! Both filesystems (and their .vdi containers with Windows XP installed) represent roughly a Gigabyte of storage each (the REFER column), but the actual physical space our clone consumes is just 138MB.

Cool thing #4: ZFS clones save even more space, big time!

How does this work? Well, when ZFS creates a snapshot, it only creates a new reference to the existing on-disk tree-like block structure, indicating where the entry point for the snapshot is. If the live filesystem changes, only the changed blocks need to be written to disk, the unchanged ones remain the same and are used for both the live filesystem and the snapshot.

A clone is a snapshot that has been marked writable. Again, only the changed (or new) blocks consume additional disk space (in this case Firefox and some WinXP temporary data), everything that is unchanged (in this case nearly all of the WinXP installation) is shared between the clone and the original filesystem. This is de-duplication done right: Don't create redundant data in the first place!

That was only one example of the tremendous benefits Solaris can bring to the virtualization game. Imagine the power of ZFS, FMA, DTrace, Crossbow and whatnot for providing the best infrastructure possible to your virtualized guest operating systems, be they Windows, Linux, or Solaris. It works in the SPARC world (through LDOMs), and in the x86/x64 world through xVM server (based on the work of the Xen community) and now joined by VirtualBox. Oh, and it's free and open source, too.

So with all that: Happy virtualizing, everyone. Especially to everybody near Stuttgart.

Thursday Dec 06, 2007

X4500 + Solaris ZFS + iSCSI = Perfect Video Editing Storage

Digital video editing is one of those applications that tend to be very data hungry. At SD PAL resolution, we're talking about 720 pixels x 576 lines x 3 bytes of color x 25 full frames per second = about 30 MB/s of data. That's about 224 GB for a 2 hour feature film. Not counting audio (that would only be around 3-4 GB). And we (in Germany) haven't looked at HD or Digital Cinema a lot yet...

During the last couple of weeks I worked with a customer who bought a Sun Fire X4500 server (you know, Thumper). The plan is to run Solaris ZFS on it, then provide big iSCSI volumes to the video editing systems, which tend to be specialized Windows or Mac OS X machines. Wonderful idea: Just use zpool create to combine a number of disks with some RAID level into a pool, then zfs create -V to create a ZVOL. Thanks to zfs shareiscsi=on, sharing the volume over iSCSI is dead easy.
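On the server side, the whole plan boils down to three commands. A sketch with made-up device names and the naive initial volume size:

zpool create videopool raidz c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0
zfs create -V 5t videopool/videovolume
zfs set shareiscsi=on videopool/videovolume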

But it didn't work.

First, Windows wouldn't mount the iSCSI volume. After some trying, we discovered that there must be an upper limit of 2 TB to the size of iSCSI volumes that Windows can mount (we initially tried something like 5 or 10 TB). So be it: zfs create -V 2047G videopool/videovolume.

Now it mounted ok, we formatted the disk with NTFS (yuck!) and started the editing system's speed test. Then came the real issue: The test reported a write performance of 8-10 MB/s, but the editing system needs something like 30 MB/s sustained to be able to record reliably!

After some trying, we started the systematic approach:

  • A simple dd from one disk to another yielded >39 MB/s.
  • dd'ing from one small ZFS pool to another exceeded 120 MB/s (I later learned that cp is a better benchmark because it works asynchronously with large chunks of data vs. dd's synchronous block approach), so that was again more than we needed.
  • We tried re-attaching our ZVOL through iscsiadm to test the iSCSI stack's performance and ran into a TCP fusion issue. Ok, I've always wanted to play with mdb, so we followed the workaround instructions and we were able to attach our own ZVOL over the loopback interface. Slightly less performance (due to up the stack, down the stack effects, I presume) but still way more than we needed. So, it wasn't the X4500's nor ZFS' fault.

Finally, Danilo pointed me into the right direction: Nagle's algorithm. What usually helps maximize network bandwidth turns out to be a killer for iSCSI performance. For Solaris iSCSI clients, we know this already, but how do we turn off Nagle on Windows?

The answer is deeply buried inside Microsoft's iSCSI Initiator user guide: The "Addressing Slow Performance with iSCSI Clusters" chapter mentions a similar issue (although they talk about read, not write performance) and they do mention RFC 1122's delayed ACK feature, which is related to Nagle's algorithm. The Microsoft document suggests a workaround which involves setting a variable in the registry, so it was worth a try (and my vengeance for having to use mdb before).
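On Windows Server 2003, the change amounts to a one-liner on the iSCSI client, something like the following sketch (the interface GUID is a placeholder you need to look up for your iSCSI NIC, and a reboot is needed afterwards):

reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{Interface-GUID} /v TcpAckFrequency /t REG_DWORD /d 1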

And lo and behold, the speed test now yielded 90-100 MB/s (close to Gigabit Ethernet's raw performance)! Yippee, that was it! One little registry entry on the client side gave us a 10x improvement in iSCSI performance!

Now, can someone explain to me, why on Windows 2000 you need to set "TcpAckDelTicks=0" while on Windows 2003 the same thing is accomplished by saying "TcpAckFrequency=1" (which is the same thing, only seen from the other side of the division sign)?

So, to all you storage hungry video editors out there: The Sun Fire X4500 with Solaris ZFS and iSCSI is a great solution for reliable, fast, easy to use and inexpensive video storage. You just need to know how to tell your TCP/IP stack to not delay ACKs...
 

Friday Oct 12, 2007

Final CEC Reflections: The Wynn, ZFS Under the Hood, Messaging wrap-up

I'm now back home, sorting through emails and cleaning up some stuff before a regular week of work begins. Here are some highlights from Tuesday and Wednesday during the Sun CEC 2007 conference in Las Vegas:

  • The Wynn: After visiting the CEC Party, Barton, Yan, Henning and I decided to have dinner at the Wynn. It's one of the newest hotels in town and a must-see. This place sure has style! We went to the Daniel Boulud Brasserie, which is located inside the hotel at the Lake of Dreams. This is one of the few restaurants in Las Vegas where you can actually eat outside the ever-present air-conditioning and enjoy a great view. The lake features a gigantic rectangular waterfall surrounded by a forest. The lake and the waterfall are part of several mini-shows that occur at regular intervals in the afternoon, featuring impressive animatronics such as a head coming out of the water (with projected animated faces) or a gigantic frog leaning over the waterfall's back, which also serves as a huge video screen. Music, light and animation are perfectly synchronized, so that for instance the head emerging from the water perfectly matches its projected face, or the light ripples running over the lake perfectly match the animation on screen.
    This is definitely my favourite Vegas hotel now. I wonder where our stock price needs to be to afford having our next CEC at their convention center :).
  • ZFS Under the Hood: This was a great session done by Jason Bantham and Jarod Nash. They went through the ZFS Source Tour diagram and explained the little boxes one by one while describing the basic processes, mechanisms and data flow that ZFS uses to write and read data. And they were fun speakers, too! Plus, each attendee that asked a question got a piece of chocolate thrown at them as an extra incentive to participate :).
  • Podcasting: After the closing session, Franz, Matthias, Brian, Dave and I recorded episode 3 of the CEC 2007 Podcast. We reflected on our impressions of the conference and on our project to aggregate and display audience messages during the main sessions. Actually, I'm cleaning up and commenting the JavaFX code as we speak to publish it in the next post for your code-reading pleasure :).

Monday Sep 17, 2007

The Importance of Archiving (For the Rest of us)

A few weeks ago I was on the road with Dave Cavena. He works as an SE in Hollywood and helps our customers there understand the importance of digitally archiving their movies. The issue here is a simple one: Today, movies are being archived by storing rolls of film on shelves in gigantic warehouses and hoping they'll survive for a few years to come. "Few" could be tens or maybe hundreds of years, but nobody really knows how long they'll survive and how good the movies will look after a couple of decades of archiving. Will the colours look natural? Will there be scratches? Will the film material degrade so that the movie rips right in the middle of the most important scene? Or will it spontaneously decompose into a heap of dust when someone opens the door after 150 years to see what the heck people were keeping in that warehouse anyway?

Digital data, on the other hand, can be kept indefinitely, and it can stay perfect through eternity, if people store it right. "Right" means things like keeping redundant copies in geographically distant places (so that the movie survives a warehouse fire or an earthquake), periodic integrity checking and fault healing based on those redundant copies (so silent data corruption can be detected and corrected) and periodic copy and conversion cycles so the data can survive format and media evolution. Try playing "Dragon's Lair", a classic arcade game from the '80s which was originally produced for Laserdisc-based arcade machines. I loved the game back then and I was glad I found it on DVD. Now look it up on Amazon: You'll find Blu-ray and HD-DVD versions as well!

Dave and some other very bright people have written an interesting white paper on "Archiving Movies in a Digital World". It is a great read: It shows why archiving movies the digital way is so important (so they can't get lost), how to do it and why this is actually cheaper than keeping rolls of film in warehouses (Hint: Archiving bits takes up less real estate and looks a lot cooler if you use one of these. Your archives may even become smart by using one of these, too!).

That got me thinking: What will happen to all those photos that people are taking using their private digital camera? Parties, Vacations, Babies, Families, etc.? Yes, if people don't start thinking about a good archiving strategy soon, they will all be lost in the next couple of decades. By the time my little daughter gets married, I might have lost all of her baby pictures if I don't do something real quick (as in: This decade).

Storing them on a file server running a serious, enterprise-class OS with ZFS on a set of redundant storage media (could be disks, could be more disks, iSCSI devices, could be USB sticks, it really doesn't matter) is a good start, because it can provide 100% data integrity and self-heal any damages before they become permanent. But this is still not enough. They need to be stored in multiple locations and they need to be periodically copied into more recent media.

I'll figure out the multiple locations part someday (probably a second server that replicates the first file server's ZFS file systems through a couple of zfs send/receive scripts) and the periodical copying means I'll still have an interesting hobby when my daughter has children of her own. Meanwhile, just to be sure, I've started to copy all of my photos to a popular photo-sharing service called Flickr on the net. Yes, all of them. This means I can still decide who can look at my pictures and who not, I get access to all of my pictures from anywhere with a web browser and I can store as many photos as I want for just a small annual fee. And they'll still be there should my basement catch fire or should all of my disks die for some strange statistical reason or when I start taking 3D photographs of my grandchildren. What more could anybody want?

Now what if the movie industry found out about this and started archiving their movies on Flickr as well, image after image, all of them, for just the same small annual fee?

Update (Sep. 18th, 2007): Thanks to Jesse's comment, I've now tried SmugMug and I love it! They offer a 50% discount for Flickr refugees during the first year and there's a nice tool called SmuggLr that makes migration a snap. Thank you Jesse!

Thursday Sep 06, 2007

7 Easy Tips for ZFS Starters

So you're now curious about ZFS. Maybe you read Jonathan's latest blog entry on ZFS or you've followed some other buzz on the Solaris ZFS file system or maybe you saw a friend using it. Now it's time for you to try it out yourself. It's easy and here are seven tips to get you started quickly and effortlessly:

1. Check out what Solaris ZFS can do for you

First, try to compose yourself a picture of what the Solaris ZFS filesystem is, what features it has and how it can work to your advantage. Check out the CSI:Munich video for a fun demo on how Solaris ZFS can turn 12 cheap USB memory sticks into highly available, enterprise-class, robust storage. Of course, what works with USB sticks also works with your own harddisks or any other storage device. Also, there are great ZFS screencasts that show you some more powerful features in an easy to follow way. Finally, there's a nice writeup on "What is ZFS?" at the OpenSolaris ZFS Community's homepage.

2. Read some (easy) documentation

It's easy to configure Solaris ZFS. Really. You just need to know two commands: zpool (1M) and zfs (1M). That's it. So, get your hands onto a Solaris system (or download and install it for free) and take a look at those manpages. If you still want more, then there's of course the ZFS Administration Guide with detailed planning, configuration and troubleshooting steps. If you want to learn even more, check out the OpenSolaris ZFS Community Links page. German-speaking readers are invited to read my german white paper on ZFS or listen to episode #006 of the POFACS podcast.

3. Dive into the pool

Solaris ZFS manages your storage devices in pools. Pools are a convenient way of abstracting storage hardware and turning it into a repository of blocks to store your data in. Each pool takes a number of devices and applies an availability scheme (or none) to it. Pools can then be easily expanded by adding more disks to them. Use pools to manage your hardware and its availability properties. You could create a mirrored pool for data that should be protected against disk failure and that needs fast access to hardware. Then, you could add another pool using RAID-Z (which is similar, but better than RAID-5) for data that needs to be protected but where performance is not the first priority. For scratch, test or demo data, a pool without any RAID scheme is ok, too. Pools are easily created:

zpool create mypool mirror c0d0 c1d0

Will create a mirror out of the two disk devices c0d0 and c1d0. Similarly, you can easily create a RAID-Z pool by saying:

zpool create mypool raidz c0d0 c1d0 c2d0

The easiest way to turn a disk into a pool is:

zpool create mypool c0d0

It's that easy. All the complexity of finding, sanity-checking, labeling, formatting and managing disks is hidden behind this simple command.

If you don't have any spare disks to try this out with, then you can just create yourself some files, then use them as if they were block devices:

# mkfile 128m /export/stuff/disk1
# mkfile 128m /export/stuff/disk2
# zpool create testpool mirror /export/stuff/disk1 /export/stuff/disk2
# zpool status testpool
  pool: testpool
 state: ONLINE
 scrub: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        testpool                 ONLINE       0     0     0
          mirror                 ONLINE       0     0     0
            /export/stuff/disk1  ONLINE       0     0     0
            /export/stuff/disk2  ONLINE       0     0     0

errors: No known data errors

The cool thing about this procedure is that you can create as many virtual disks as you like and then test ZFS's features such as data integrity, self-healing, hot spares, RAID-Z and RAID-Z2 etc. without having to find any free disks.

When creating a pool for production data, think about redundancy. There are three basic properties to storage: availability, performance and space. And it's a good idea to prioritize them in that order: Make sure you have redundancy (mirroring, RAID-Z, RAID-Z2) so ZFS can self-heal data when stuff goes wrong at the hardware level. Then decide how much performance you want. Generally, mirroring is faster and more flexible than RAID-Z/Z2, especially if the pool is degraded and ZFS needs to reconstruct data. Space is the cheapest of all three, so don't be greedy and try to give priority to the other two. Richard Elling has some great recommendations on RAID, space and MTTDL. Roch has also posted a great article on mirroring vs. RAID-Z.

4. The power to give

Once you have set up your basic pool, you can already access your new ZFS file system: Your pool has been automatically mounted for you in the root directory. If you followed the examples above, then you can just cd to /mypool and start using ZFS!

But there's more: Creating additional ZFS file systems that use your pool's resources is very easy, just say something like:

zfs create mypool/home
zfs create mypool/home/johndoe
zfs create mypool/home/janedoe

Each of these commands only takes seconds to complete and every time you will get a full new file system, already set up and mounted for you to start using it immediately. Notice that you can manage your ZFS filesystems hierarchically as seen above. Use pools to manage storage properties at the hardware level, use filesystems to present storage to your users and applications. Filesystems have properties (compression, quotas, reservations, etc.) that you can easily administer using zfs set and that are inherited across the hierarchy. Check out Chris Gerhard's blog on more thoughts about file system organization.
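Setting those properties is equally quick. A small sketch using the filesystems created above (the values are just examples):

zfs set compression=on mypool/home
zfs set quota=10g mypool/home/johndoe
zfs set reservation=1g mypool/home/janedoe
zfs get -r quota,reservation mypool/home

The compression setting is inherited by every home directory below mypool/home, while the quota and reservation apply to the individual users.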

5. Snapshot early, snapshot often

ZFS snapshots are quick, easy and cheap. Much cheaper than the horrible experience when you realize that you just deleted a very important file that hasn't been backed up yet! So, use snapshots whenever you can. If you're wondering whether to snapshot or not, just do it. I recently spent only about $220 on two 320 GB USB disks for my home server to expand my pool with. At these prices, the time you spend thinking about whether to snapshot or not may be worth more than just buying more disk.

Again, Chris has some wisdom on this topic in his ZFS snapshot massacre blog entry. He once had over 60000 snapshots and he's snapshotting filesystems by the minute! Since snapshots in ZFS "just work" and since they only take up the space that actually changes between snapshots, there's really no reason not to snapshot all the time. Maybe once per minute is a little exaggerated, but once a week, once per day or once an hour per active filesystem is definitely good advice.

Instead of time-based snapshotting, Chris came up with the idea of snapshotting a file system shared with Samba whenever the Samba user logs in!
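Taking a snapshot and getting a file back out of it is as simple as this (the file and snapshot names are made up):

zfs snapshot mypool/home/johndoe@before-cleanup
zfs list -t snapshot
cp /mypool/home/johndoe/.zfs/snapshot/before-cleanup/letter.txt /mypool/home/johndoe/

Note the hidden .zfs directory at the root of every ZFS filesystem: it lets you fish single files out of snapshots without any rollback and without administrator involvement.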

6. See the Synergy

ZFS by itself is very powerful. But the full beauty of it can be unleashed by combining ZFS with other great Solaris 10 features. Here are some examples:

  • Tim Foster has written a great SMF service that will snapshot your ZFS filesystems on a regular basis. It's fully automatic, configurable and integrated with SMF in a beautiful way.

  • ZFS can create block devices, too. They are called zvols. Since Nevada build 54, they are fully integrated into the Solaris iSCSI infrastructure. See Ben Rockwood's blog entry on the beauty of iSCSI with ZFS.

  • A couple of people are now elevating this concept even further: Take two Thumpers, create big zvols inside them, export them through iSCSI and mirror over them with ZFS on a server. You'll get a huge, distributed storage subsystem that can be easily exported and imported on a regular network. A poor man's SAN and a powerful shared storage for future HA clusters thanks to ZFS, iSCSI and Thumper! Jörg Möllenkamp is taking this concept a bit further by thinking about ZFS, iSCSI, Thumper and SAM-FS.

  • Check out some cool Sun StorageTek Availability Suite and ZFS demos here.

  • ZFS and boot support is still in the works, but if you're brave, you can try it out with the newer Solaris Nevada distributions on x64 systems. Think about the possibilities together with Solaris Live Upgrade! Create a new boot environment in seconds while not needing to find or dedicate a new partition, thanks to snapshots, while saving most of the needed disk space!

And that's only the beginning. As ZFS becomes more and more adopted, we'll see many more creative uses of ZFS with other Solaris 10 technologies and other OSes.

7. Beam me up, ZFS!

One of the most amazing features of ZFS is zfs send/receive. zfs send will turn a ZFS filesystem into a bitstream that you can save to a file, pipe through bzip2 for compression or send through ssh to a distant server for archiving or for remote replication through the corresponding zfs receive command. It also supports incremental sending and receiving out of subsequent snapshots through the -i modifier.
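Here's what that looks like in practice, as a minimal sketch (host and pool names are made up):

zfs snapshot mypool/home/johndoe@monday
zfs send mypool/home/johndoe@monday | ssh backupserver zfs receive backuppool/johndoe

zfs snapshot mypool/home/johndoe@tuesday
zfs send -i monday mypool/home/johndoe@tuesday | ssh backupserver zfs receive backuppool/johndoe

The first pair of commands transfers the complete filesystem; the second only sends the blocks that changed between monday and tuesday.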

This is a powerful feature with a lot of uses:

  • Create your Solaris zone as a ZFS filesystem, complete with applications, configuration, automation scripts, users etc., zfs send | bzip2 >zone_archive.zfs.bz2 it for later use. Then, unpack and create hundreds of cloned zones out of this master copy.

  • Easily migrate ZFS filesystems between pools on the same machine or on distant machines (through ssh) with zfs send/receive.

  • Create a crontab entry that takes a snapshot every minute, then zfs send -i it over ssh to a second machine where it is piped into zfs receive. Tadah! You'll get free, finely-grained, online remote replication of your precious data.

  • Easily create efficient full or incremental backups of home directories (each in their own ZFS filesystems) through ZFS send. Again, you can compress them and treat them like you would, say, treat a tar archive.

See? It is easy, isn't it? I hope this guide helps you find your way around the world of ZFS. If you want more, drop by the OpenSolaris ZFS Community, we have a mailing list/forum where bright and friendly people hang out that will be glad to help you.

Thursday Aug 16, 2007

ZFS Snapshot Replication Script

This script recursively copies ZFS filesystems and their snapshots to another pool. It can also generate a new snapshot for you while copying, as well as handle unmounts, remounts and SMF services that need to be temporarily brought down during file system migration. I hope it's useful to you as well!

Sunday Aug 12, 2007

ZFS Interview in the POFACS Podcast (German)

Last week, I was interviewed by the German podcast POFACS, the podcast for alternative computer systems. Today, the interview went live, so if you happen to understand German and want to learn about ZFS while driving to work or while jogging, you're invited to listen to the interview.

I was actually amazed at how long the interview turned out: It's 40 minutes, while recording the piece only felt like 20 minutes or so. The average commute time in Germany is about 20 minutes, so this interview will easily cover both ways to and from work. But there's more: This episode of POFACS also introduces you to the NetBSD operating system and the German Unix User Group GUUG. Finally, the guys at POFACS were also so kind as to feature the HELDENFunk podcast in a short introductory interview. Thanks!

So with a total playing time of 1 hour and 20 minutes, this episode has you covered for at least two commutes or a couple of jogging runs :).

Tuesday Jul 31, 2007

New Year's Resolutions

Yesterday, we announced good financial results for the last fiscal year 07. Very good financial results. I like working for a profitable company; it makes so many things so much easier.

Tomorrow, I'm going to have a meeting with my managers to discuss what to do next. Since we're early in the new fiscal year 08, here are some new year's priorities for my FY08 at Sun:

  • Web 2.0: I've been talking to customers, partners and Sun people in Germany about Web 2.0 a number of times. Every time, the feedback has been very clear: We want More! So I'm going to do more Web 2.0 related stuff: More blogging, podcasting, perhaps a successor to the now famous ZFS movie, more participation in social networking sites, including del.icio.us, XING and Facebook, more evangelizing and of course more insight into where this journey is headed to.
  • Technology: Sun is all about technology. We create, apply and leverage technology to enable the participation age. (Did you know that we proclaimed the participation age before Tim O'Reilly published his famous Web 2.0 article?)
    We've seen Niagara changing the rules of processor technology and building the backbone of the web, again, and we've already disclosed some information on Niagara 2. We've seen the Constellation System debut during ISC 2007. You may have noticed that the Sun Ultra 40 Workstation is the best workstation on the planet, and BTW, we're changing the economics of true Video-On-Demand Streaming as well, just to name a few favourite technologies on my list.
    The biggest problem to solve now is: Spreading the word. Let me explain. Whenever I participate in a Sun day (A customer meeting in which Sun people present on new Sun technologies), two effects consistently happen: First, more people than originally planned show up (I once had people join in over a video conference line). Second, the meeting takes much longer than originally anticipated, because customers want to hear so much more about our technologies.
    Since we don't have much money to spend on advertising, sponsoring or other forms of traditional awareness generation, we need to do a lot more of these Sun days, and talk to customers one by one. Is this more difficult and time-consuming? Yes. Does this have a more lasting effect than traditional marketing? You bet. Only by talking to the experts at our customers are we able to verify that what we do is right and make sure our technology meets the people that want/need/develop for/join/use/participate in it. In FY08, I'm going to participate in more Sun days and talk to as many customers about Sun technology as I can.
  • Solaris: This may be a sub-topic of "Technology", but it really is a topic of its own: I use Solaris at home, on my laptop, evangelize it to customers, and it feeds my need as a computer scientist to learn about interesting things every day. In FY08, I'm going to use more new Solaris features at home and at work, write more about it (German readers: Check out this ZFS whitepaper), participate more in the OpenSolaris communities and make sure OpenSolaris gets the attention with developers, customers and partners that it deserves.
All in all, I'm sure FY08 is going to be interesting and fun. FY07 has been the year of technology announcements, FY08 will be the year of seeing them all in action. A year of interesting times.