Thursday Dec 03, 2009

OSDevCon 2009 Paper: Implementing a simple ZFS Auto-Scrub Service with SMF, RBAC, IPS and Visual Panels Integration - Lessons learned

A while ago, I wrote a little tool that helps you keep your ZFS pools clean by automatically running regular scrubs, similar to what the OpenSolaris auto-snapshot service does.

The lessons I learned during development of this service went into an OSDevCon 2009 paper that was presented in September 2009 in Dresden. It is a nice summary of things to keep in mind when developing SMF services of your own and it includes a tutorial on writing a GUI based on the OpenSolaris Visual Panels project.

Check out the Whitepaper here, the slides here, the SMF service here and if you want to take a peek at the Service's Visual Panels Java code, you'll find it here.

Sunday Oct 25, 2009

A Small and Energy-Efficient OpenSolaris Home Server

In an earlier entry, I outlined my most important requirements for an optimal OpenSolaris Home Server. It should:

  1. Run OpenSolaris in order to fully leverage ZFS,
  2. Support ECC memory, so data is protected at all times,
  3. Be power-efficient, to help the environment and control costs,
  4. Use a moderate amount of space and be quiet, for some extra WAF points.

So I went shopping and did some research on possible components. Here's what I came up with:

Choosing a Platform: AMD or Intel?

Disclosure: My wife works for AMD, so I may be slightly biased. But I think the following points are still very valid.

Intel is currently going through a significant change in architecture: The older Core 2 microarchitecture was based on the Front Side Bus (FSB), where the CPU connects to the Northbridge which contains the memory controller and connects to the memory, while also connecting to the Southbridge which connects to I/O.

Now, they are switching to the new Nehalem microarchitecture which has a memory-controller built into the CPU and a scalable I/O bus called Quickpath Interconnect that connects CPUs with other CPUs and/or IO.

Unfortunately, none of these architectures seem to support ECC memory at consumer level pricing. The cheapest Intel-based ECC-motherboard I could find had still more than double the cost of an AMD-based one. Even though the new Intel Core i7 series is based on Nehalem and thus could support ECC memory easily in theory, Intel somehow chose to not expose this feature. In addition, Core i7 CPUs are relatively new and there are not yet any power efficient versions available.

The Intel Atom processor series may be interesting for a home server from a pure power-saving perspective, but again, Atom motherboards don't support ECC and once your workload becomes a little more demanding (like transcoding or some heavier compiling), you'll miss the performance of a more powerful CPU.

AMD on the other hand has a number of attractive points for the home server builder:

  • AMD consumer CPUs use the same microarchitecture than their professional CPUs (currently, it's the K10 design). They only vary by number of cores, cache size, number of HT channels, TDP and frequency, which are all results of the manufacturing process. All other microarchitecture features are the same. When using an AMD consumer CPU, you essentially get a "smaller brother" of their high end CPUs.
  • This means you'll also get a built-in memory-controller that supports ECC.
  • This also means less chips to build a system (no Northbridge needed) and thus lower power-consumption.
  • AMD has been using the HyperTransport Interconnect for quite a while now. This is a fast, scaleable interconnect technology that has been on the market for quite a while so chipsets are widely available, proven and low-cost.

So it was no suprise that even low-cost AMD motherboards at EUR 60 or below are perfectly capable of supporting ECC memory which gives you an important server feature at economic cost.

My platform conclusion: Due to ECC support, low power consumption and good HyperTransport performance at low cost, AMD is an excellent platform for building a home server.

AMD Athlon II X2 240e: A Great Home Server CPU

An AMD Athlon II X2 240e CPU

While I was shopping around for AMD Athlon CPUs and just before I was about to decide for an AMD Athlon II X2 variant, AMD offered me one of their brand new AMD Athlon II X2 240e for testing, provided that I blogged about it. Thank you, AMD!

Introduced in October 20th, this CPU is part of the newest energy-efficient range of consumer CPUs from AMD. It has 2 cores (hence X2), snazzy 2.8 GHz and a 2 MB L2 cache. What's most important: The TDP for this CPU is only 45W, meaning that even under the highest stress, this CPU will never exceed 45W of power consumption. Including the memory controller. As you've guessed already, the "e" in the model number stands for "efficient".

There's an important trade-off to consider for home server CPUs: For instance, the AMD Phenom II series would have been more powerful because it has an additional L3 cache, but their TDP is at 65W minimum. While big caches (both with AMD and Intel) are good for compute-intensive operations and games, they can't help much in a home server context: Home servers spend most of their non-idle time transferring data from A to B (files, videos, music) and a cache doesn't help much here, cause it's just another stop between I/O and CPU to pass by. Transferred data hardly gets re-used.

Instead, for home servers, sacrificing the L3 cache for lower power consumption makes a lot of sense: You pay less for the CPU and you pay less for your power bill without sacrificing too much (if any) server relevant performance.

My CPU conclusion: For home servers, AMD Athlon II "e" series are perfect, because they save power and money and do the job very well. For games you might choose a more powerful Phenom II processor, which delivers better compute power at a slightly higher power bill.

Finding the Right Motherboard

After nailing the Platform and CPU question, I needed a motherboard. This can be a confusing process: For each CPU there are different chipsets, then there are different vendors offering motherboards based on these chipset, and then they offer different variants with different features. What should a good home server motherboard offer?

  • OpenSolaris support: Most motherboards "just work" with OpenSolaris, especially if they've been available for some time and/or use some well-known chipsets. To remove any doubts, consult the OpenSolaris Hardware Compatibility List.
  • ECC: Yes, AMD's ECC support is in the CPU, but just in case, we want to make sure that the motherboard exposes this feature to the user.
  • Integrated graphics: We only need a graphics card for installation, BIOS-settings/updates or debugging. The graphics shouldn't consume much power and doesn't need to deliver a lot of performance. Therefore, integrated graphics is just fine for a home server, and it saves some precious slot space, too.
  • Enough SATA ports: You need two ports for a mirrored root pool, another two for a mirrored data pool and then some more for hot-spare or if you want to spread your data pool across 4 or more disks. Many motherboards come with 6 SATA ports, which is ok.
  • Stability: If you have the choice between different variants, try to go for the "business" or "quality" version. It will likely have slightly better components and be better tuned towards 24/7 operation than the "performance", "gaming" or "overclocker" type boards which try to squeeze out more performance at the expense of durability.

Here's a very useful email thread on the OpenSolaris ZFS-discuss mailing list about CPU and motherboard options, pros and cons and user experiences. In this discussion, F.Wessels recommended the M3A78 series from Asus so I went for the M3A78-CM motherboard, which is their "business class" variant, priced at around 60 Euros and it has 6 SATA and 12(!) USB ports.

My motherboard conclusion: The Asus M3A78-CM motherboard has everything I need for a home server at a very low cost, and it's proven to run OpenSolaris just fine.

The Case: Antec NSK-1380

I won't go into much details about the case. My goal was to find one that can support at least 4 disks while being as compact as possible. The Antec NSK-1380 was the smallest case I could find that supports 4 disks. It comes with a built-in power supply, an extra fan, some features to help with keeping noise down and it looked ok for a PC case.

Miscellaneous Tips&Tricks

While putting everything together, I ran into some smaller issues here and there. Here's what I came up with to solve them:

  • CPU cooler: My CPU being a gift from AMD, it came without a cooler. I chose the Zalman CNPS 7000C AL-CU because it was the cheapest cooler that was less than 6cm high, which is a limit imposed by the case. Unfortunately, it collides with one of the 4 harddisk slots by a mere couple of millimeters, so I need to figure out how to best cut or bend the blades to make room for the disk without hurting the cooler too much. I'm not very familiar with PC CPU coolers, but I suspect that with 45W TDP one could even get away with a passive cooler. The Zalman came with a throttle circuit which I set to the minimum speed. This seems to be more than enough and it makes the system really silent, but I need some more thorough testing to confirm. Drop me a comment if you are familiar with passive cooling of 45W TDP CPUs.
  • Boot disks: We need something to boot our home server from, so boot disks are unavoidable. I say "disks", because they should always be mirrored. But other than providing a way of booting, they tend to get in the way, especially when space is a precious resource in a small case. Some people therefore boot from CF cards or other small flash media. This can be a nice solution, especially combined with an IDE-to-CF adapter, but consumer level flash media is either very slow (a typical OS install can take many hours!) or very expensive. While looking for alternatives, I found a nice solution: The Scythe Slot Rafter fits into an unused PCI slot (taking up the breadth of two) and provides space for mounting four 2.5" disks at just EUR 5. These disks are cheap, good enough and I had an unused one lying around anyway, so that was a perfect solution for me.
  • Extra NIC: The Asus M3A78-CM comes with a Realtek NIC and some people complained about driver issues with OpenSolaris. So I followed the advice on the aforementioned Email thread and bought an Intel NIC which is well supported, just in case.
  • USB boot: I couldn't get the M3A78-CM to boot from USB at all. I tried a USB stick and a USB disk and different boot settings in the BIOS to no avail. I gave up and built in a DVD-ROM drive from my old server just to install OpenSolaris, then built it out again. Jan Brosowski has the same motherboard and he found out that you need to update to the latest BIOS revision and use the USB port that's right below the Ethernet port for USB boot to work. YMMV :).

The Result

And now for the most important part: How much power does the system consume? I did some testing with one boot disk and 4GB of ECC RAM and measured about 45W idle. While stressing CPU cores, RAM and the disk with multiple instances of sysbench, I could not get the system to consume more than 80W. All in all, I'm very pleased with the numbers, which are about half of what my old system used to consume. I didn't do any detailed performance tests yet, but I can say that the system feels very responsive and compile runs just rush along the screen. CPU temperature won't go beyond the low 50Cs on a hot day, despite using the lowest fan speed, so cooling seems to work well, too.

I just started full 24/7 operation of my new home server this weekend, so I hope I'll have some more long-term experience about performance and stability in a few months. Meanwhile, I'm in the middle of configuring the system, installing some services and implementing a new way of managing my home server. But that's probably the topic of another blog post...

Do you agree with the home server conclusions I reached in this post? Or would you suggest alternatives? Do you have experiences to share with the mentioned components? Or do you have suggestions and tips on how to get the most out of them? Let me know by posting a comment here!

Many thanks go to Michael Schmid of AMD for sending me the AMD Athlon II X2 240e CPU.

Saturday Sep 19, 2009

New OpenSolaris ZFS Home Server: Requirements

Old OpenSolaris Home Server with blinkenlights USB drives

A few months ago, I decided it was time for a new home server. The old one (see picture) is now more than 3 years old (the hardware is 2 years older), so it was time to plan ahead for the inevitable hardware failure. Curiously enough, my old server started to refuse working with some of my external USB disks only a few weeks ago, which confirmed my need for a new system. This is the beginning of a series of blog articles around building a new OpenSolaris home server.

Home Server Goals

Let's go over some goals for a home server to help us decide on the hardware. IMHO, a good home server should:

  1. Run OpenSolaris. This means I don't want an appliance because this is too limiting and I'd end up hacking it to make it run something it doesn't do by itself anyway. It should therfore use a real OS.
    It also means I don't want to use Linux, because quite frankly the whole Linux landscape is too unstable, confused and de-focused (don't get me wrong: It's nice for experimentation and as a hobbyist-OS, but I want something more serious to guard my data).
    Windows is out of the question because it delivers too little for too high a price.
    I like BSD (and used to run NetBSD on my Amiga 4000 back then in the mid nineties), but it seems to be more oriented to some (albeit interesting) niches for my taste.
    Right now I prefer OpenSolaris because it's rock-solid, clean, well-documented, well-designed and it has lots of advanced features that other OSes only dream of. Yes, I'd still write and do the same if I weren't a Sun employee.
  2. Leverage ZFS. This should be a no-brainer, but I just wanted to point out that any system that is serious about its data should absolutely run ZFS because of the end-to-end-integrity. Period. And then there are many useful features such as compression, send/receive, snapshots, ease of administration, no fscks and much more. Oh, and I'm looking forward to leveraging encryption and de-duplication at home in the near future, too!
  3. Use ECC Memory: What's the use of having end-to-end data integrity with ZFS if your data is corrupted before ZFS can create it's checksum? That's why you need ECC Memory. Simply put: Use ECC memory and kiss those unexpected, unexplicable system crashes and broken data surprises good bye.
  4. Be Power Efficient: Think 1.5 Euros of electricity bill per Watt per Year for a system running 24/7. The difference between your typical gaming PC and a power-efficient home server can easily be 50W or more when idle, so you're looking at an extra 75 Euros and more of free cash if you just pick your components more carefully. Notice that I'm not saying "Low-Power". There are a lot of compromises when trying to reach absolute low-powerness. Like many optimization problems, squeezing the last few Watts out of your system means investing a lot of money and effort while sacrificing important features. So I want this system to be power-efficient, but without too many sacrifices.
  5. Use a Moderate Amount of Space: While my home server sits in the basement, form doesn't matter. But I may move into a new apartment where networking to the basement is not an option. Then the server needs to be living-room capable and have a decent WAF. Which also brings us to:
  6. Be quiet: A power-efficient server needs less cooling which helps with being quiet. Again, we don't want to stretch the limits of quietness at all costs, but we want to make sure we don't do any obvious mistakes here that sacrifice the living-room capabilities of the system

What's Next

In the next blog entry, we'll discuss a few processor and platform considerations and reveal a cool, yet powerful option that presented itself to me. Meanwhile, feel free to check out other home server resources, such as Simon Breden's blog, Matthias Pfuetzner's blog, Jan Brosowski's Blog (German) or one of the many home server discussions on the zfs-discuss mailing list.

What are your requirements for a good home server? What do you currently use at home to fulfill your home server needs? What would you add to the above list of home server requirements? Feel free to add a comment below!

Thursday Sep 17, 2009

New OpenSolaris ZFS Auto-Scrub Service Helps You Keep Proper Pool Hygiene

A harddisk that is being scrubbed

One of the most important features of ZFS is the ability to detect data corruption through the use of end-to-end checksums. In redundant ZFS pools (pools that are either mirrored or use a variant of RAID-Z), this can be used to fix broken data blocks by using the redundancy of the pool to reconstruct the data. This is often called self-healing.

This mechanism works whenever ZFS accesses any data, because it will always verify the checksum after reading a block of data. Unfortunately, this does not work if you don't regularly look at your data: Bit rot happens and with every broken block that is not checked (and therefore not corrected), the probability increases that even the redundant copy will be affected by bit rot too, resulting in data corruption.

Therefore, zpool(1M) provides the useful scrub sub-command which will systematically go through each data block on the pool and verify its checksum. On redundant pools, it will automatically fix any broken blocks and make sure your data is healthy and clean.

It should now be clear that every system should regularly scrub their pools to take full advantage of the ZFS self-healing feature. But you know how it is: You set up your server and often those little things get overlooked and that cron(1M) job you wanted to set up for regular pool scrubbing fell off your radar etc.

Introducing the ZFS Auto-Scrub SMF Service

Here's a service that is easy to install and configure that will make sure all of your pools will be scrubbed at least once a month. Advanced users can set up individualized schedules per pool with different scrubbing periods. It is implemented as an SMF service which means it can be easily managed using svcadm(1M) and customized using svccfg(1M).

The service borrows heavily from Tim Foster's ZFS Auto-Snapshot Service. This is not just coding laziness, it also helps minimize bugs in common tasks (such as setting up periodic cron jobs) and provides better consistency across multiple similar services. Plus: Why invent the wheel twice?


The ZFS Auto-Scrub service assumes it is running on OpenSolaris. It should run on any recent distribution of OpenSolaris without problems.

More specifically, it uses the -d switch of the GNU variant of date(1) to parse human-readable date values. Make sure that /usr/gnu/bin/date is available (which is the default in OpenSolaris).

Right now, this service does not work on Solaris 10 out of the box (unless you install GNU date in /usr/gnu/bin). A future version of this script will work around this issue to make it easily usable on Solaris 10 systems as well.

Download and Installation

You can download Version 0.5b of the ZFS Auto-Scrub Service here. The included README file explains everything you need to know to make it work:

After unpacking the archive, start the install script as a privileged user:

pfexec ./

The script will copy three SMF method scripts into /lib/svc/method, import three SMF manifests and start a service that creates a new Solaris role for managing the service's privileges while it is running. It also installs the OpenSolaris Visual Panels package and adds a simple GUI to manage this service.

ZFS Auto-Scrub GUI

After installation, you need to activate the service. This can be done easily with:

svcadm enable auto-scrub:monthly

or by running the GUI with:

vp zfs-auto-scrub

This will activate a pre-defined instance of the service that makes sure each of your pools is scrubbed at least once a month.

This is all you need to do to make sure all your pools are regularly scrubbed.

If your pools haven't been scrubbed before or if the time or their last scrub is unknown, the script will proceed and start scrubbing. Keep in mind that scrubbing consumes a significant amount of system resources, so if you feel that a currently running scrub slows your system too much, you can interrupt it by saying:

pfexec zpool scrub -s <pool name>

In this case, don't worry, you can always start a manual scrub at a more suitable time or wait until the service kicks in by itself during the next scheduled scrubbing period.

Should you want to get rid of this service, use:

pfexec ./ -d

The script will then disable any instances of the service, remove the manifests from the SMF repository, delete the scripts from /lib/svc/method, remove the special role and the authorizations the service created and finally remove the GUI. Notice that it will not remove the OpenSolaris Visual Panels package in case you want to use it for other purposes. Should you want to get rid of this as well, you can do so by saying:

pkg uninstall OSOLvpanels

Advanced Use

You can create your own instances of this service for individual pools at specified intervals. Here's an example:

  constant@fridolin:~$ svccfg
  svc:> select auto-scrub
  svc:/system/filesystem/zfs/auto-scrub> add mypool-weekly
  svc:/system/filesystem/zfs/auto-scrub> select mypool-weekly
  svc:/system/filesystem/zfs/auto-scrub:mypool-weekly> addpg zfs application
  svc:/system/filesystem/zfs/auto-scrub:mypool-weekly> setprop zfs/pool-name=mypool
  svc:/system/filesystem/zfs/auto-scrub:mypool-weekly> setprop zfs/interval=days 
  svc:/system/filesystem/zfs/auto-scrub:mypool-weekly> setprop zfs/period=7
  svc:/system/filesystem/zfs/auto-scrub:mypool-weekly> setprop zfs/offset=0
  svc:/system/filesystem/zfs/auto-scrub:mypool-weekly> setprop zfs/verbose=false
  svc:/system/filesystem/zfs/auto-scrub:mypool-weekly> end
  constant@fridolin:~$ svcadm enable auto-scrub:mypool-weekly

This example will create and activate a service instance that makes sure the pool "mypool" is scrubbed once a week.

Check out the zfs-auto-scrub.xml file to learn more about how these properties work.

Implementation Details

Here are some interesting aspects of this service that I came across while writing it:

  • The service comes with its own Solaris role zfsscrub under which the script runs. The role has just the authorizations and profiles necessary to carry out its job, following the Solaris Role-Based Access Control philosophy. It comes with its own SMF service that takes care of creating the role if necessary, then disables itself. This makes a future deployment of this service with pkg(1) easier, which does not allow any scripts to be started during installation, but does allow activation of newly installed SMF services.
  • While zpool(1M) status can show you the last time a pool has been scrubbed, this information is not stored persistently. Every time you reboot or export/import the pool, ZFS loses track of when the last scrub of this pool occurred. This has been filed as CR 6878281. Until that has been resolved, we need to take care of remembering the time of last scrub ourselves. This is done by introducing another SMF service that periodically checks the scrub status, then records the completion date/time of the scrub in a custom ZFS property called in the pool's root filesystem when finished. We call this service whenever a scrub is started and it deactivates itself once it's job is done.
  • As mentioned above, the GUI is based on the OpenSolaris Visual Panels project. Many thanks to the people on its discussion list to help me get going. More about creating a visual panels GUI in a future blog entry.

Lessons learned

It's funny how a very simple task like "Write an SMF service that takes care of regular zpool scrubbing" can develop into a moderately complex thing. It grew into three different services instead of one, each with their own scripts and SMF manifests. It required an extra RBAC role to make it more secure. I ran into some zpool(1M) limitations which I now feel are worthy of RFEs and working around them made the whole thing slightly more complex. Add an install and de-install script and some minor quirks like using GNU date(1) instead of the regular one to have a reliable parser for human-readable date strings, not to mention a GUI and you cover quite a lot of ground even with a service as seemingly simple as this.

But this is what made this project interesting to me: I learned a lot about RBAC and SMF (of course), some new scripting hacks from the existing ZFS Auto-Snapshot service, found a few minor bugs (in the ZFS Auto-Snapshot service) and RFEs, programmed some Java including the use of the NetBeans GUI builder and had some fun with scripting, finding solutions and making sure stuff is more or less cleanly implemented.

I'd like to encourage everyone to write their own SMF services for whatever tools they install or write for themselves. It helps you think your stuff through, make it easy to install and manage, and you get a better feel of how Solaris and its subsystems work. And you can have some fun too. The easiest way to get started is by looking at what others have done. You'll find a lot of SMF scripts in /lib/svc/method and you can extract the manifests of already installed services using svccfg export. Find an SMF service that is similar to the one you want to implement, check out how it works and start adapting it to your needs until your own service is alive and kicking.

If you happen to be in Dresden for OSDevCon 2009, check out my session on "Implementing a simple SMF Service: Lessons learned" where I'll share more of the details behind implementing this service including the Visual Panels part.

Edit (Sep. 21st) Changed the link to CR 6878281 to the externally visible OpenSolaris bug database version, added a link to the session details on OSDevCon.

Edit (Jun. 27th, 2011) As the Mediacast service was decommissioned, I have re-hosted the archive in my new blog and updated the download link. Since vpanels has changed a lot lately, the vpanels integration doesn't work any more, but the SMF service still does.

Monday Jun 15, 2009

OpenSolaris meets Mac OS X in Munich

Last Wednesday, Wolfgang and I had the honor to present at "Mac Treff München", Munich's local Mac User Group. There are quite a few touching points between OpenSolaris and Mac OS X, such as ZFS, DTrace and VirtualBox, we thought it would be a good idea to contact them out of our Munich OpenSolaris User Group and talk a little bit about OpenSolaris.

Breaking the Ice

We were a little bit nervous about what would happen. Do Mac people care about the innards of a different, seemingls non-GUIsh OS? Are they just fanboys or are they open to other people's technologies? Will talking about redundancy, BFU, probes and virtualization bore them to death?

Fortunately, the 30-40 people that attended the event proved to be a very nice, open and tolerant group. They let us talk about OpenSolaris in General including some of the nitty-grittyness of the development process, before we started talking about the features that are more interesting to Mac users. We then talked about ZFS, DTrace and VirtualBox:

ZFS for Mac OS X (or not (yet)?)

Explaining the principles behind ZFS to people who are only used to draging'n'dropping icons, shooting photos or video and using computers to get work done, without having to care about what happens inside, is not easy. We concentrated on getting the basics of the tree structure, copy-on-write, check-summing and using redundancy to self-heal while using real world examples and metaphors to illustrate the principles. Here's the deal: If you have lots of important data (photos, recording, videos, anyone?) and care about it (content creators...), then you need to be concerned about data availability and integrity. ZFS solves that, it's that simple. A little animation in the slides were quite helpful in explaining that, too :).

The bad news is that ZFS seems to have vanished from all of Apple's communication about the upcoming Mac OS X Snow Leopard release. That's really bad, because many developers and end-users were looking forward to take advantage of it.

The good news is that there are still ways to take advantage of ZFS as a Mac User: Run an OpenSolaris file server for archiving your data or using it as a TimeMachine store, or even run a small OpenSolaris ZFS Server inside your Mac through VirtualBox.

DTrace: A Mac Developer/Admin's Heaven, Albeit in Jails

Next, we dove a little bit into DTrace and how it makes the OS really transparent for admins, developers and users. In addition to the dtrace(1) command, Apple created a nice GUI called "Instruments" as part of their XCode development environment that leverages the DTrace infrastructure to collect useful data about your application in realtime.

Alas, as with ZFS, there's another downer, and this time it's more subtle: While you can enjoy the power of DTrace in Mac OS X now, it's still kinda crippled, as Adam Leventhal pointed out: Processes can escape the eyes of DTrace at will, which counters the absolute observability idea of DTrace quite massively. Yes, there are valid reasons for both sides of the debate, but IMHO, legal things should be enforced using legal means, and software should be treated as software, meaning it is not a reliable way of enforcing any license contracts - with or without powerful tools such as DTrace.

OpenSolaris for all: VirtualBox

Finally, a free present to the Mac OS X community: VirtualBox. I still get emails asking me to spend 80+ dollars on some virtualization software for my Mac. There are at least two choices in that price range: VMware Workstation and Parallels. Well, the good news is that you can save your 80 bucks and use VirtualBox instead.

This may not be new to you, since as a reader of my blog you've likely heard of VirtualBox before, but it's always amazing for me to see how slowly these things spread. So, after reading this article, do your Mac friends a favour and tell them they can save precious money buy just downloading VirtualBox instead of spending money on other virtualization solutions for the Mac. It's really that simple.

Indeed, this was the part where the attendees took most of their notes, and asked a lot of questions about (ZFS being a close first in terms of discussion/questions).


After our presentations, a lot of users came up and asked questions about how to install OpenSolaris on their hardware and on VirtualBox. Some even asked where to buy professional services for installing them an OpenSolaris ZFS fileserver in their company. The capabilities of ZFS clearly struck some chords inside the Mac OS X community, which is no wonder: If you have lots of Audio/Video/Photo data and care about quality and availability, then there's no way around FS.

I used this event as an excuse to try out keynote, which worked quite well for me, especially because it helped me create some easy to understand animations about the mechanics of ZFS. I also liked the automatic guides a lot which help you position elements on your slides very easily and seem to guess very well what your layout intentions were. I'd love the OpenOffice folks to check out Keynote's guides and see if they can come up with something similar. So, here's a Keynote version of my "OpenSolaris for Mac Users" slides as well as a PDF version (both in German) for you to check out and re-use if you like.

Update: Wolfgang's introductory slides are now available for download as well and Klaus, the organizer of the event, posted a review in the Mac Treff München Blog with some pictures, too.

Tuesday Apr 21, 2009

Video: Top 5 Cool Features of the Sun Storage 7000 Unified Storage Systems

A couple of weeks ago, Marc (our producer from the HELDENFunk Podcast) and I sat down and put together a video about the top 5 reasons why the new Sun Storage 7000 systems are so cool. We even "invited" Brendan Gregg to show us his latest trick:

For the next video, I'll try to learn more phrases by heart and look less at the prompter screen for a more natural feel. I apologize for my German accent (some people say it adds credibility :) ). Still, people seem to like the video, at least it has been viewed about 200 times already.

There's a lot of discussion around the Sun Storage 7000, most of it is very positive. In Germany, we like to complain a lot so of course we also hear a lot of constructive criticism. Most of the comments I hear fall into one of the two following categories:

  1. The Storage 7000 systems are cool, but I know ZFS/OpenSolaris can do "X" and I really want this to be in the Storage 7000 GUI as well!
    Yes, we know that there are still many features we'd like to see in the Storage 7000 and we're working on making them available. Make sure your Sun contact knows about your wishlist, so she can forward it to our engineers. Please remember that the Storage 7000 systems are meant to be easy-to-use appliances: Taking your "X" feature from ZFS/OpenSolaris and building a GUI around it is a hard thing to do, especially if you want it to work reliably and if you want it to be self-explanatory and self-serviceable. Please be patient, we're most probably working on your favourite features already.

  2. The Storage 7000 systems are cool, but I want more control. I want to change the hardware/hack them/take them apart/add more functionality/get them to do exactly what I want, etc.
    Sure, that feature is called "OpenSolaris". Please go to, download the CD, install it on your favourite hardware and off you go!
    But, can I have the GUI, too, maybe as an SDK of some sort?
    No. The Storage 7000 systems are not "just a GUI". They are full-blown appliances which means that they're more than just the hardware and a GUI. A big part of the ease-of-use, stability, performance and predictability of these products is in the way configuration options are selected, tested and yes, limited, as well as a careful consideration of which features to implement at what time and which not. Only then comes the GUI on top, which is tailored to the overall product as a whole. In other words: You wouldn't go to BMW and ask them to give you their dashboard, radio and the lights so you can bolt them onto a Volkswagen, would you?

You see, either you build your own storage machine out of the building blocks you have, and get all the functionality and flexibility you want at the expense of some configuration effort,
or you buy the car as a whole, nice, round, sweet package, so you don't worry about configuration, implementation details, complexity, etc. Asking for anything in between will get you into trouble: Either you'll spend more effort than you want, or you won't get the kind of control you want.

If you understand German, there's some discussion of this topic as well as a great overview of the MySQL future plus a primer on SSDs in the latest episode of the HELDENFunk podcast.

And if you like the Sun Storage 7000 Unified Storage Systems as much as I do, here are the slides in StarOffice format, as well as in PDF format, so you can tell your colleagues and friends as well.

Wednesday Mar 25, 2009

Think Twice Before Deleting Stuff (Or Better Not at All!)

Some piggy banks

No, this is not going to be another "Remember to do snapshots" post. I'm also not going to talk about backups. Instead, let's look at some very practical aspects of deleting files.

So, why delete a file? "Trivial", you think, "so I can save space!". Sure, dear reader, but at the expense of what?

Let's stop and think for a minute. Our lives try to center around doing cool, worthwhile, meaningful, useful stuff. Deleting files isn't really cool, nor fun, it is a necessity we're forced to do. Don't you hate it when that dreaded "Your startup disk is almost full" message appears while you're in the middle of downloading new photos from your latest exciting vacation trip?

Actually, the seemingly simple act of deleting is really a challenge: "Will I need this again?", "Wouldn't it be better to archive this instead?", "Last time I was really glad I kept that email from 2 years ago, so why delete this one?". Sometimes I surprise myself thinking a long time before I really press that "ok" button or hit "Enter" after the "rm".

The reality is: Storage is cheap, so why delete stuff in the first place?

To put things in perspective, let's try an ROI analysis of deleting files. Let's say we need about 6 seconds of thinking time before we can decide whether a particular file can really be deleted without regret. Let's also assign some value to our time, say $12 per hour (I hope you're getting paid much more than that, but this is just to keep the numbers simple).

Storage is cheap, and last time I checked, a 1 TB USB hard drive cost about $100 at a major electronics retailer, with prices falling by the hour.

Now, how much space does the act of deleting a file need to free up so it justifies the effort of deciding whether to delete or keep it?

Well, our $12 per hour conveniently breaks down to $0.20 per minute, which allows us to perform 10 delete-it-or-not decisions per minute at $0.02 each. Fine. Deleting seems to be cheap, doesn't it?

Now, for that $0.02 you can buy a 1/5000th of a 1 TB hard drive. Wait a minute, 1TB/5000 still amounts to 200 MB of data per $0.02! That's more than you need to store a 10 minute video, or a full CD of music, compressed at high quality! Or 20 presentations at 10MB each! Not to mention countless emails, source code and other files!

So, unless the file you're pondering is bigger than 200MB, it's not really worth even considering to delete it. I'll call this 200MB boundary the "Destructive Utility Heuristic (DUH)".

The result is therefore: Save your time, buy more harddisk space (or upgrade your old hard drive to a bigger one before it dies) and move on. Life's too precious to waste it on deleting stuff. Create good stuff instead! Only think about deleting stuff if the file in question is bigger than 200MB.

I can hear some "Wait, but!"'s in the audience, ok, one at a time:

  • "But I can delete much faster than 6 seconds!"
    No big deal. So you can delete 1 file per second, that's still a threshold of 33MB, more than 5 songs worth or even the biggest practical business presentation or the source code to a major open source project. And harddisks are getting cheaper every day, while your time will become more and more precious as you age. Yes, if you're dead sure that file is useless junk and don't need to think about it, go ahead and delete it, but why did you save it in the first place?

  • "But I like my directories to be clean and tidy!"
    Congratulations, that's a good habit! Keeping files organized doesn't mean you need to delete stuff, though. Set up an "Archive" folder somewhere and dump everything you think you may or may not use again there. Use one archive folder for each year if you want. File search technology is pretty advanced these days so you should be able to find your archived files quicker than the time you'd take to decide which ones you'll never want to find again. Then, you can still decide to delete your whole archive from 3 years ago because you never used it, and it will likely make some sense, because its size may be above the destructive utility heuristic, but chances are you won't really care because storage will have become even cheaper after those 3 years so you won't save a big deal, relatively speaking.

  • "That still doesn't help me when that damn 'Your startup disk is almost full' message comes!"
    You're right. The point is: It's often hard to sift through data and decide what to keep and what not. That's why we dread deleting stuff and instead wait until that message comes. I'm only offering relief to those that felt that the act of having to delete stuff isn't really rewarding, and it isn't (at least while you're below the DUH). Go buy a bigger harddrive for your laptop, it's really the cost effective option. Use the numbers above to help you justify that towards your finance department.

  • "I'm still not convinced. I actually kinda like going through my files and delete them once in a while..."
    Sure, go ahead. Just know that you could use that time to do more productive stuff, such as checking out the Sun cloud, installing OpenSolaris or testing our new Sun OpenStorage products.

  • "Wait, aren't you supposed to write about OpenSolaris, ZFS and this stuff anyway?"
    I'm glad you mentioned that :). Actually, OpenSolaris and ZFS make it even easier for you to both not care about deleting stuff while keeping your files organized at the same time. The amazing ZFS auto snapshot SMF service will create snapshots of your data automagically every 15 minutes, so it won't matter whether you delete files or not. You can then choose to either not delete them at all and just move them to some archive, or you can delete whatever you want, without the 6 seconds of thinking (just to keep stuff tidy), knowing that you'll always be able to recover those files with Time Slider later. You could then use zfs send/receive to dump your data incrementally to a file server as a backup mechanism and the hooks are already there to automate this.

See, once you think of it, there's not really a need to delete files at all any more. At least not for mere mortals like us with file sizes that are typically below the destructive utility heuristic of currently 200MB (and rising...) most of the time. Music has already reached the point where a song can be stored at studio quality with lossless compression at manageable file sizes so that kind of data won't see significant growth any more. And photos and videos will soon follow. This means we'll need to care less and less about restricting personal data storage. Instead, we now need to focus more on managing personal storage.

Now there's a completely different problem that'll keep us entertained for some time...

Monday Mar 02, 2009

The Inner Life of ZFS: Cool ZFS On-Disk Block Structure Movies

Pascal Gienger of Konstanz University published a nifty DTrace script that captures ZFS' on-disk block activity and published it on his Southbrain blog.

The cool thing: He animated the data. That's right. Using a Perl script, he draws greener or redder dots depending on whether a particular range of blocks on disk sees more reads or writes. By aggregating data over many hours while doing interesting tasks such as backup, he created a series of very cool animations.

In his first post, he shows us the inner life of a Postfix mail queue as an animated GIF:

ZFS on-disk block animation

Then, he compared the write patterns of UFS vs. ZFS using a MySQL workload to produce a cool MPEG-4 movie.

In his latest ZFS animation work, he shows us 18 hours of a mirrored file server including some backup, night rest and user action (Download MPEG-4 Movie here).

Congratulations, Pascal, this is way cool stuff. You really should upload these to YouTube so people can embed them in their blogs :).

Update: Meanwhile, pascal told me that he uploaded his videos on YouTube already. He has a full playlist full of them. Enjoy!

Wednesday Aug 13, 2008

ZFS Replicator Script, New Edition

Many crates on a bicycle. A metaphor for ZFS snapshot replicationAbout a year ago, I blogged about a useful script that handles recursive replication of ZFS snapshots across pools. It helped me migrate my pool from a messy configuration into the clean two-mirrored-pairs configuration I have now.

Meanwhile, the fine guys at the ZFS developer team introduced recursive send/receive into the ZFS command, which makes most of what the script does a simple -F flag to the zfs(1M).

Unfortunately, this new version of the ZFS command has not (yet?) been ported back to Solaris 10, so my ZFS snapshot replication script is still useful for Solaris 10 users, such as Mike Hallock from the School of Chemical Sciences at the University of Illinois at Urbana-Champaign (UIUC). He wrote:

Your script came very close to exactly what I needed, so I took it upon myself to make changes, and thought in the spirit of it all, to share those changes with you.

The first change he in introduced was the ability to supply a pattern (via -p) that selects some of the potentially many snapshots that one wants to replicate. He's a user of Tim Foster's excellent automatic ZFS snapshot service like myself and wanted to base his migration solely on the daily snapshots, not any other ones.

Then, Mike wanted to migrate across two different hosts on a network, so he introduced the -r option that allows the user to specify a target host. This option simply pipes the replication data stream through ssh at the right places, making ZFS filesystem migration across any distance very easy.

The updated version including both of the new features is available as zfs-replicate_v0.7.tar.bz2. I didn't test this new version but the changes look very good to me. Still: Use at your own risk.

Thanks a lot, Mike! 

Tuesday Aug 12, 2008

ZFS saved my data. Right now.

Kid with a swimming ring.As you know, I have a server at home that I use for storing all my photos, music, backups and more using the Solaris ZFS filesystem. You could say that I store my life on my server.

For storage, I use Western Digital's MyBook Essential Edition USB drives because they are the cheapest ones I could find from a well-known brand. The packaging says "Put your life on it!". How fitting.

Last week, I had a team meeting and a colleague introduced us to some performance tuning techiques. When we started playing with iostat(1M), I logged into my server to do some stress tests. That was when my server said something like this:

constant@condorito:~$ zpool status

(data from other pools omitted)

  pool: santiago
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
 scrub: scrub completed after 16h28m with 0 errors on Fri Aug  8 11:19:37 2008

	santiago     DEGRADED     0     0     0
	  mirror     DEGRADED     0     0     0
	    c10t0d0  DEGRADED     0     0   135  too many errors
	    c9t0d0   DEGRADED     0     0    20  too many errors
	  mirror     ONLINE       0     0     0
	    c8t0d0   ONLINE       0     0     0
	    c7t0d0   ONLINE       0     0     0

errors: No known data errors

This tells us 3 important things:

  • Two of my disks (c10t0d0 and c9t0d0) are happily giving me garbage back instead of my data. Without knowing it.
    Thanks to ZFS' checksumming, we can detect this, even though the drive thinks everything is ok.
    No other storage device, RAID array, NAS or file system I know of can do this. Not even the increasingly hyped (and admittedly cool-looking) Drobo [1].
  • Because both drives are configured as a mirror, bad data from one device can be corrected by reading good data from the other device. This is the "applications are unaffected" and "no known data errors" part.
    Again, it's the checksums that enable ZFS to distinguish good data blocks from bad ones, and therefore enabling self-healing while the system is reading stuff from disk.
    As a result, even though both disks are not functioning properly, my data is still safe, because (luckily, albeit with millions of blocks per disk, statistics is on my side here) the erroneous blocks don't overlap in terms of what pieces of data they store.
    Again, no other storage technology can do this. RAID arrays only kick in when the disk drives as a whole are unacessible or when a drive  diagnoses itself to be broken. They do nothing against silent data corruption, which is what we see here and what all people on this planet that don't use ZFS (yet) can't see (yet). Until it's too late.
  • Data hygiene is a good thing. Do a "zpool scrub <poolname>" once in a while. Use cron(1M) to do this, for example every other week for all pools.

Over the weekend, I ordered myself a new disk (sheesh, they dropped EUR 5 in price already after just 5 days...) and after a "zpool replace santiago c10t0d0 c11t0d0" on monday, my pool started resilvering:

constant@condorito:~$ zpool status

(data from other pools omitted)

  pool: santiago
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
 scrub: resilver in progress for 1h13m, 6.23% done, 18h23m to go

        NAME           STATE     READ WRITE CKSUM
        santiago       DEGRADED     0     0     0
          mirror       DEGRADED     0     0     0
            replacing  DEGRADED     0     0     0
              c10t0d0  DEGRADED     0     0   135  too many errors
              c11t0d0  ONLINE       0     0     0
            c9t0d0     DEGRADED     0     0    20  too many errors
          mirror       ONLINE       0     0     0
            c8t0d0     ONLINE       0     0     0
            c7t0d0     ONLINE       0     0     0

errors: No known data errors

The next step for me is to send the c10t0d0 drive back and ask for a replacement under warranty (it's only a couple of months old). After receiving c10's replacement, I'll consider sending in c9 for replacement (depending on how the next scrub goes).

Which makes me wonder: How will drive manufacturers react to a new wave of warranty cases based on drive errors that were not easily detectable before?

[1] To the guys at Drobo: Of course you're invited to implement ZFS into the next revision of your products. It's open source. In fact, Drobo and ZFS would make a perfect team!


Tune in and find out useful stuff about Sun Solaris, CPU and System Technology, Web 2.0 - and have a little fun, too!


« April 2014