Monday Jul 19, 2010

A Logzilla for your ZFS box

A key component of the ZFS Hybrid Storage Pool is Logzilla, a very fast device to accelerate synchronous writes. This component hides the write latency of disks to enable the use of economical, high-capacity drives. In the Sun Storage 7000 series, we use some very fast SAS and SATA SSDs from STEC as our Logzilla; the devices are great and STEC continues to be a terrific partner. The most important attribute of a good Logzilla device is that it have very low latency for sequential, uncached writes. The STEC part gives us about 100μs latency for a 4KB write — much much lower than most SSDs. Using SAS-attached SSDs rather than the more traditional PCI-attached, non-volatile DRAM enables a much simpler and more reliable clustering solution since the intent-log devices are accessible to both nodes in the cluster, but SAS is much slower than PCIe...

DDRdrive X1

Christopher George, CTO of DDRdrive was kind enough to provide me with a sample of the X1, a 4GB NV-DRAM card with flash as a backing store. The card contains 4 DIMM slots populated with 1GB DIMMs; it's a full-height card which limits its use in Sun/Oracle systems (typically half-height only), but there are many systems that can accommodate the card. The X1 employs a novel backup power solution; our Logzilla used in the 7000 series protects its DRAM write cache with a large super-capacitor, and many NV-DRAM cards use a battery. Supercaps can be limiting because of their physical size, and batteries have a host of problems including leaking and exploding. Instead, the DDRdrive solution puts a DC power connector on the PCIe faceplate and relies on an external source of backup power (a UPS for example).

Performance

I put the DDRdrive X1 in our fastest prototype system to see how it performed. A 4K write takes about 51μs — better than our SAS Logzilla — but the SSD outperformed the X1 at transfer sizes over 32KB. The performance results on the X1 are already quite impressive, and since I ran those tests the firmware and driver have undergone several revisions to improve performance even more.
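
For the curious, the kind of measurement involved is straightforward. Below is a minimal sketch of timing 4KB synchronous writes from user-land; the device path is a placeholder and this isn't the actual test harness we used, but it illustrates the idea:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

/*
 * A rough latency sketch: time 1,000 4KB synchronous writes to a scratch
 * device. The device path is a placeholder -- point it at something you
 * don't mind scribbling on.
 */
int
main(void)
{
	const char *dev = "/dev/rdsk/cXtYdZs0";	/* placeholder */
	char buf[4096];
	hrtime_t start, elapsed;
	int i, fd;

	if ((fd = open(dev, O_WRONLY | O_DSYNC)) == -1) {
		perror("open");
		return (1);
	}
	(void) memset(buf, 0xab, sizeof (buf));
	start = gethrtime();
	for (i = 0; i < 1000; i++) {
		if (pwrite(fd, buf, sizeof (buf), 0) != sizeof (buf)) {
			perror("pwrite");
			return (1);
		}
	}
	elapsed = gethrtime() - start;
	(void) printf("average write latency: %lld ns\n", elapsed / 1000);
	(void) close(fd);
	return (0);
}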

As a Logzilla

While the 7000 series won't be employing the X1, for uses of ZFS that don't involve clustering and for which external backup power is an option, the X1 is a great and economical Logzilla accelerator. Many users of ZFS have already started hunting for accelerators, and have tested out a wide array of SSDs. The X1 is a far more targeted solution, and is a compelling option. And if write performance has been a limiting factor in deploying ZFS, the X1 is a good reason to give ZFS another look.

Monday Feb 23, 2009

HSP talk at the OpenSolaris Storage Summit

The organizers of the OpenSolaris Storage Summit asked me to give a presentation about Hybrid Storage Pools and ZFS. You can download the presentation, titled ZFS, Cache, and Flash. In it, I talk about flash as a new caching tier in the storage hierarchy, some of the innovations in ZFS to enable the HSP, and an aside into how we implement an HSP in the Sun Storage 7410.

Tuesday Jun 10, 2008

Flash, Hybrid Pools, and Future Storage

Jonathan had a terrific post yesterday that does an excellent job of presenting Sun's strategy for flash for the next few years. With my colleagues at Fishworks, an advanced product development team, I've spent more than a year working with flash and figuring out ways to integrate flash into ZFS, the storage hierarchy, and our future storage products — a fact to which John Fowler, EVP of storage, alluded recently. Flash opens surprising new vistas; it's exciting to see Sun leading in this field, and it's frankly exciting to be part of it.

Jonathan's post sketches out some of the basic ideas on how we're going to be integrating flash into ZFS to create what we call hybrid storage pools that combine flash with conventional (cheap) disks to create an aggregate that's cost-effective, power-efficient, and high-performing by capitalizing on the strengths of the component technologies (not unlike a hybrid car). We presented some early results at IDF, which have already been getting a bit of buzz. Next month I have an article in Communications of the ACM that provides many more details on what exactly a hybrid pool is and how exactly it works. I've pulled out some excerpts from that article and included them below as a teaser, and will be sure to post an update when the full article is available in print and online.

While its prospects are tantalizing, the challenge is to find uses for flash that strike the right balance of cost and performance. Flash should be viewed not as a replacement for existing storage, but rather as a means to enhance it. Conventional storage systems mix dynamic memory (DRAM) and hard drives; flash is interesting because it falls in a sweet spot between those two components for both cost and performance in that flash is significantly cheaper and denser than DRAM and also significantly faster than disk. Flash accordingly can augment the system to form a new tier in the storage hierarchy – perhaps the most significant new tier since the introduction of the disk drive with RAMAC in 1956.

...

A brute force solution to improve latency is to simply spin the platters faster to reduce rotational latency, using 15k RPM drives rather than 10k RPM or 7,200 RPM drives. This will improve both read and write latency, but only by a factor of two or so. ...

...

ZFS provides for the use of a separate intent-log device, a slog in ZFS jargon, to which synchronous writes can be quickly written and acknowledged to the client before the data is written to the storage pool. The slog is used only for small transactions while large transactions use the main storage pool – it's tough to beat the raw throughput of large numbers of disks. The flash-based log device would be ideally suited for a ZFS slog. ... Using such a device with ZFS in a test system, latencies measured in the range of 80-100µs, which approaches the performance of NVRAM while having many other benefits. ...
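
To be clear about what's being accelerated: from the application's point of view a synchronous write is nothing exotic. A minimal sketch follows (the path is a placeholder); with a slog, the fsync(2) can return as soon as the intent-log record is committed to the fast log device, and the data makes its way to the main pool later:

#include <stddef.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * A synchronous write from the application's perspective: write(2)
 * followed by fsync(2) (or a write through an O_DSYNC descriptor).
 * With a slog, fsync() can return once the intent-log record is on the
 * fast log device; the data reaches the main pool later. The path is a
 * placeholder.
 */
int
log_append(const char *buf, size_t len)
{
	int fd = open("/tank/fs/journal", O_WRONLY | O_APPEND | O_CREAT, 0644);

	if (fd == -1)
		return (-1);
	if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
		(void) close(fd);
		return (-1);
	}
	return (close(fd));
}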

...

By combining the use of flash as an intent-log to reduce write latency with flash as a cache to reduce read latency, we can create a system that performs far better and consumes less power than other systems of similar cost. It's now possible to construct systems with a precise mix of write-optimized flash, flash for caching, DRAM, and cheap disks designed specifically to achieve the right balance of cost and performance for any given workload, with data automatically handled by the appropriate level of the hierarchy. ... Most generally, this new flash tier can be thought of as a radical form of hierarchical storage management (HSM) without the need for explicit management.

Updated July 1: I've posted the link to the article in my subsequent blog post.

Saturday Jun 07, 2008

Apple updates DTrace

Back in January, I posted about a problem with Apple's port of DTrace to Mac OS X. The heart of the issue is that their port would silently drop data such that certain experiments would be quietly invalid. Unfortunately, most reactions seized on a headline paraphrasing a line of the post — albeit with the critical negation omitted (the subject and language were, perhaps, too baroque to expect the press to read every excruciating word). The good news is that Apple has (quietly) fixed the problem in Mac OS X 10.5.3.

One issue was that timer-based probes wouldn't fire if certain applications were actively executing (e.g. iTunes). This was evident both by counting periodic probe firings, and by the absence of certain applications when profiling. Apple chose to solve this problem by allowing the probes to fire while denying any inspection of untraceable processes (and generating a verbose error in that case). This script, which should count 1,000 firings per second per virtual CPU, gave sporadic results on earlier revisions of Mac OS X 10.5:

profile-1000
{
	@ = count();
}

tick-1s
{
	printa(@);
	clear(@);
}

On 10.5.3, the output is exactly what one would expect on a 2-core CPU (1,000 executions per core):

  1  22697                         :tick-1s 
             2000

  1  22697                         :tick-1s 
             2000

On previous revisions, profiling to see what applications were spending the most time on CPU would silently omit certain applications. Now, while we can't actually peer into those apps, we can infer the presence of stealthy apps when we encounter an error:

profile-199
{
	@[execname] = count();
}

ERROR
{
	@["=stealth app="] = count();
}

Running this DTrace script will generate a lot of errors as we try to evaluate the execname variable for secret applications, but at the end we'll end up with a table like this:

  Adium                                                             1
  GrowlHelperApp                                                    1
  iCal                                                              1
  kdcmond                                                           1
  loginwindow                                                       1
  Mail                                                              2
  Activity Monito                                                   3
  ntpd                                                              3
  pmTool                                                            6
  mlb-nexdef-auto                                                  12
  Terminal                                                         14
  =stealth app=                                                    29
  WindowServer                                                     34
  kernel_task                                                     307
  Safari                                                          571

A big thank you to Apple for making progress on this issue; the situation is now much improved and considerably more palatable. That said, there are a couple of problems. The first is squarely the fault of team DTrace: we should probably have a mode where errors aren't printed, particularly if the script is already handling them explicitly using an ERROR probe as in the script above. For the Apple folks: I'd argue that revealing the name of otherwise untraceable processes is no more transparent than what Activity Monitor provides — could I have that please? Also, I'm not sure if this has always been true, but the ustack() action doesn't seem to work from profile probes, so simple profiling scripts like this one produce a bunch of errors and no output:

profile-199
/execname == "Safari"/
{
	@[ustack()] = count();
}

But to reiterate: thank you thank you thank you, Steve, James, Tom, and the rest of the DTrace folks at Apple. It's great to see these issues being addressed. The whole DTrace community appreciates it.

Sunday May 04, 2008

dtrace.conf post-post-mortem

This originally was going to be a post-mortem on dtrace.conf, but so much time has passed that I doubt it qualifies anymore. Back in March, we held the first ever DTrace (un)conference, and I hope I speak for all involved when I declare it a terrific success. And our t-shirts (logo pictured) were, frankly, bomb. Here are some fairly random impressions from the day:

Notes on the demographics at dtrace.conf: Macs were the most prevalent laptops by quite a wide margin, and a ton of demos were done under VMware for the Mac. There were a handful of Dvorak users, who far outnumbered the Esperanto speakers (there were none) despite apparently similar rationales. There were, by a wide margin, more live demonstrations than I'd ever seen during a day of technical talks; there were probably fewer individual slides than demos -- exactly what we had in mind.

My favorite session brought the authors of the three DTrace ports to the front of the room to talk about porting and answer questions (mostly from the DTrace team). I was excited that they agreed to work together on a wiki and on a DTrace porting project. Both would be great for new ports and for integrating all the existing ports into a single repository. I just have to see if I can get them to follow through now, several weeks removed from the DTrace love-in...

Also particularly interesting were a demonstration of a DTrace-enabled Adobe Air prototype and the very clever mechanism behind the Java group's plan for native Java static probes (JSDT). Essentially, they're using the same technique as normal USDT, but dynamically generating the tracing description structures and sending them down to the kernel (slick).
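
For comparison, here's roughly what the native USDT flow looks like from C; the provider and probe names below are made up for illustration. JSDT builds the equivalent probe description at runtime rather than at build time:

/*
 * Illustrative USDT example; "myapp" and "request-start" are made-up
 * names. The probe is declared in a separate provider file, myapp.d:
 *
 *	provider myapp {
 *		probe request__start(char *url);
 *	};
 *
 * dtrace -h generates myapp.h from the provider file, and dtrace -G
 * records the probe sites in an object that gets linked into the
 * application.
 */
#include "myapp.h"

void
handle_request(char *url)
{
	if (MYAPP_REQUEST_START_ENABLED())
		MYAPP_REQUEST_START(url);

	/* ... service the request ... */
}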

The most interesting discussion resulted from Keith's presentation of vprobes -- a DTrace... um... inspired facility in VMware. While it is necessary to place a unified tracing mechanism at the lowest level of software abstraction (in DTrace's case, the kernel), it may also make sense to embed collaborating tracing frameworks at other levels of the stack. For example, the JVM could include a micro-DTrace which communicated with DTrace in the kernel as needed. This would both improve enabled performance (not a primary focus of DTrace), and allow for better domain-specific instrumentation and expression. I'll be interested to see how vprobes executes on this idea.

Requests from the DTrace community:

  • more providers ala the recent nfs and proposed ip providers
  • consistency between providers (kudos to those sending their providers to the DTrace discussion list for review)
  • better compatibility with the ports -- several people observed that while they love the port to Leopard, Apple's spurious exclusion of the -G option created tricky conflicts

Ben was kind enough to video the entire day. We should have the footage publicly available in about a week. Thanks to all who participated; several recent projects have already gotten me excited for dtrace.conf(09).

Thursday Apr 10, 2008

DTrace and JavaOne: The End of the Beginning

It was a good run, but Jarod and I didn't make the cut for JavaOne this year...

2005

In 2005, Jarod came up with what he described as a jacked up way to use DTrace to get inside Java. This became the basis of the Java provider (first dvm for the 1.4.2 and 1.5 JVMs and now the hotspot provider for Java 6). That year, I got to stand up on stage at the keynote with John Loiacono and present DTrace for Java for the first time (to 10,000 people -- I was nervous). John was then the EVP of software at Sun. Shortly after that, he parlayed our keynote success into a sweet gig at Adobe (I was considered for the job, but ultimately rejected, they said, because their door frames couldn't accommodate my fro -- legal action is pending).

That year we also started the DTrace challenge. The premise was that if we chained up Jarod in the exhibition hall, developers could bring him their applications and he could use DTrace to find a performance win -- or he'd fork over a free iPod. In three years Jarod has given out one iPod and that one deserves a Bondsian asterisk.

After the excitement of the keynote, and the frenetic pace of the exhibition hall (and a haircut), Jarod and I anticipated at least fair interest in our talk, but we expected the numbers to be down a bit because we were presenting in the afternoon on the last day of the conference. We got to the room 15 minutes early to set up, skirting what we thought must have been the line for lunch, or free beer, or something, but turned out to be the line for our talk. Damn. It turns out that in addition to the 1,000 in the room, there was an overflow room with another 500-1,000 people. That first DTrace for Java talk had only the most basic features like tracing method entry and return, memory allocation, and Java stack backtraces -- but we already knew we were off to a good start.

2006

No keynote, but the DTrace challenge was on again and our talk reprised its primo slot on the last day of the conference after lunch (yes, that's sarcasm). That year the Java group took the step of including DTrace support in the JVM itself. It was also possible to dynamically turn instrumentation of the JVM off and on as opposed to the start-time option of the year before. In addition to our talk, there was a DTrace hands-on lab that was quite popular and got people some DTrace experience after watching what it can do in the hands of someone like Jarod.

2007

The DTrace talk in 2007 (again, last day of the conference after lunch) was actually one of my favorite demos I've given because I had never seen the technology we were presenting before. Shortly before JavaOne started, Lev Serebryakov from the Java group had built a way of embedding static probes in a Java program. While this isn't required to trace Java code, it does mean that developers can expose the higher level semantics of their programs to users and developers through DTrace. Jarod hacked up an example in his hotel room about 20 minutes before we presented, and amazingly it all went off without a hitch. How money is that?

JSDT -- as the Java Statically Defined Tracing is called -- is in development for the next version of the JVM, and is the next step for DTrace support of dynamic languages. Java was the first dynamic language we considered for use with DTrace, and it's quite a tough environment to support due to the incredible sophistication of the JVM. That support has led the way for other dynamic languages such as Ruby, Perl, and Python, which all now have built-in DTrace providers.

2008

For DTrace and Java, this is not the end. It is not even the beginning of the end. Jarod and I are out, but Jon, Simon, Angelo, Raghavan, Amit, and others are in. At JavaOne 2008 next month there will be a talk, a BOF, and a hands-on lab about DTrace for Java, and it's not even all Java: there's some PHP and JavaScript mixed in, and both also have their own DTrace providers. I've enjoyed speaking at JavaOne these past three years, and while it's good to pass the torch, I'll miss doing it again this year. If I have the time and can get past security, I'll try to sneak into Jon and Simon's talk -- though it will be a departure from tradition for a DTrace talk to fall on a day other than the last.

Monday Apr 07, 2008

Expand-O-Matic RAID-Z

I was having a conversation with an OpenBSD user and developer the other day, and he mentioned some ongoing work in the community to consolidate support for RAID controllers. The problem, he was saying, was that each controller had a different administrative model and utility -- but all I could think was that the real problem was the presence of a RAID controller in the first place! As far as I'm concerned, ZFS and RAID-Z have obviated the need for hardware RAID controllers.

ZFS users seem to love RAID-Z, but a frustratingly frequent request is to be able to expand the width of a RAID-Z stripe. While the ZFS community may care about solving this problem, it's not the highest priority for Sun's customers and, therefore, for the ZFS team. It's common for a home user to want to increase his total storage capacity by a disk or two at a time, but enterprise customers typically want to grow by multiple terabytes at once so adding on a new RAID-Z stripe isn't an issue. When the request has come up on the ZFS discussion list, we have, perhaps unhelpfully, pointed out that the code is all open source and ready for that contribution. Partly, it's because we don't have time to do it ourselves, but also because it's a tricky problem and we weren't sure how to solve it.

Jeff Bonwick did a great job explaining how RAID-Z works, so I won't go into it too much here, but the structure of RAID-Z makes it a bit trickier to expand than other RAID implementations. On a typical RAID with N+M disks, N data sectors will be written with M parity sectors. Those N data sectors may contain unrelated data, so modifying the data on just one disk involves reading the old data off that disk and updating both that data and the parity. Expanding a RAID stripe in such a scheme is as simple as adding a new disk and updating the parity (if necessary). With RAID-Z, blocks are never rewritten in place, and there may be multiple logical RAID stripes (and multiple parity sectors) in a given row; we therefore can't expand the stripe nearly as easily.
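
To make the contrast concrete, here's a minimal sketch of the single-parity read-modify-write a traditional array performs when one data sector changes (generic RAID-5 arithmetic, not ZFS code):

#include <stdint.h>
#include <stddef.h>

/*
 * Generic single-parity update for a traditional RAID stripe: the new
 * parity is the old parity XOR the old data XOR the new data, which is
 * why changing one disk's data means reading and rewriting both the
 * data sector and the parity sector.
 */
static void
parity_update(uint8_t *parity, const uint8_t *old_data,
    const uint8_t *new_data, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		parity[i] ^= old_data[i] ^ new_data[i];
}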

A couple of weeks ago, I had lunch with Matt Ahrens to come up with a mechanism for expanding RAID-Z stripes -- we were both tired of having to deflect reasonable requests from users -- and, lo and behold, we figured out a viable technique that shouldn't be very tricky to implement. While Sun still has no plans to allocate resources to the problem, this roadmap should lend credence to the suggestion that someone in the community might work on the problem.

The rest of this post will discuss the implementation of expandable RAID-Z; it's not intended for casual users of ZFS, and there are no alchemic secrets buried in the details. It would probably be useful to familiarize yourself with the basic structure of ZFS, space maps (totally cool by the way), and the code for RAID-Z.

Dynamic Geometry

ZFS uses vdevs -- virtual devices -- to store data. A vdev may correspond to a disk or a file, or it may be an aggregate such as a mirror or RAID-Z. Currently the RAID-Z vdev determines the stripe width from the number of child vdevs. To allow for RAID-Z expansion, the geometry would need to be a more dynamic property. The storage pool code that uses the vdev would need to determine the geometry for the current block and then pass that as a parameter to the various vdev functions.
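
To make the idea of geometry-as-a-parameter concrete, here's a purely hypothetical sketch; these names and structures are my own illustration, not the existing vdev code:

#include <stdint.h>

/*
 * Hypothetical illustration only -- not the actual ZFS vdev interface.
 * The idea is that stripe geometry becomes per-block information handed
 * to the RAID-Z I/O routines rather than a property fixed at vdev
 * creation time.
 */
typedef struct raidz_geom {
	uint64_t rg_ncols;	/* total columns (data + parity) at write time */
	uint64_t rg_nparity;	/* parity columns */
} raidz_geom_t;

/*
 * The storage pool code would look up the geometry for the block being
 * read or written (from GRID bits or a time-indexed record, as described
 * below) and pass it along, e.g.:
 *
 *	raidz_geom_t rg;
 *	raidz_geom_for_block(bp, &rg);		(hypothetical lookup)
 *	vdev_raidz_io_start_geom(zio, &rg);	(hypothetical vdev entry)
 */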

There are two ways to record the geometry. The simplest is to use the GRID bits (an 8 bit field) in the DVA (Device Virtual Address) which have already been set aside, but are currently unused. In this case, the vdev would need to have a new callback to set the contents of the GRID bits, and then a parameter to several of its other functions to pass in the GRID bits to indicate the geometry of the vdev when the block was written. An alternative approach suggested by Jeff and Bill Moore is something they call time-dependent geometry. The basic idea is that we store a record each time the geometry of a vdev is modified and then use the creation time for a block to infer the geometry to pass to the vdev. This has the advantage of conserving precious bits in the fixed-width DVA (though at 128 bits it's still quite big), but it is a bit more complex since it would require essentially new metadata hanging off each RAID-Z vdev.

Metaslab Folding

When the user requests that a RAID-Z vdev be expanded (via an existing or new zpool(1M) command-line option), we'll apply a new fold operation to the space map for each metaslab. This transformation will take into account the space we're about to add with the new devices. Each range [a, b] under a fold from width n to width m will become

[ m * (a / n) + (a % n), m * (b / n) + (b % n) ]

The alternative would have been to account for m - n free blocks at the end of every stripe, but that would have been overly onerous both in terms of processing and in terms of bookkeeping. For space maps that are resident, we can simply perform the operation on the AVL tree by iterating over each node and applying the necessary transformation. For space maps which aren't in core, we can do something rather clever: by taking advantage of the log structure, we can simply append a new type of space map entry that indicates that this operation should be applied. Today we have allocated, free, and debug; this would add fold as an additional operation. We'd apply that fold operation to each of the 200 or so space maps for the given vdev. Alternatively, using the idea of time-dependent geometry above, we could simply append a marker to the space map and access the geometry from that repository.
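
As a sanity check on the arithmetic, here's a tiny stand-alone sketch (plain C, not ZFS code) applying the fold to a single offset, with a worked example:

#include <stdint.h>
#include <assert.h>

/*
 * Fold an offset from a stripe of width n to a stripe of width m: the
 * row (off / n) and the column (off % n) stay the same; only the row
 * pitch changes from n to m.
 */
static uint64_t
fold_offset(uint64_t off, uint64_t n, uint64_t m)
{
	return (m * (off / n) + (off % n));
}

int
main(void)
{
	/* Widening from 4 columns to 5: offset 9 is row 2, column 1, */
	/* which lands at 5 * 2 + 1 = 11 in the new geometry. */
	assert(fold_offset(9, 4, 5) == 11);
	return (0);
}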

Normally, we only rewrite the space map if the on-disk log structure is twice as large as necessary. I'd argue that the fold operation should always trigger a rewrite since processing it always requires an O(n) operation, but that's really an ancillary point.

vdev Update

At the same time as the previous operation, the vdev metadata will need to be updated to reflect the additional device. This is mostly just bookkeeping, and a matter of chasing down the relevant code paths to modify and augment.

Scrub

With the steps above, we're actually done for some definition of done, since new data will be written in stripes that include the newly added device. The problem is that extant data will still be stored in the old geometry, and most of the capacity of the new device will be inaccessible. The solution is to scrub the data: read every block and rewrite it to a new location. Currently this isn't possible with ZFS, but Matt and Mark Maybee have been working on something they call block pointer rewrite, which is needed to solve a variety of other problems and nicely completes this solution as well.

That's It

After Matt and I had finished thinking this through, I think we were both pleased by the relative simplicity of the solution. That's not to say that implementing it is going to be easy -- there are still plenty of gaps to fill in -- but the basic algorithm is sound. A nice property that falls out is that in addition to changing the number of data disks, it would also be possible to use the same mechanism to add an additional parity disk to go from single- to double-parity RAID-Z -- another common request.

So I can now extend a slightly more welcoming invitation to the ZFS community to engage on this problem and contribute in a very concrete way. I've posted some diffs which I used to sketch out some ideas; that might be a useful place to start. If anyone would like to create a project on OpenSolaris.org to host any ongoing work, I'd be happy to help set that up.

Sunday Aug 05, 2007

What-If Machine: DTrace Port


What if there were a port of DTrace to Linux?

What if there were a port of DTrace to Linux: could such a thing be done without violating either the GPL or CDDL? Read on before you jump right to the comments section to add your two cents.

In my last post, I discussed an attempt to create a DTrace knockoff in Linux, and suggested that a port might be possible. Naively, I hoped that comments would examine the heart of my argument, bemoan the apparent NIH in the Linux knockoff, regret the misappropriation of slideware, and maybe discuss some technical details -- anything but dwell on licensing issues.

For this post, I welcome the debate. Open source licenses are important, and the choice can have a profound impact on the success of the software and the community. But conversations comparing the excruciating minutia of one license and another are exhausting, and usually become pointless in a hurry. Having a concrete subject might lead to a productive conversation.

DTrace Port Details

Just for the sake of discussion, let's say that Google decides to port DTrace to Linux (everyone loves Google, right?). This isn't so far-fetched: Google uses Linux internally; maybe they're using SystemTap, maybe they're not happy with it, but they definitely (probably) care about dynamic tracing (just like all good system administrators and developers should). So suppose some engineers at Google take the following (purely hypothetical) steps:

Kernel Hooks

DTrace has a little bit of functionality that lives in the core kernel. The code to deal with invalid memory accesses, some glue between the kernel's dynamic linker and some of the DTrace instrumentation providers, and some simple, low-level routines cover the bulk of it. My guess is there are about 1500 lines of code all told: not trivial, but hardly insurmountable. Google implements these facilities in a manner designed to allow the results to be licensed under the GPL. For example, I think it would be sufficient for someone to draft a specification and for someone else to implement it so long as the person implementing it hadn't seen the CDDL version. Google then posts the patch publicly.

DTrace Kernel Modules

The other DTrace kernel components are divided into several loadable kernel modules. There's the main DTrace module and then the instrumentation provider modules that connect to the core framework through an internal interface. These constitute the vast majority of the in-kernel DTrace code. Google modifies these to use slightly different interfaces (e.g. mutex_enter() becomes mutex_lock()); the final result is a collection of kernel modules still licensed under the CDDL. Of course, Google posts any modifications to CDDL files.
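
As an illustration of the scale of such changes (a hypothetical sketch, not code from any actual port), the module source might be adapted with a thin compatibility layer rather than rewritten:

/*
 * A hypothetical compatibility shim -- illustrative only -- mapping the
 * Solaris mutex interfaces the modules expect onto their Linux kernel
 * equivalents.
 */
#include <linux/mutex.h>

typedef struct mutex kmutex_t;

#define	mutex_enter(lp)		mutex_lock(lp)
#define	mutex_exit(lp)		mutex_unlock(lp)
#define	MUTEX_HELD(lp)		mutex_is_locked(lp)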

DTrace Libraries and Commands

It wouldn't happen for free, but the DTrace user-land components could just be directly ported. I don't believe there are any legal issues here.

So let's say that this is Google's DTrace port: their own hacked up kernel, some kernel modules operating under a non-GPL license, and some user-land components (also under a non-GPL license, but, again, I don't think that matters). Now some questions:

1. Legal To Run?

If Google assembled such a system, would it be legal to run on a development desktop machine? It seems to violate the GPL no more than, say, the nVidia drivers (which are presumably also running on that same desktop). What if Google installed the port on a customer-facing machine? Are there any additional legal complications there? My vote: legit.

2. Legal To Distribute?

Google distributes the Linux kernel patch (so that others can construct an identical kernel), and elsewhere they distribute the Linux-ready DTrace modules (in binary or source form): would that violate either license? It seems that it would potentially violate the GPL if a full system with both components were distributed together, but distributed individually it would certainly be fine. My vote: legit, but straying into a bit of a gray area.

3. Patch Accepted?

I'm really just putting this here for completeness. Google then submits the changes to the Linux kernel and tries to get them accepted upstream. There seems to be a precedent for the Linux kernel not accepting code that's there merely to support non-GPL kernel modules, so I doubt this would fly. My vote: not gonna happen.

4. No Source?

What if Google didn't supply the source code to either component, and didn't distribute any of it externally? My vote: legal, but morally bankrupt.

You Make The Call

So what do you think? Note that I'm not asking if it would be "good", and I'm not concluding that this would obviate the need for direct support for a native dynamic tracing framework in the Linux kernel. What I want to know is whether or not this DTrace port to Linux would be legal (and why)? If not, what would happen to poor Google (e.g. would FSF ninjas storm the Googleplex)?

If you care to comment, please include some brief statement about your legal expertise. I for one am not a lawyer, have no legal background, have read both the GPL and CDDL and have a basic understanding of both, but claim to be an authority in neither. If you don't include some information with regard to that, I may delete your comment.

Wednesday Jan 31, 2007

gzip for ZFS update

The other day I posted about a prototype I had created that adds a gzip compression algorithm to ZFS. ZFS already allows administrators to choose to compress filesystems using the LZJB compression algorithm. This prototype introduced a more effective -- albeit more computationally expensive -- alternative based on zlib.

As an arbitrary measure, I used tar(1) to create and expand archives of an ON (Solaris kernel) source tree on ZFS filesystems compressed with the lzjb and gzip algorithms, as well as on an uncompressed ZFS filesystem for reference.

Thanks for the feedback. I was curious if people would find this interesting and they do. As a result, I've decided to polish this wad up and integrate it into Solaris. I like Robert Milkowski's recommendation of options for different gzip levels, so I'll be implementing that. I'll also upgrade the kernel's version of zlib from 1.1.4 to 1.2.3 (the latest) for some compression performance improvements. I've decided (with some hand-wringing) to succumb to the requests for me to make these code modifications available. This is not production quality. If anything goes wrong it's completely your problem/fault -- don't make me regret this. Without further disclaimer: pdf patch
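
For a sense of what the different gzip levels mean mechanically, here's a minimal user-land sketch using zlib's compress2(); it's just an illustration of the underlying zlib call, not the actual ZFS compression entry point:

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

/*
 * Compress a buffer at a chosen level (1 = fastest, 9 = smallest); the
 * per-level filesystem options would map to this level parameter. This
 * is a user-land illustration of the zlib call, not ZFS code.
 */
int
main(void)
{
	const char src[] =
	    "a highly compressible buffer a highly compressible buffer "
	    "a highly compressible buffer a highly compressible buffer";
	uLongf dlen = compressBound(sizeof (src));
	Bytef *dst = malloc(dlen);

	if (dst == NULL)
		return (1);
	if (compress2(dst, &dlen, (const Bytef *)src, sizeof (src), 6) != Z_OK)
		return (1);
	(void) printf("%u -> %lu bytes at level 6\n",
	    (unsigned)sizeof (src), (unsigned long)dlen);
	free(dst);
	return (0);
}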

In reply to some of the comments:

UX-admin One could choose between lzjb for day-to-day use, or bzip2 for heavily compressed, "archival" file systems (as we all know, bzip2 beats the living daylights out of gzip in terms of compression about 95-98% of the time).

It may be that bzip2 is a better algorithm, but we already have (and need) zlib in the kernel, and I'm loath to add another algorithm.

ivanvdb25 Hi, I was just wondering: if the gzip compression has been enabled, does it give problems when a ZFS volume is created on an x86 system and afterwards imported on a Sun SPARC?

That isn't a problem. Data can be moved from one architecture to another (and I'll be verifying that before I putback).

dennis Are there any documents somewhere explaining the hooks of zfs and how to add features like this to zfs? Would be useful for developers who want to add features like filesystem-based encryption to it. Thanks for your great work!

There aren't any documents exactly like that, but there's plenty of documentation in the code itself -- that's how I figured it out, and it wasn't too bad. The ZFS source tour will probably be helpful for figuring out the big picture.

Update 3/22/2007: This work was integrated into build 62 of onnv.



Tuesday Dec 12, 2006

It's tested or it's broken

It's amazing how lousy software is. That we as a society have come to accept buggy software as an inevitability is either a testament to our collective tolerance, or -- much more likely -- to the near ubiquity of crappy software. So we are guilty of accepting low standards for software, but the smaller 'we' of software writers are guilty of setting those low expectations. And I mean we: all of us. Every programmer has at some time written buggy software (or has never written any software of any real complexity), and while we're absolutely at fault, it's not from lack of exertion. From time immemorial, PhD candidates have scratched their whiteboard markers dry in attempts to eliminate bugs with new languages, analyses, programming techniques, and styles. The simplest method for finding bugs before they're released into the wild remains the most generally effective: testing.

Of course, programmers perform at least nominal checks before integrating new code, but there's only so much a person can test by hand. So we've invented test suites -- collections of tests that require no interaction. Testing rigor is regarded by university computer science departments a bit like ditch-digging is by civil engineering departments: a bit pedestrian. So people tend to sort it out for themselves. Here are a couple of tips for software tests that have come out of my experience using and developing test suites (and the DTrace test suite in particular):

It has to be easy to run

A favorite mantra of a colleague of mine is that software is only as good as its test suite. While slightly less pithy, I'd add that a test suite is only as good as one's ability to run it. At Sun we have test suites for all kinds of crazy things. Many of them require elaborate configurations, and complex installations. Even when you manage to get everything set up (or, as often as not, find someone else to get it set up) and run, comprehending the results can require a visit from the high priestess of QA to scrutinize the pigeon entrails of the output logs.

Installing and executing a test suite needs to be so simple that it can be done by any moron who might have the wherewithal to be able to modify the software it tests (hint: that's usually a lower bar than you'd like). The same goes for understanding the results. Building the DTrace test suite creates a package which you then install wherever you want to perform the testing. Running it (by executing a single command) produces output indicating how many tests passed and how many failed. A single failure represents a bug. I've used test suites where there are expected failures (things are no more broken than they were), and unexpected failures (you broke something), but differentiating the two can be nearly impossible for a novice. Keep it simple and easy to understand, or don't bother at all -- no one will run tests they can't figure out.

Complete and up-to-date

Now that people are executing the test suite because it's such a breeze, it actually needs to test the software. I think it's productive to write tests both from the perspective of the implementation and the documented behavior, but there just needs to be adequate coverage -- and the extent of the coverage is something you can often measure with some accuracy. As the software is evolving, the test suite needs to evolve with it. Every enhancement or bug fix should be accompanied with new tests to verify the change, and to ensure that it's not regressed in the future. On projects I've worked on, the tests for certain features have required much more thought and effort than the feature itself, but skipping the test is absolutely unacceptable. In short: a test suite should completely test the target software at any given moment.

With the code

Originally we developed the DTrace test suite as a separate code base. This caused some unanticipated problems. Since they were in different places, we would often integrate a change to DTrace and forget about the test for a couple of days -- violating the constraint noted above. Also, projects that lagged behind the main repository would run the test suite and encounter a bunch of spurious failures because they were effectively testing out of date software. We had similar problems when back-porting new DTrace features and fixes to Solaris 10.

The solution -- in a rare split decision among the DTrace team -- was to integrate the test suite into the same repository as the code. This has absolutely been the right move. Now we can update the code and the test suite literally at the same time, and we're forced to think about testing sooner and more rigorously. It's also proved beneficial for the back-porting effort since a given snapshot of the source base contains the correct tests for that code.

Run automatically

Ideally it shouldn't be necessary, but automatically running tests is a great way to ensure that errors don't creep in because of sloppy engineering or seemingly unrelated changes. This is actually an area where DTrace is a less compelling role model. If we had put this procedure in place, it would have helped us to catch at least one bug quite a bit earlier. Solaris Nevada -- the code name for the next Solaris release -- recently changed compiler versions which resulted in a DTrace bug due to a newly aggressive optimizer on SPARC. The DTrace test suite picked this up immediately, but it wasn't run for at least a week after the compiler switch was made. We're working to have it run nightly, and our new project has been running nightly tests for a few weeks now.

Go forth and test

I've spent too many hours trying to figure out how to run arcane test suites -- just so I can't be accused of unduly contributing to the crappy state of software. I hope some of these (admittedly less-than-brilliant) lessons learned from testing DTrace have been helpful. If you want to check out the DTrace test suite, you can see the code here and find the documentation for it here.


About

Adam Leventhal, Fishworks engineer
