Recent Posts


Fishworks history of SSDs

This year's flash memory summit got me thinking about our use of SSDs over the years at Fishworks. The picture on the left is a visual history of SSD evals in rough chronological order from the oldest at the bottom to the newest at the top (including some that have yet to see the light of day).

Early Days

When we started Fishworks, we were inspired by the possibilities presented by ZFS and Thumper. Those components would be key building blocks in the enterprise storage solution that became the 7000 series. An immediate deficiency we needed to address was how to deliver competitive performance using 7,200 RPM disks. Folks like NetApp and EMC use PCI-attached NV-DRAM as a write accelerator. We evaluated something similar, but found the solution lacking because it had limited scalability (the biggest NV-DRAM cards at the time were 4GB), consumed our limited PCIe slots, and required a high-speed connection between nodes in a cluster (e.g. IB, further eating into our PCIe slot budget).

The idea we had was to use flash. None of us had any experience with flash beyond cell phones and USB sticks, but we had the vague notion that flash was fast and getting cheaper. By luck, flash SSDs were just about to be where we needed them. In late 2006 I started evaluating SSDs on behalf of the group, looking for what we would eventually call Logzilla. At that time, SSDs were getting affordable, but were designed primarily for environments such as military use where ruggedness was critical. The performance of those early SSDs was typically awful.

Logzilla

STEC (still Simpletech in those days) realized that their early samples didn't really suit our needs, but they had a new device (partly due to the acquisition of Gnutech) that would be a good match. That first sample was fibre-channel and took some finagling to get working (memorably it required a metric screw of an odd depth), but the Zeus IOPS, an 18GB 3.5" SATA SSD using SLC NAND, eventually became our Logzilla (we've recently updated it with a SAS version for our updated SAS-2 JBODs). Logzilla addressed write performance economically and scalably, in a way that also simplified clustering; the next challenge was read performance.

Readzilla

Intent on using commodity 7,200 RPM drives, we realized that our random read latency would be about twice that of 15K RPM drives (duh). Fortunately, most users don't access all of their data randomly (regardless of how certain benchmarks are designed). We already had much more DRAM cache than other storage products in our market segment, but we thought that we could extend that cache further by using SSDs. In fact, the invention of the L2ARC followed a slightly different thought process: seeing the empty drive bays in the front of our system (just two were used as our boot disks) and the piles of SSDs lying around, I stuck the SSDs in the empty bays and figured out how we'd use them.

It was again STEC who stepped up to provide our Readzilla, a 100GB 2.5" SATA SSD using SLC flash.

Next Generation

Logzilla and Readzilla are important features of the Hybrid Storage Pool. For the next generation, expect the 7000 series to move away from SLC NAND flash. It was great for the first generation, but other technologies provide better $/IOPS for Logzilla and better $/GB for Readzilla (while maintaining low latency). For Logzilla we think that NV-DRAM is a better solution (I reviewed one such solution here), and for Readzilla MLC flash has sufficient performance at much lower cost and ZFS will be able to ensure the longevity.



Farewell to Bryan Cantrill

Bryan Cantrill, VP of Engineering at Joyent, earning $15.

I've been expecting this automated mail for a while now, but it was disheartening nonetheless:

List: dtrace-discuss
Member: bryan.cantrill@eng.sun.com
Action: Subscription disabled.
Reason: Excessive or fatal bounces.

As one of the moderators of the DTrace discussion list, I see people subscribe and unsubscribe. Bryan has, of course, left Oracle and joined Joyent to be their VP of engineering.

Bryan is a terrific engineer, and I count myself lucky to have worked with him for the past nine years, first on DTrace and then on Fishworks. He taught me many things, but perhaps most important was his holistic view of engineering that encompasses all aspects of making a product successful including docs, pricing, talks, papers, and, of course, excellent code. Now Bryan is off to cut through the layers of software that make up the cloud. Far from leaving the DTrace community, he's going to take DTrace to new places and I look forward to seeing the fruits of his labor as he sinks his teeth into a new onion of abstractions.

... and, Robin, Bryan's certainly a smart guy, but "the smart guy behind Dtrace [sic]"?? Just don't refer to me and Mike as "the dumb guys behind DTrace" okay?



What is RAID-Z?

The mission of ZFS was to simplify storage and to construct an enterprise level of quality from volume components by building smarter software; indeed, that notion is at the heart of the 7000 series. An important piece of that puzzle was eliminating the expensive RAID card used in traditional storage and replacing it with high performance, software RAID. To that end, Jeff invented RAID-Z; its key innovation over other software RAID techniques was to close the "RAID-5 write hole" by using variable width stripes. RAID-Z, however, is definitely not RAID-5 despite that being the most common comparison.

RAID levels

Last year I wrote about the need for triple-parity RAID, and in that article I summarized the various RAID levels as enumerated by Gibson, Katz, and Patterson, along with Peter Chen, Edward Lee, and myself:

RAID-0: Data is striped across devices for maximal write performance. It is an outlier among the other RAID levels as it provides no actual data protection.
RAID-1: Disks are organized into mirrored pairs and data is duplicated on both halves of the mirror. This is typically the highest-performing RAID level, but at the expense of lower usable capacity.
RAID-2: Data is protected by memory-style ECC (error correcting codes). The number of parity disks required is proportional to the log of the number of data disks.
RAID-3: Protection is provided against the failure of any disk in a group of N+1 by carving up blocks and spreading them across the disks (bitwise parity). Parity resides on a single disk.
RAID-4: A group of N+1 disks is maintained such that the loss of any one disk would not result in data loss. A single disk is designated as the dedicated parity disk. Not all disks participate in reads (the dedicated parity disk is not read except in the case of a failure). Typically parity is computed simply as the bitwise XOR of the other blocks in the row.
RAID-5: N+1 redundancy as with RAID-4, but with distributed parity so that all disks participate equally in reads.
RAID-6: This is like RAID-5, but employs two parity blocks, P and Q, for each logical row of N+2 disk blocks.
RAID-7: Generalized M+N RAID with M data disks protected by N parity disks (without specifications regarding layout, parity distribution, etc).

RAID-Z: RAID-5 or RAID-3?

Initially, ZFS supported just one parity disk (raidz1), and later added two (raidz2) and then three (raidz3) parity disks. But raidz1 is not RAID-5, and raidz2 is not RAID-6. RAID-Z avoids the RAID-5 write hole by distributing logical blocks among disks whereas RAID-5 aggregates unrelated blocks into fixed-width stripes protected by a parity block. This actually means that RAID-Z is far more similar to RAID-3, where blocks are carved up and distributed among the disks; whereas RAID-5 puts a single block on a single disk, RAID-Z and RAID-3 must access all disks to read a single block, thus reducing the effective IOPS.

RAID-Z takes a significant step forward by enabling software RAID, but at the cost of backtracking on the evolutionary hierarchy of RAID. Now with advances like flash pools and the Hybrid Storage Pool, the IOPS from a single disk may be of less importance. But a RAID variant that, like RAID-Z, shuns specialized hardware and yet, like RAID-5, is economical with disk IOPS would be a significant advancement for ZFS.
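The bitwise XOR parity mentioned for RAID-4 (and shared by RAID-5) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not code from ZFS or any RAID implementation; the function names and the sample blocks are made up for the example.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the bitwise XOR of a row of equally sized byte blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks in a row must be the same size")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors plus parity."""
    return xor_parity(list(surviving_blocks) + [parity])


# Hypothetical row of three data blocks (N = 3) protected by one parity block.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_parity(data)

# Simulate losing the middle disk and rebuilding its block.
rebuilt = reconstruct([data[0], data[2]], parity)
```

Because XOR is its own inverse, the same routine both generates parity and rebuilds a lost block, which is why a single parity block can cover the loss of any one disk in the row.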



A Logzilla for your ZFS box

A key component of the ZFS Hybrid Storage Pool is Logzilla, a very fast device to accelerate synchronous writes. This component hides the write latency of disks to enable the use of economical, high-capacity drives. In the Sun Storage 7000 series, we use some very fast SAS and SATA SSDs from STEC as our Logzilla; the devices are great and STEC continues to be a terrific partner. The most important attribute of a good Logzilla device is that it have very low latency for sequential, uncached writes. The STEC part gives us about 100μs latency for a 4KB write, much, much lower than most SSDs. Using SAS-attached SSDs rather than the more traditional PCI-attached, non-volatile DRAM enables a much simpler and more reliable clustering solution since the intent-log devices are accessible to both nodes in the cluster, but SAS is much slower than PCIe...

DDRdrive X1

Christopher George, CTO of DDRdrive, was kind enough to provide me with a sample of the X1, a 4GB NV-DRAM card with flash as a backing store. The card contains 4 DIMM slots populated with 1GB DIMMs; it's a full-height card which limits its use in Sun/Oracle systems (typically half-height only), but there are many systems that can accommodate the card. The X1 employs a novel backup power solution; our Logzilla used in the 7000 series protects its DRAM write cache with a large super-capacitor, and many NV-DRAM cards use a battery. Supercaps can be limiting because of their physical size, and batteries have a host of problems including leaking and exploding. Instead, the DDRdrive solution puts a DC power connector on the PCIe faceplate and relies on an external source of backup power (a UPS for example).

Performance

I put the DDRdrive X1 in our fastest prototype system to see how it performed. A 4K write takes about 51μs (better than our SAS Logzilla), but the SSD outperformed the X1 at transfer sizes over 32KB. The performance results on the X1 are already quite impressive, and since I ran those tests the firmware and driver have undergone several revisions to improve performance even more.

As a Logzilla

While the 7000 series won't be employing the X1, for uses of ZFS that don't involve clustering and for which external backup power is an option, the X1 is a great and economical Logzilla accelerator. Many users of ZFS have already started hunting for accelerators, and have tested out a wide array of SSDs. The X1 is a far more targeted solution, and is a compelling option. And if write performance has been a limiting factor in deploying ZFS, the X1 is a good reason to give ZFS another look.
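To put the two 4KB latency figures side by side, here is a back-of-the-envelope calculation. It assumes a single stream of synchronous writes issued strictly one at a time, which is a simplification made for this example; real workloads with multiple outstanding writes will behave differently, and the function name is made up.

```python
def sync_writes_per_second(latency_us):
    """Upper bound on strictly serialized writes for a given per-write latency."""
    return 1_000_000 / latency_us


# Latencies quoted in the post for a 4KB write.
sas_logzilla = sync_writes_per_second(100)  # about 100 microseconds
ddrdrive_x1 = sync_writes_per_second(51)    # about 51 microseconds

print(f"SAS Logzilla: {sas_logzilla:,.0f} serialized 4KB writes/s")
print(f"DDRdrive X1:  {ddrdrive_x1:,.0f} serialized 4KB writes/s")
```

Under that assumption, halving the latency roughly doubles the rate at which a single client's synchronous writes can be acknowledged.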



The need for triple-parity RAID

When I first wrote about triple-parity RAID in ZFS and the Sun Storage 7000 series, I alluded to a looming requirement for triple-parity RAID due to a growing disparity between disk capacity and throughput. I've written an article in ACM Queue examining this phenomenon in detail, and making the case for triple-parity RAID. Dominic Kay helped me sift through hard drive data for the past ten years to build a model for how long it takes to fully populate a drive. I've reproduced a graph here from the paper that displays the timing data for a few common drive types; the trends are obviously quite clear.

The time to populate a drive is directly relevant for RAID rebuild. As disks in RAID systems take longer to reconstruct, the reliability of the total system decreases due to increased periods running in a degraded state. Today that can be four hours or longer; that could easily grow to days or weeks. RAID-6 grew out of a need for a system more reliable than what RAID-5 could offer. We are approaching a time when RAID-6 is no more reliable than RAID-5 once was. At that point, we will again need to refresh the reliability of RAID, and RAID-7, triple-parity RAID, will become the new standard.

Triple-Parity RAID and Beyond

ADAM LEVENTHAL, SUN MICROSYSTEMS

As hard-drive capacities continue to outpace their throughput, the time has come for a new level of RAID.

How much longer will current RAID techniques persevere? The RAID levels were codified in the late 1980s; double-parity RAID, known as RAID-6, is the current standard for high-availability, space-efficient storage. The incredible growth of hard-drive capacities, however, could impose serious limitations on the reliability even of RAID-6 systems. Recent trends in hard drives show that triple-parity RAID must soon become pervasive. In 2005, Scientific American reported on Kryder's law, which predicts that hard-drive density will double annually. While the rate of doubling has not quite maintained that pace, it has been close.

Problematically for RAID, hard-disk throughput has failed to match that exponential rate of growth. Today repairing a high-density disk drive in a RAID group can easily take more than four hours, and the problem is getting significantly more pronounced as hard-drive capacities continue to outpace their throughput. As the time required for rebuilding a disk increases, so does the likelihood of data loss. The ability of hard-drive vendors to maintain reliability while pushing to higher capacities has already been called into question in this magazine. Perhaps even more ominously, in a few years, reconstruction will take so long as to effectively strip away a level of redundancy. What follows is an examination of RAID, the rate of capacity growth in the hard-drive industry, and the need for triple-parity RAID as a response to diminishing reliability.

[...]
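The relationship between capacity, throughput, and the minimum time to repopulate a drive is simple division, sketched below. The 70 MB/s sustained throughput and the 1TB to 3TB capacities are assumptions picked for illustration so that the first result lands near the four-hour figure quoted above; they are not taken from the article's data set.

```python
TB = 10**12  # bytes, decimal terabyte as used by drive vendors
MB = 10**6   # bytes


def hours_to_fill(capacity_bytes, throughput_bytes_per_sec):
    """Best-case time to write an entire drive at a constant sustained rate."""
    return capacity_bytes / throughput_bytes_per_sec / 3600


# Hypothetical: throughput stays flat while capacity grows.
ASSUMED_THROUGHPUT = 70 * MB

for capacity_tb in (1, 2, 3):
    hours = hours_to_fill(capacity_tb * TB, ASSUMED_THROUGHPUT)
    print(f"{capacity_tb}TB at 70 MB/s: {hours:.1f} hours")
```

If capacity doubles while throughput stays roughly flat, the best-case rebuild window doubles with it, and a real rebuild on a busy system will take longer than this lower bound.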



Logzillas: to mirror or stripe?

The Hybrid Storage Pool integrates flash into the storage hierarchy in two specific ways: as a massive read cache and as fast log devices. For read cache devices, Readzillas, there's no need for redundant configurations; it's a clean cache so the data necessarily also resides on disk. For log devices, Logzillas, redundancy is essential, but how that translates to their configuration can be complicated. How to decide whether to stripe or mirror?

ZFS intent log devices

Logzillas are used as ZFS intent log devices (slogs in ZFS jargon). For certain synchronous write operations, data is written to the Logzilla so the operation can be acknowledged to the client quickly before the data is later streamed out to disk. Rather than the milliseconds of latency for disks, Logzillas respond in about 100μs. If there's a power failure or system crash before the data can be written to disk, the log will be replayed when the system comes back up; this is the only scenario in which Logzillas are read. Under normal operation they are effectively write-only. Unlike Readzillas, Logzillas are integral to data integrity: they are relied upon in the case of a system failure.

A common misconception is that a non-redundant Logzilla configuration introduces a single point of failure into the system; however, this is not the case since the data contained on the log devices is also held in system memory. Though that memory is indeed volatile, data loss could only occur if both the Logzilla failed and the system failed within a fairly small time window.

Logzilla configuration

While a Logzilla doesn't represent a single point of failure, redundant configurations are still desirable in many situations. The Sun Storage 7000 series implements the Hybrid Storage Pool, and offers several different redundant disk configurations. Some of those configurations add a single level of redundancy: mirroring and single-parity RAID. Others provide additional redundancy: triple-mirroring, double-parity RAID and triple-parity RAID. For disk configurations that provide double disk redundancy or better, the best practice is to mirror Logzillas to achieve a similar level of reliability. For singly redundant disk configurations, non-redundant Logzillas might suffice, but there are conditions such as a critically damaged JBOD that could affect both Logzilla and controller more or less simultaneously. Mirrored Logzillas add additional protection against such scenarios.

Note that the Logzilla configuration screen (pictured) includes a column for No Single Point of Failure (NSPF). Logzillas are never truly a single point of failure as previously discussed; instead, this column refers to the arrangement of Logzillas in JBODs. A value of true indicates that the configuration is resilient against JBOD failure.

The most important factors to consider when deciding between mirrored or striped Logzillas are the consequences of potential data loss. In a failure of Logzillas and controller, data will not be corrupted, but the last 5-30 seconds worth of transactions could be lost. For example, while it typically makes sense to mirror Logzillas for triple-parity RAID configurations, it may be that the data stored is less important and the implications for data loss not worthy of the cost of another Logzilla device. Conversely, while a mirrored or single-parity RAID disk configuration provides only a single level of redundancy, the implications of data loss might be such that the redundancy of volatile system memory is insufficient. Just as it's important to choose the appropriate disk configuration for the right balance of performance, capacity, and reliability, it's at least as important to take care and gather data to make an informed decision about Logzilla configurations.



Triple-Parity RAID-Z

Double-parity RAID, or RAID-6, is the de facto industry standard for storage; when I started talking about triple-parity RAID for ZFS earlier this year, the need wasn't always immediately obvious. Double-parity RAID, of course, provides protection from up to two failures (data corruption or the whole drive) within a RAID stripe. The necessity of triple-parity RAID arises from the observation that while hard drive capacity has roughly followed Kryder's law, doubling annually, hard drive throughput has improved far more modestly. Accordingly, the time to populate a replacement drive in a RAID stripe is increasing rapidly. Today, a 1TB SAS drive takes about 4 hours to fill at its theoretical peak throughput; in a real-world environment that number can easily double, and 2TB and 3TB drives expected this year and next won't move data much faster. Those long periods spent in a degraded state increase the exposure to the bit errors and other drive failures that would in turn lead to data loss. The industry moved to double-parity RAID because one parity disk was insufficient; longer resilver times mean that we're spending more and more time back at single-parity. From that it was obvious that double-parity will soon become insufficient. (I'm working on an article that examines these phenomena quantitatively so stay tuned... update Dec 21, 2009: you can find the article here)

Last week I integrated triple-parity RAID into ZFS. You can take a look at the implementation and the details of the algorithm here, but rather than describing the specifics, I wanted to describe its genesis. For double-parity RAID-Z, we drew on the work of Peter Anvin which was also the basis of RAID-6 in Linux. This work was more or less a tutorial for systems programmers, simplifying some of the more subtle underlying mathematics with an eye towards optimization. While a systems programmer by trade, I have a background in mathematics so was interested to understand the foundational work. James S. Plank's paper "A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems" describes a technique for generalized N+M RAID. Not only was it simple to implement, but it could easily be made to perform well. I struggled for far too long trying to make the code work before discovering trivial flaws with the math itself. A bit more digging revealed that the author himself had published "Note: Correction to the 1997 Tutorial on Reed-Solomon Coding" 8 years later addressing those same flaws.

Predictably, the mathematically accurate version was far harder to optimize, stifling my enthusiasm for the generalized case. My more serious concern was that the double-parity RAID-Z code suffered some similar systemic flaw. This fear was quickly assuaged as I verified that the RAID-6 algorithm was sound. Further, from this investigation I was able to find a related method for doing triple-parity RAID-Z that was nearly as simple as its double-parity cousin.

The math is a bit dense, but the key observation was that given that 3 is the smallest factor of 255 (the largest value representable by an unsigned byte) it was possible to find exactly 3 different seed or generator values, after which there were collections of failures that formed uncorrectable singularities. Using that technique I was able to implement a triple-parity RAID-Z scheme that performed nearly as well as the double-parity version.

As far as generic N-way RAID-Z goes, it's still something I'd like to add to ZFS. Triple-parity will suffice for quite a while, but we may want more parity sooner for a variety of reasons. Plank's revised algorithm is an excellent start. The test will be if it can be made to perform well enough or if some new clever algorithm will need to be devised.

Now, as for what to call these additional RAID levels, I'm not sure. RAID-7 or RAID-8 seem a bit ridiculous and RAID-TP and RAID-QP aren't any better. Fortunately, in ZFS triple-parity RAID is just raidz3.

A little over three years ago, I integrated double-parity RAID-Z into ZFS, a feature expected of enterprise class storage. This was in the early days of Fishworks when much of our focus was on addressing functional gaps. The move to triple-parity RAID-Z comes in the wake of a number of our unique advancements to the state of the art such as DTrace-powered Analytics and the Hybrid Storage Pool as the Sun Storage 7000 series products meet and exceed the standards set by the industry. Triple-parity RAID-Z will, of course, be a feature included in the next major software update for the 7000 series (2009.Q3).
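For readers who want a feel for the arithmetic, here is a minimal sketch of three parity values computed over GF(2^8), one byte per data disk. It is not the ZFS implementation. The generators 1, 2, and 4 and the reducing polynomial 0x11d are assumptions chosen to follow Anvin's RAID-6 construction; the post itself does not name them. All function names and the sample bytes are invented for the example, and the code favors clarity over speed.

```python
def gf_mul2(a):
    """Multiply by 2 in GF(2^8), reducing by the assumed polynomial 0x11d."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11D
    return a


def gf_mul(a, b):
    """General GF(2^8) multiplication by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = gf_mul2(a)
        b >>= 1
    return result


def gf_pow(base, exp):
    """Raise base to a non-negative integer power in GF(2^8)."""
    result = 1
    for _ in range(exp):
        result = gf_mul(result, base)
    return result


def gf_inv(a):
    """Multiplicative inverse: a^254, since a^255 == 1 for nonzero a."""
    return gf_pow(a, 254)


def parity(data, generator):
    """Horner's rule for sum(generator^(n-1-i) * data[i]) over GF(2^8)."""
    acc = 0
    for d in data:
        acc = gf_mul(acc, generator) ^ d
    return acc


# One hypothetical byte from each of five data disks.
data = [0x12, 0x34, 0x56, 0x78, 0x9A]
p, q, r = (parity(data, g) for g in (1, 2, 4))

# Worst case for a single data disk: P and Q are also gone, only R survives.
lost = 2
partial = parity([0 if i == lost else d for i, d in enumerate(data)], 4)
coeff = gf_pow(4, len(data) - 1 - lost)
recovered = gf_mul(r ^ partial, gf_inv(coeff))
```

With generator 1 the parity reduces to plain XOR, which is the single-parity case; the other two generators give independent equations, so up to three unknowns in a stripe can be solved for. Recovering two or three missing data columns requires solving a small linear system over the same field, which this sketch leaves out.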



Sun Storage 7310

Today we're introducing a new member to the Sun Unified Storage family: the Sun Storage 7310. The 7310 is a scalable system from 12TB with a single half-populated J4400 JBOD up to 96TB with 4 JBODs. You can combine two 7310 head units to form a cluster. The base configuration includes a single quad-core CPU, 16GB of DRAM, a SAS HBA, and two available PCIe slots for NICs, backup cards, or the Fishworks cluster card. The 7310 can be thought of as a smaller capacity, lower cost version of the Sun Storage 7410. Like the 7410 it uses high density, low power disks as primary storage and can be enhanced with Readzilla and Logzilla flash accelerators for high performance. Like all the 7000 series products, the 7310 includes all protocols and software features without license fees.

The 7310 is an entry-level clusterable, scalable storage server, but the performance is hardly entry-level. Brendan Gregg from the Fishworks team has detailed the performance of the 7410, and has published the results of those tests on the new 7310. Our key metrics are cached reads from DRAM, uncached reads from disk, and writes to disk, all over two 10GbE links with 20 client systems. As shown in the graph, the 7310 is an absolute champ, punching well above its weight. The numbers listed are in units of MB/s. Notice that the recent 2009.Q2 software update brought significant performance improvements to the 7410, and that the 7310 holds its own. For owners of entry-level systems from other vendors, check for yourself, but the 7310 is a fire-breather.

Added to the low-end 7110, the dense, expandable 7210, and the high-end clusterable, expandable 7410, the 7310 fills an important role in the 7000 series product line: an entry-level clusterable, expandable system with impressive performance and an attractive price. If the specs and performance have piqued your interest, try out the user interface on the 7000 series with the Sun Storage 7000 simulator.



Mirroring flash SSDs

As flash memory has become more and more prevalent in storage from the consumer to the enterprise, people have been charmed by the performance characteristics, but get stuck on the longevity. SSDs based on SLC flash are typically rated at 100,000 to 1,000,000 write/erase cycles while MLC-based SSDs are rated for significantly less. For conventional hard drives, the distinct yet similar increase in failures over time has long been solved by mirroring (or other redundancy techniques). When applying this same solution to SSDs, a common concern is that two identical SSDs with identical firmware storing identical data would run out of write/erase cycles for a given cell at the same moment and thus data reliability would not be increased via mirroring. While the logic might seem reasonable, permit me to dispel that specious argument.

The operating system and filesystem

From the level of most operating systems or filesystems, an SSD appears like a conventional hard drive and is treated more or less identically (Solaris' ZFS being a notable exception). As with hard drives, SSDs can report predicted failures through SMART. For reasons described below, SSDs already keep track of the wear of cells, but one could imagine even the most trivial SSD firmware keeping track of the rapidly approaching write/erase cycle limit and notifying the OS or FS via SMART, which would in turn notify the user. Well in advance of actual data loss, the user would have an opportunity to replace either or both sides of the mirror as needed.

SSD firmware

Proceeding down the stack to the level of the SSD firmware, there are two relevant features to understand: wear-leveling, and excess capacity. There is not a static mapping between the virtual offset of an I/O to an SSD and the physical flash cells that are chosen by the firmware to record the data. For a variety of reasons (flash cell early mortality, write performance, bad cell remapping) it is necessary for the SSD firmware to remap data all over its physical flash cells. In fact, hard drives have a similar mechanism by which they hold sectors in reserve and remap them to fill in for defective sectors. SSDs have the added twist that they want to maximize the longevity of their cells, each of which will ultimately decay over time. To do this, the firmware ensures that a given cell isn't written far more frequently than any other cell, a process called wear-leveling for obvious reasons.

To summarize, subsequent writes to the same LBA, the same virtual location, on an SSD could land on different physical cells for the several reasons listed. The firmware is, more often than not, deterministic; thus two identical SSDs with the exact same physical media and I/O stream (as in a mirror) would behave identically, but minor timing variations in the commands from operating software, and differences in the media (described below), ensure that the identical SSDs will behave differently. As time passes, those differences are magnified such that two SSDs that started with the same mapping between virtual offsets and physical media will quickly and completely diverge.

Flash hardware and physics

Identical SSDs with identical firmware still have their own physical flash memory, which can vary in quality. To break the problem apart a bit, an SSD is composed of many cells, and each cell's ability to retain data slowly degrades as it's exercised. Each cell is in fact a physical component of an integrated circuit. Flash memory differs from many other integrated circuits in that it requires far higher voltages than others. It is this high voltage that causes the oxide layer to gradually degrade over time. Further, all cells are not created equal: microscopic variations in the thickness and consistency of the physical medium can make some cells more resilient and others less; some cells might be DOA, while others might last significantly longer than the norm. By analogy, if you install new light bulbs in a fixture, they might burn out in the same month, but how often do they fail on the same day? The variability of flash cells impacts the firmware's management of the underlying cells, but more trivially it means that two SSDs in a mirror would experience data loss of corresponding regions at different rates.

Wrapping up

As with conventional hard drives, mirroring SSDs is a good idea to preserve data integrity. The operating system, filesystem, SSD firmware, and physical properties of the flash medium make this approach sound both in theory and in practice. Flash is a new exciting technology and changes many of the assumptions derived from decades of experience with hard drives. As always proceed with care, especially when your data is at stake, but get the facts, and in this case the wisdom of conventional hard drives still applies.
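The point that repeated writes to one LBA land on different physical cells can be shown with a toy model. This is a deliberately naive sketch, not how any real SSD firmware works: the `ToySSD` class, its least-worn-cell policy, and the four-cell device are all invented for illustration.

```python
class ToySSD:
    """Toy wear-leveling model: each write goes to the least-worn free cell."""

    def __init__(self, physical_cells):
        self.wear = [0] * physical_cells  # write/erase count per physical cell
        self.mapping = {}                 # LBA -> physical cell holding its data

    def write(self, lba):
        # Cells holding other LBAs are busy; the cell holding this LBA's old
        # copy may be reused because the old data is being replaced.
        in_use = set(self.mapping.values()) - {self.mapping.get(lba)}
        free = [c for c in range(len(self.wear)) if c not in in_use]
        if not free:
            raise RuntimeError("no free physical cell (toy model has no GC)")
        target = min(free, key=lambda c: self.wear[c])
        self.wear[target] += 1
        self.mapping[lba] = target
        return target


# Rewrite the same logical block eight times on a four-cell device.
ssd = ToySSD(physical_cells=4)
placements = [ssd.write(lba=0) for _ in range(8)]
```

Even in this deterministic toy the hot LBA is spread evenly across every cell; once per-cell quality and timing differ between two drives, as described above, the placements on the two sides of a mirror would no longer line up.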



SS 7000 simulator update plus VirtualBox

On the heels of the 2009.Q2.0.0 release, we've posted an update to the Sun Storage 7000 simulator. The simulator contains the exact same software as the other members of the 7000 series, but runs inside a VM rather than on actual hardware. It supports all the same features, and has all the same UI components; just remember that an actual 7000 series appliance is going to perform significantly better than a VM running a puny laptop CPU. Download the simulator here.The new version of the simulator contains two enhancements. First, it comes with the 2009.Q2.0.0 release pre-installed. The Q2 release is the first to provide full support for the simulator, and as I wrote here you can simply upgrade your old simulator. In addition, while the original release of the simulator could only be run on VMware we now support both VMware and VirtualBox (version 2.2.2 or later). When we first launched the 7000 series back in November, we intended to support the simulator on VirtualBox, but a couple of issues thwarted us, in particular lack of OVF support and host-only networking. The recent 2.2.2 release of VirtualBox brought those missing features, so we're pleased to be able to support both virtualization platforms.As OVF support is new in VirtualBox, here's a quick installation guide for the simulator. After uncompressing the SunStorageVBox.zip archive, select "Import Appliance...", and select "Sun Storage VirtualBox.ovf". Clicking through will bring up a progress bar. Be warned: this can take a while depending on the speed of your CPU and hard drive.When that completes, you will see the "Sun Storage VirtualBox" VM in the VirtualBox UI. You may need to adjust settings such as the amount of allocated memory, or extended CPU features. Run the VM and follow the instructions when it boots up. You'll be prompted for some simple network information. 
If you're unsure how to fill in some of the fields, here are some pointers:

Host Name - whatever you want
DNS Domain - "localdomain"
Default Router - the same as the IP address but put 1 as the final octet
DNS Server - the same as the IP address but put 1 as the final octet
Password - whatever you want and something you can remember

When you complete that form, wait until you're given a URL to copy into a web browser. Note that you'll need to use the version of the URL with the IP address (unless you've added an entry to your DNS server). In the above example, that would be:

From the web browser, complete the appliance configuration, and then you can start serving up data, observing activity with Storage Analytics, and kicking the tires on a functional replica of a 7000 series appliance.
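The pointer for the Default Router and DNS Server fields (same as the IP address, with 1 as the final octet) can be sketched in a few lines of Python. This is only an illustration: the function name and the sample address 192.168.56.101 are assumptions, not values from the simulator, and the rule only makes sense on the kind of host-only or NAT network a VM typically sits on.

```python
import ipaddress


def router_for(ip: str) -> str:
    """Return the address with its final octet replaced by 1.

    Mirrors the rule of thumb for the simulator's Default Router and
    DNS Server fields. Raises ValueError if `ip` is not valid IPv4.
    """
    addr = ipaddress.IPv4Address(ip)      # validates the input
    octets = str(addr).split(".")
    octets[-1] = "1"
    return ".".join(octets)


if __name__ == "__main__":
    vm_ip = "192.168.56.101"              # hypothetical simulator address
    print("Default Router:", router_for(vm_ip))
    print("DNS Server:    ", router_for(vm_ip))
```

If your virtualization platform hands out a different gateway, use that instead; the final-octet rule is just a convenient default.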



SSDs for HSPs

We're announcing a couple of new things in the flash SSD space. First, we now support the Intel X25-E SSD in a bunch of our servers. This can be used to create a Hybrid Storage Pool like in the Sun Storage 7000 series, or as just a little flash for high performance / low power / tough environmentals.

Second, we're introducing a new open standard with the Open Flash Module. This creates a new form factor for SSDs, bringing flash even closer to the CPU for higher performance and tighter system integration. SSDs in HDD form factors were a reasonable idea to gain market acceptance in much the same way as you first listened to your iPod over your car stereo with that weird tape adapter. Now the iPod is a first-class citizen in many cars and, with the Open Flash Module, flash has found a native interface and form factor. This is a building block that we're very excited about, and it was designed specifically for use with ZFS and the Hybrid Storage Pool. Stay tuned: these flash miniDIMMs, as they're called, will be showing up in some interesting places soon enough. Speaking personally, this represents an exciting collaboration of hardware and software, and it's gratifying to see Sun showing real leadership around flash through innovation.



Fishworks VM: the 7000 series on your laptop

In May of 2007 I was lined up to give my first customer presentation of what would become the Sun Storage 7000 series. I inherited a well-worn slide deck describing the product, but we had seen the reactions of prospective customers who saw the software live and had a chance to interact with features such as Analytics; no slides would elicit that kind of response. So with some tinkering, I hacked up our installer and shoe-horned the prototype software into a virtual machine. The live demonstration was a hit despite some rocky software interactions.

As the months passed, our software became increasingly aware of our hardware platforms; the patches I had used for the virtual machine version fell into disrepair. Racing toward the product launch, neither I nor anyone else in the Fishworks group had the time to nurse it back to health. I found myself using months-old software for a customer demo: a useful tool, but embarrassing given the advances we had made. We knew that the VM was going to be great for presentations, and we had talked about releasing a version to the general public, but that, we thought, was something that we could sort out after the product launch.

In the brief calm after the frenetic months finishing the product and just a few days before the launch in Las Vegas, our EVP of storage, John Fowler, paid a visit to the Fishworks office. When we mentioned the VM version, his eyes lit up at the thought of how it would help storage professionals. Great news, but we realized that the next few days had just become much busier.

Creating the VM version was a total barn-raising. Rather than a one-off with sharp edges, adequate for a canned demo, we wanted to hand a product to users that would simulate exactly a Sun Storage 7000 series box. In about three days, everyone in the group pitched in to build what was essentially a brand new product and platform complete with a hardware view conjured from bits of our actual appliances.
After a frenetic weekend in November, the Sun Unified Storage Simulator was ready in time for the launch. You can download it here for VMware. We had prepared versions for VirtualBox as well as VMware, preferring VirtualBox since it's a Sun product; along the way we found some usability issues with the VirtualBox version — we were pushing both products beyond their design center and VMware handled it better. Rest assured that we're working to resolve those issues and we'll release the simulator for VirtualBox just as soon as it's ready. Note that we didn't limit the functionality at all; what you see is exactly what you'll get with an actual 7000 series box (though the 7000 series will deliver much better performance than a laptop). Analytics, replication, compression, CIFS, iSCSI are all there; give it a try and see what you think.



More from the storage anarchist

In my last blog post I responded to Barry Burke, author of the Storage Anarchist blog. I was under the perhaps naive impression that Barry was an independent voice in the blogosphere. In fact, he's merely Storage Anarchist by night; by day he's the mild-mannered chief strategy officer for EMC's Symmetrix Products Group, a fact notable for its absence from Barry's blog. In my post, I observed that Barry had apparently picked his horse in the flash race and Chris Caldwell commented that "it would appear that not only has he chosen his horse, but that he's planted squarely on its back wearing an EMC jersey." Indeed.

While looking for some mention of his employment with EMC, I found this petard from Barry Burke, chief strategy officer for EMC's Symmetrix Products Group:

And [the "enterprise" differentiation] does matter – recall this video of a Fishworks JBOD suffering a 100x impact on response times just because the guy yells at a drive. You wouldn't expect that to happen with an enterprise class disk drive, and with enterprise-class drives in an enterprise-class array, it won't.

Barry, we wondered the same thing so we got some time on what you'd consider an enterprise-class disk drive in an enterprise-class array from an enterprise-class vendor. The results were nearly identical (of course, measuring latency on other enterprise-class solutions isn't nearly as easy). It turns out drives don't like being shouted at (it's shock, not the traditional rotational vibration, or RV, that drives compensate for). That enterprise-class rig was not an EMC Symmetrix, though I'd salivate over the opportunity to shout at one.



Dancing with the Anarchist

Barry Burke, the Storage Anarchist, has written an interesting roundup ("don't miss the amazing vendor flash dance") covering the flash strategies of some players in the server and storage spaces. Sun's position on flash comes out a bit mangled, but Barry can certainly be forgiven for missing the mark since Sun hasn't always communicated its position well. Allow me to clarify our version of the flash dance.

Barry's conclusion that Sun sees flash as well-suited for the server isn't wrong; of course it's harder to drive high IOPS and low latency outside a single box. However, we've also proven not only that we see a big role for flash in storage, but that we're innovating in that realm with the Hybrid Storage Pool (HSP), an architecture that seamlessly integrates flash into the storage hierarchy. Rather than a Ron Popeil-esque sales pitch, let me take you through the genesis of the HSP.

The HSP is something we started to develop a bit over two years ago. By January of 2007, we had identified that a ZFS intent-log device using flash would greatly improve the performance of the nascent Sun Storage 7000 series in a way that was simpler and more efficient than some other options. We started getting our first flash SSD samples in February of that year. With SSDs on the brain, we started contemplating other uses and soon came up with the idea of using flash as a secondary caching tier between the DRAM cache (the ZFS ARC) and disk. We dubbed this the L2ARC.

At that time we knew that we'd be using mostly 7200 RPM disks in the 7000 series. Our primary goal with flash was to greatly improve the performance of synchronous writes, and we addressed this with the flash log device that we call Logzilla. With the L2ARC we solved the other side of the performance equation by improving read IOPS by leaps and bounds over what hard drives of any rotational speed could provide.
By August of 2007, Brendan had put together the initial implementation of the L2ARC, and, combined with some early SSD samples (Readzillas), our initial enthusiasm was borne out. Yes, it's a caching tier so some workloads will do better than others, but customers have been very pleased with their results.

These two distinct uses of flash make up the Hybrid Storage Pool. In April 2008 we gave our first public talk about the HSP at the IDF in Shanghai, and a year and a bit after Brendan's proof of concept we shipped the 7410 with Logzilla and Readzilla. It's important to note that this system achieves remarkable price/performance through its marriage of commodity disks with flash. Brendan has done a terrific job of demonstrating the performance enabled by the HSP on that system.

While we were finishing the product, the WSJ reported that EMC was starting to use flash drives in their products. I was somewhat deflated initially until it became clear that EMC's solution didn't integrate flash into the storage hierarchy nearly as seamlessly or elegantly as we had with the HSP; instead they had merely replaced their fastest, most expensive drives with faster and even more expensive SSDs. I'll disagree with the Storage Anarchist's conclusion: EMC did not start the flash revolution nor are they leading the way (though I don't doubt they are, as Barry writes, "Taking Our Passion, And Making It Happen"). EMC though has done a great service to the industry by extolling the virtues of SSDs and, presumably, to EMC customers by providing a faster tier for HSM.

In the same article, Barry alludes to some of the problems with EMC's approach using SSDs from STEC:

STEC rates their ZeusIOPS drives at something north of 50,000 read IOPS each, but as I have explained before, this is a misleading number because it’s for 512-byte blocks, read-only, without the overhead of RAID protection.
A more realistic expectation is that the drives will deliver somewhere around 5-6000 4K IOPS (4K is a more typical I/O block size).

The Hybrid Storage Pool avoids the bottlenecks associated with a tier 0 approach, drives much higher IOPS, scales, and makes highly efficient, economical use of the resources from flash to DRAM and disk. Further, I think we'll be able to debunk this notion that the enterprise needs its own class of flash devices by architecting commodity flash to build an enterprise solution. There are a lot of horses in this race; Barry has clearly already picked his, but the rest of you may want to survey the field.



Casting the shadow of the Hybrid Storage Pool

The debate, calmly waged, on the best use of flash in the enterprise can be summarized as whether flash should be a replacement for disk, acting as primary storage, or it should be regarded as a new, and complementary tier in the storage hierarchy, acting as a massive read cache. The market leaders in storage have weighed in on the issue, and have declared incontrovertibly that, yes, both are the right answer, but there's some bias underlying that equanimity.

Chuck Hollis, EMC's Global Marketing CTO, writes that "flash as cache will eventually become less interesting as part of the overall discussion... Flash as storage? Well, that's going to be really interesting."

Standing boldly with a foot in each camp, Dave Hitz, founder and EVP at NetApp, thinks that "Flash is too expensive to replace disk right away, so first we'll see a new generation of storage systems that combine the two: flash for performance and disk for capacity."

So what are these guys really talking about, what does the landscape look like, and where does Sun fit in all this?

Flash as primary storage (a.k.a. tier 0)

Integrating flash efficiently into a storage system isn't obvious; the simplest way is as a direct replacement for disks. This is why most of the flash we use today in enterprise systems comes in units that look and act just like hard drives: SSDs are designed to be drop-in replacements. Now, a flash SSD is quite different than a hard drive: rather than a servo spinning platters while a head chatters back and forth, an SSD has floating gates arranged in blocks... actually it's probably simpler to list what they have in common, and that's just the form factor and interface (SATA, SAS, FC). Hard drives have all kinds of properties that don't make sense in the world of SSDs (e.g. I've seen an SSD that reports its RPM telemetry as 1), and SSDs have their own quirks with no direct analog (read/write asymmetry, limited write cycles, etc).
SSD vendors, however, manage to pound these round pegs into their square holes, and produce something that can stand in for an existing hard drive. Array vendors are all too happy to attain buzzword compliance by stuffing these SSDs into their products.

Storage vendors already know how to deal with a caste system for disks: they striate them in layers with fast, expensive 15K RPM disks as tier 1, and slower, cheaper disks filling out the chain down to tape. What to do with these faster, more expensive disks? Tier-0 of course! An astute NetApp blogger asks, "when the industry comes up with something even faster... are we going to have tier -1" (great question).

What's wrong with that approach? Nothing. It works; it's simple; and we (the computing industry) basically know how to manage a bunch of tiers of storage with something called hierarchical storage management. The trouble with HSM is the burden of the M. This solution kicks the problem down the road, leaving administrators to figure out where to put data, what applications should have priority, and when to migrate data.

Flash as a cache

The other school of thought around flash is to use it not as a replacement for hard drives, but rather as a massive cache for reading frequently accessed data. As I wrote back in June for CACM, "this new flash tier can be thought of as a radical form of hierarchical storage management (HSM) without the need for explicit management." Tersely, HSM without the M. This idea forms a major component of what we at Sun are calling the Hybrid Storage Pool (HSP), a mechanism for integrating flash with disk and DRAM to form a new, and I argue superior, storage solution.

Let's set aside the specifics of how we implement the HSP in ZFS; you can read about that elsewhere. Rather, I'll compare the use of flash as a cache to flash as a replacement for disk independent of any specific solution.

The case for cache

It's easy to see why using flash as primary storage is attractive.
Flash is faster than the fastest disks by at least a factor of 10 for writes and a factor of 100 for reads measured in IOPS. Replacing disks with flash though isn't without nuance; there are several inhibitors, and primary among them is cost. The cost of flash continues to drop, but it's still much more expensive than cheap disks, and will continue to be for quite a while. With flash as primary storage, you still need data redundancy (SSDs can and do fail), and while we could use RAID with single- or double-device redundancy, that would cleave the available IOPS by a factor of the stripe width. The reason to migrate to flash is for performance, so it wouldn't make much sense to hang the majority of that performance back with RAID. The remaining option, therefore, is to mirror SSDs, whereby the already high cost is doubled.

It's hard to argue with results; all-flash solutions do rip. If money were no object that may well be the best solution (but if cost truly wasn't a factor, everyone would strap batteries to DRAM and call it a day).

Can flash as a cache do better? Say we need to store 50TB of data. With an all-flash pool, we'll need to buy SSDs that can hold roughly 100TB of data if we want to mirror for optimal performance, and maybe 60TB if we're willing to accept a far more modest performance improvement over conventional hard drives. Since we're already resigned to cutting a pretty hefty check, we have quite a bit of money to play with to design a hybrid solution.

If we were to provision our system with 50TB of flash and 60TB of hard drives we'd have enough cache to retain every byte of active data in flash while the disks provide the necessary redundancy. As writes come in, the filesystem would populate the flash while it writes data persistently to disk. The performance of this system would be epsilon away from the mirrored flash solution as read requests would only go to disk in the case of faults from the flash devices.
Note that we never rely on correctness from the flash; it's the hard drives that provide reliability.

The hybrid solution is cheaper, and it's also far more flexible. If a smaller working set accounted for a disproportionately large number of reads, the total IOPS capacity of the all-flash solution could be underused. With flash as a cache, data could be migrated to dynamically distribute load, and additional cache could be used to enhance the performance of the working set. It would be possible to use some of the same techniques with an all-flash storage pool, but it could be tricky. The luxury of a cache is that the looser constraints allow for more aggressive data manipulation.

Building on the idea of concentrating the use of flash for hot data, it's easy to see how flash as a cache can improve performance even without every byte present in the cache. Most data doesn't require 50μs random access latency over the entire dataset; users would see a significant performance improvement with just the active subset in a flash cache. Of course, this means that software needs to be able to anticipate what data is in use, which probably inspired this comment from Chuck Hollis: "cache is cache - we all know what it can and can't do."
That may be so, but comparing an ocean of flash for primary storage to a thimbleful of cache reflects fairly obtuse thinking. Caching algorithms will always be imperfect, but the massive scale to which we can grow a flash cache radically alters the landscape.

Even when a working set is too large to be cached, it's possible for a hybrid solution to pay huge dividends. Over at Facebook, Jason Sobel (a colleague of mine in college) produced an interesting presentation on their use of storage (take a look at Jason's penultimate slide for his take on SSDs). Their datasets are so vast and sporadically accessed that the latency of actually loading a picture, say, off of hard drives isn't actually the biggest concern; rather it's the time it takes to read the indirect blocks, the metadata. At Facebook, they've taken great pains to reduce the number of dependent disk accesses from fifteen down to about three. In a case such as theirs, it would never be economical to store or cache the full dataset on flash, and the working set is similarly too large as data access can be quite unpredictable. It could, however, be possible to cache all of their metadata in flash. This would reduce the latency to an infrequently accessed image by nearly a factor of three. Today in ZFS this is a manual setting per-filesystem, but it would be possible to evolve a caching algorithm to detect a condition where this was the right policy and make the adjustment dynamically.

Using flash as a cache offers the potential to do better, and to make more efficient and more economical use of flash. Sun, and the industry as a whole, have only just started to build the software designed to realize that potential.

Putting products before words

At Sun, we've just released our first line of products that offer complete flash integration with the Hybrid Storage Pool; you can read about that in my blog post on the occasion of our product launch.
On the eve of that launch, NetApp announced their own offering: a flash-laden PCI card that plays much the same part as their DRAM-based Performance Acceleration Module (PAM). This will apparently be available sometime in 2009. EMC offers a tier 0 solution that employs very fast and very expensive flash SSDs.

What we have in ZFS today isn't perfect. Indeed, the Hybrid Storage Pool casts the state of the art forward, and we'll be catching up with solutions to the hard questions it raises for at least a few years. Only then will we realize the full potential of flash as a cache. What we have today though integrates flash in a way that changes the landscape of storage economics and delivers cost efficiencies that haven't been seen before. If the drive manufacturers don't already, it can't be long until they hear the death knell for 15K RPM drives loud and clear.

Perhaps it's cynical or solipsistic to conclude that the timing of Dave Hitz's and Chuck Hollis' blogs was designed to coincide with the release of our new product and perhaps take some of the wind out of our sails, but I will (as the commenters on Dave's Blog have) take it as a sign that we're on the right track. For the moment, I'll put my faith in this bit of marketing material enigmatically referenced in a number of NetApp blogs on the subject of flash:

In today's competitive environment, bringing a product or service to market faster than the competition can make a significant difference. Releasing a product to market in a shorter time can give you first-mover advantage and result in larger market share and higher revenues.
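The two back-of-the-envelope arguments in this post (mirrored flash versus a hybrid pool for 50TB of data, and caching metadata to cut dependent disk accesses) can be made concrete with a short Python sketch. The capacities, the three dependent accesses, and the 50μs flash latency come from the text; the relative prices and the 8ms disk access time are purely illustrative assumptions, as are all the function names.

```python
def mirrored_flash_cost(data_tb: float, flash_price_per_tb: float) -> float:
    """Cost of an all-flash pool that mirrors every byte (2x capacity)."""
    return 2 * data_tb * flash_price_per_tb


def hybrid_cost(flash_tb: float, disk_tb: float,
                flash_price_per_tb: float, disk_price_per_tb: float) -> float:
    """Cost of flash used as a cache in front of redundant disks."""
    return flash_tb * flash_price_per_tb + disk_tb * disk_price_per_tb


def dependent_read_latency_ms(accesses: int, served_from_flash: int,
                              disk_ms: float, flash_ms: float) -> float:
    """Latency of a chain of dependent reads (metadata first, then data).

    `served_from_flash` of the accesses hit the flash cache; the rest
    go to disk. Reads are dependent, so their latencies simply add.
    """
    if not 0 <= served_from_flash <= accesses:
        raise ValueError("served_from_flash must be between 0 and accesses")
    return served_from_flash * flash_ms + (accesses - served_from_flash) * disk_ms


if __name__ == "__main__":
    # ASSUMPTION: flash costs 10x disk per TB (relative units, not real prices).
    flash_price, disk_price = 10.0, 1.0
    print("mirrored flash:", mirrored_flash_cost(50, flash_price))
    print("hybrid pool:   ", hybrid_cost(50, 60, flash_price, disk_price))

    # ASSUMPTION: 8 ms per random disk access; 0.05 ms (50 us) per flash read.
    all_disk = dependent_read_latency_ms(3, 0, disk_ms=8.0, flash_ms=0.05)
    meta_cached = dependent_read_latency_ms(3, 2, disk_ms=8.0, flash_ms=0.05)
    print("speedup with metadata in flash: %.2fx" % (all_disk / meta_cached))
```

With two of the three dependent reads served from flash, only the final data read pays the disk penalty, which is where the "nearly a factor of three" comes from regardless of the exact disk latency assumed.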



Sun Storage 7410 space calculator

The Sun Storage 7410 is our expandable storage appliance that can be hooked up to anywhere from one to twelve JBODs with 24 1TB disks. With all those disks we provide several different options for how to arrange them into your storage pool: double-parity RAID-Z, wide-stripe double-parity RAID-Z, mirror, striped, and single-parity RAID-Z with narrow stripes. Each of these options has a different mix of availability, performance, and capacity that are described both in the UI and in the installation documentation. With the wide array of supported configurations, it can be hard to know how much usable space each will support.

To address this, I wrote a python script that presents a hypothetical hardware configuration to an appliance and reports back the available options. We use the logic on the appliance itself to ensure that the results are completely accurate, as the same algorithms would be applied as when the physical pallet of hardware shows up. This, of course, requires you to have an appliance available to query; fortunately, you can run a virtual instance of the appliance on your laptop.

You can download the sizecalc.py here; you'll need python installed on the system where you run it. Note that the script uses XML-RPC to interact with the appliance, and consequently it relies on unstable interfaces that are subject to change. Others are welcome to interact with the appliance at the XML-RPC layer, but note that it's unstable and unsupported. If you're interested in scripting the appliance, take a look at Bryan's recent post. Feel free to post comments here if you have questions, but there's no support for the script, implied, explicit, unofficial or otherwise.

Running the script by itself produces a usage help message:

    $ ./sizecalc.py
    usage: ./sizecalc.py [ -h <half jbod count> ] <appliance name or address> <root password> <jbod count>

Remember that you need a Sun Storage 7000 appliance (even a virtual one) to execute the capacity calculation.
In this case, I'll specify a physical appliance running in our lab, and I'll start with a single JBOD (note that I've redacted the root password, but of course you'll need to type in the actual root password for your appliance):

    $ ./sizecalc.py catfish ***** 1
    type          NSPF   width  spares  data drives  capacity (TB)
    raidz2        False  11     2       22           18
    raidz2 wide   False  23     1       23           21
    mirror        False  2      2       22           11
    stripe        False  0      0       24           24
    raidz1        False  4      4       20           15

Note that with only one JBOD no configurations support NSPF (No Single Point of Failure) since that one JBOD is always a single point of failure. If we go up to three JBODs, we'll see that we have a few more options:

    $ ./sizecalc.py catfish ***** 3
    type          NSPF   width  spares  data drives  capacity (TB)
    raidz2        False  13     7       65           55
    raidz2        True   6      6       66           44
    raidz2 wide   False  34     4       68           64
    raidz2 wide   True   6      6       66           44
    mirror        False  2      4       68           34
    mirror        True   2      4       68           34
    stripe        False  0      0       72           72
    raidz1        False  4      4       68           51

In this case we have to give up a bunch of capacity in order to attain NSPF. Now let's look at the largest configuration we support today with twelve JBODs:

    $ ./sizecalc.py catfish ***** 12
    type          NSPF   width  spares  data drives  capacity (TB)
    raidz2        False  14     8       280          240
    raidz2        True   14     8       280          240
    raidz2 wide   False  47     6       282          270
    raidz2 wide   True   20     8       280          252
    mirror        False  2      4       284          142
    mirror        True   2      4       284          142
    stripe        False  0      0       288          288
    raidz1        False  4      4       284          213
    raidz1        True   4      4       284          213

The size calculator also allows you to model a system with Logzilla devices, write-optimized flash devices that form a key part of the Hybrid Storage Pool. After you specify the number of JBODs in the configuration, you can include a list of how many Logzillas are in each JBOD.
For example, the following invocation models twelve JBODs with four Logzillas in the first 2 JBODs:

    $ ./sizecalc.py catfish ***** 12 4 4
    type          NSPF   width  spares  data drives  capacity (TB)
    raidz2        False  13     7       273          231
    raidz2        True   13     7       273          231
    raidz2 wide   False  55     5       275          265
    raidz2 wide   True   23     4       276          252
    mirror        False  2      4       276          138
    mirror        True   2      4       276          138
    stripe        False  0      0       280          280
    raidz1        False  4      4       276          207
    raidz1        True   4      4       276          207

A very common area of confusion has been how to size Sun Storage 7410 systems, and the relationship between the physical storage and the delivered capacity. I hope that this little tool will help to answer those questions. A side benefit should be still more interest in the virtual version of the appliance, a subject I've been meaning to post about, so stay tuned.

Update December 14, 2008: A couple of folks requested that the script allow for modeling half-JBOD allocations because the 7410 allows you to split JBODs between heads in a cluster. To accommodate this, I've added a -h option that takes as its parameter the number of half JBODs. For example:

    $ ./sizecalc.py -h 12 ***** 0
    type          NSPF   width  spares  data drives  capacity (TB)
    raidz2        False  14     4       140          120
    raidz2        True   14     4       140          120
    raidz2 wide   False  35     4       140          132
    raidz2 wide   True   20     4       140          126
    mirror        False  2      4       140          70
    mirror        True   2      4       140          70
    stripe        False  0      0       144          144
    raidz1        False  4      4       140          105
    raidz1        True   4      4       140          105

Update February 4, 2009: Ryan Matthews and I collaborated on a new version of the size calculator that now lists the raw space available in TB (decimal as quoted by drive manufacturers for example) as well as the usable space in TiB (binary as reported by many system tools).
The latter also takes account of the sliver (1/64th) reserved by ZFS:

    $ ./sizecalc.py ***** 12
    type          NSPF   width  spares  data drives  raw (TB)  usable (TiB)
    raidz2        False  14     8       280          240.00    214.87
    raidz2        True   14     8       280          240.00    214.87
    raidz2 wide   False  47     6       282          270.00    241.73
    raidz2 wide   True   20     8       280          252.00    225.61
    mirror        False  2      4       284          142.00    127.13
    mirror        True   2      4       284          142.00    127.13
    stripe        False  0      0       288          288.00    257.84
    raidz1        False  4      4       284          213.00    190.70
    raidz1        True   4      4       284          213.00    190.70

Update June 17, 2009: Ryan Matthews with help from has again revised the size calculator to model both adding expansion JBODs and to account for the now expandable Sun Storage 7210. Take a look at Ryan's post for usage information. Here's an example of the output:

    $ ./sizecalc.py *** 1 h1 add 1 h add 1
    Sun Storage 7000 Size Calculator Version 2009.Q2
    type          NSPF   width  spares  data drives  raw (TB)  usable (TiB)
    mirror        False  2      5       42           21.00     18.80
    raidz1        False  4      11      36           27.00     24.17
    raidz2        False  10-11  4       43           35.00     31.33
    raidz2 wide   False  10-23  3       44           38.00     34.02
    stripe        False  0      0       47           47.00     42.08

Update September 16, 2009: Ryan Matthews updated the size calculator for the 2009.Q3 release. The update includes the new triple-parity RAID wide stripe and three-way mirror profiles:

    $ ./sizecalc.py boga *** 4
    Sun Storage 7000 Size Calculator Version 2009.Q3
    type          NSPF   width  spares  data drives  raw (TB)  usable (TiB)
    mirror        False  2      4       92           46.00     41.18
    mirror        True   2      4       92           46.00     41.18
    mirror3       False  3      6       90           30.00     26.86
    mirror3       True   3      6       90           30.00     26.86
    raidz1        False  4      4       92           69.00     61.77
    raidz1        True   4      4       92           69.00     61.77
    raidz2        False  13     5       91           77.00     68.94
    raidz2        True   8      8       88           66.00     59.09
    raidz2 wide   False  46     4       92           88.00     78.78
    raidz2 wide   True   8      8       88           66.00     59.09
    raidz3 wide   False  46     4       92           86.00     76.99
    raidz3 wide   True   11     8       88           64.00     57.30
    stripe        False  0      0       96           96.00     85.95
    ** As of 2009.Q3, the raidz2 wide profile has been deprecated.
    ** New configurations should use the raidz3 wide profile.
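The arithmetic behind the capacity columns can be approximated offline. The sketch below is not sizecalc.py and does not talk to an appliance; it is a simplified model that happens to reproduce the fixed-width rows in the tables above, assuming 1TB drives and that every stripe gives up `parity` drives to redundancy. It does not model how the appliance chooses widths, spares, or NSPF layouts, and it does not handle the mixed-width rows such as "10-11". The function names are made up for this example.

```python
TB = 10**12    # decimal terabyte, as quoted by drive manufacturers
TIB = 2**40    # binary tebibyte, as reported by many system tools


def raw_capacity_tb(data_drives: int, width: int, parity: int,
                    drive_tb: float = 1.0) -> float:
    """Raw capacity of `data_drives` arranged in stripes of `width` drives.

    Each stripe loses `parity` drives to redundancy: 1 for raidz1 (and for
    a 2-way mirror, modeled here as width=2), 2 for raidz2, 3 for raidz3.
    """
    if width <= parity or data_drives % width != 0:
        raise ValueError("data_drives must be a whole number of stripes")
    stripes = data_drives // width
    return stripes * (width - parity) * drive_tb


def usable_tib(raw_tb: float) -> float:
    """Convert raw TB to usable TiB, minus the 1/64th sliver ZFS reserves."""
    return raw_tb * TB / TIB * (1 - 1 / 64)


if __name__ == "__main__":
    # Twelve-JBOD raidz2 row: width 14, 280 data drives.
    raw = raw_capacity_tb(280, width=14, parity=2)
    print("raw (TB): %.2f  usable (TiB): %.2f" % (raw, usable_tib(raw)))
```

For real sizing, trust the appliance (or the simulator): it applies the actual allocation logic, which this model only imitates for the simple cases.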



Hybrid Storage Pools in the 7410

The Sun Storage 7000 Series launches today, and with it Sun has the world's first complete product that seamlessly adds flash into the storage hierarchy in what we call the Hybrid Storage Pool. The HSP represents a departure from convention, and a new way of thinking about designing a storage system. I've written before about the principles of the HSP, but now that it has been formally announced I can focus on the specifics of the Sun Storage 7000 Series and how it implements the HSP.

Sun Storage 7410: The Cadillac of HSPs

The best example of the HSP in the 7000 Series is the 7410. This product combines a head unit (or two for high availability) with as many as 12 J4400 JBODs. By itself, this is a pretty vanilla box: big, economical, 7200 RPM drives don't win any races, and the maximum of 128GB of DRAM is certainly a lot, but some workloads will be too big to fit in that cache. With flash, however, this box turns into quite the speed demon.

Logzilla

The write performance of 7200 RPM drives isn't terrific. The appalling thing is that the next best solution (15K RPM drives) isn't really that much better: a factor of two or three at best. To blow the doors off, the Sun Storage 7410 allows up to four write-optimized flash drives per JBOD, each of which is capable of handling 10,000 writes per second. We call this flash device Logzilla.

Logzilla is a flash-based SSD that contains a pretty big DRAM cache backed by a supercapacitor so that the cache can effectively be treated as nonvolatile. We use Logzilla as a ZFS intent log device so that synchronous writes are directed to Logzilla and clients incur only that 100μs latency. This may sound a lot like how NVRAM is used to accelerate storage devices, and it is, but there are some important advantages of Logzilla. The first is capacity: most NVRAM maxes out at 4GB. That might seem like enough, but I've talked to enough customers to realize that it really isn't and that performance cliff is an awful long way down.
Logzilla is an 18GBdevice which is big enough to hold the necessary data while ZFS syncs it outto disk even running full tilt.The second problem with NVRAM scalability: once you've stretchedyour NVRAM to its limit there's not much you can do. If your system supportsit (and most don't) you can add another PCI card, but those slots tend to bevaluable resources for NICs and HBAs, and even then there's necessarily apretty small number to which you could conceivably scale. Logzilla is an SSDsitting in a SAS JBOD so it's easy to plug more devices into ZFS and use themas a growing pool of intent log devices.ReadzillaThe standard practice in storage systems is to use the available DRAM as aread cache for data that is likely to be frequently accessed, and the 7000Series does the same. In fact, it can do quite a better job of it because,unlike most storage systems which stop at 64GB of cache, the 7410 has up to256GB of DRAM to use as a read cache. As I mentioned before, that's still notgoing to be enough to cache the entire working set for a lot of use cases.This is where we at Fishworks came up with the innovative solution of usingflash as a massive read cache. The 7410 can accomodate up to six 100GB,read-optimized, flash SSDs; accordingly, we call this device Readzilla.With Readzilla, a maximum 7410 configuration can have 256GB of DRAM providingsub-μs latency to cached data and 600GB worth of Readzilla servicing readrequests in around 50-100μs. Forgive me for stating the obvious: that's 856GB of cache &mdash. That may not suffice to cache all workloads,but it's certainly getting there. As with Logzilla, a wonderful property ofReadzilla is its scalability. You can change the number of Readzilla devicesto match your workload. Further, you can choose the right combination of DRAMand Readzilla to provide the requisite service times with the appopriatecost and power use. 
Readzilla is cheaper and less power-hungry than DRAM soapplications that don't need the blazing speed of DRAM can prefer the moreeconomical flash cache. It's a flexible solution that can be adapted tospecific needs.Putting It All TogetherWe started with DRAM and 7200 RPM disks, and by adding Logzilla andReadzilla the Sun Storage 7410 also has great write and read IOPS. Further,you can design the specific system you need with just the rightbalance of write IOPS, read IOPS, throughput, capacity, power-use, and cost.Once you have a system, the Hybrid Storage Pool lets you solve problems withtargeted solutions. Need capacity? Add disk. Out of read IOPS? Toss in anotherReadzilla or two. Write bogging down? Another Logzilla will net you another10,000 write IOPS. In the old model, of course, all problems were simplebecause the solution was always the same: buy more fast drives. The HSP inthe 7410 lets you address the specific problem you're having without payingfor a solution to three other problems that you don't have.Of course, this means that administrators need to better understand theperformance limiters, and fortunately the Sun Storage 7000 Series has a greatanswer to that in Analytics. Pop over toBryan's blogwhere he talks all about that feature of the Fishworks software stack and howto use it to find performance problems on the 7000 Series. If you want toread more details about Hybrid Storage Pools and how exactly all this works,take a lookmyarticle on the subject in CACM, aswell asthispost about the L2ARC (the magic behind using Readzilla) and a nice marketing pitchon HSPs.
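
The sizing arithmetic in this post (256GB of DRAM plus six 100GB Readzillas, and roughly 10,000 write IOPS per Logzilla) can be sketched in a few lines of Python. This is only an illustration: the class and method names are made up for the example, and treating write IOPS as scaling linearly with the number of Logzillas is a simplifying assumption taken from the "another Logzilla will net you another 10,000 write IOPS" rule of thumb.

```python
from dataclasses import dataclass

# Figures quoted in the post; the names themselves are hypothetical.
READZILLA_GB = 100             # capacity of one read-optimized SSD
LOGZILLA_WRITE_IOPS = 10_000   # writes per second handled by one Logzilla


@dataclass
class HybridPoolConfig:
    """A toy description of one 7410 configuration (illustrative only)."""
    dram_gb: int
    readzillas: int
    logzillas: int

    def cache_gb(self) -> int:
        # DRAM cache plus the flash read cache behind it.
        return self.dram_gb + self.readzillas * READZILLA_GB

    def log_write_iops(self) -> int:
        # Assumption: each Logzilla contributes its full rated write IOPS.
        return self.logzillas * LOGZILLA_WRITE_IOPS


max_cache = HybridPoolConfig(dram_gb=256, readzillas=6, logzillas=4)
print(max_cache.cache_gb(), "GB of cache")
print(max_cache.log_write_iops(), "synchronous write IOPS (rule of thumb)")
```

Changing `readzillas` or `logzillas` in the sketch mirrors the point of the post: read cache and write IOPS can be adjusted independently of disk capacity.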



Apple updates DTrace... again

Back in January, I ranted about Apple's ham-handed breakage in their DTrace port. After some injured feelings and teary embraces, Apple cleaned things up a bit, but some nagging issues remained as I wrote:

"For the Apple folks: I'd argue that revealing the name of otherwise untraceable processes is no more transparent than what Activity Monitor provides -- could I have that please?"

It would be very un-Apple to -- you know -- communicate future development plans, but in 10.5.5, DTrace has seen another improvement. Previously when using DTrace to observe the system at large, iTunes and other paranoid apps would be hidden; now they're showing up on the radar:

    # dtrace -n 'profile-1999{ @[execname] = count(); }'
    dtrace: description 'profile-1999' matched 1 probe
    ^C
      loginwindow            2
      fseventsd              3
      kdcmond                5
      socketfilterfw         5
      distnoted              7
      mds                    8
      dtrace                12
      punchin-helper        12
      Dock                  20
      Mail                  25
      Terminal              26
      SystemUIServer        28
      Finder                42
      Activity Monito       49
      pmTool                67
      WindowServer         184
      iTunes              1482
      kernel_task         4030

And of course, you can use generally available probes to observe only those touchy apps with a predicate:

    # dtrace -n 'syscall:::entry/execname == "iTunes"/{ @[probefunc] = count(); }'
    dtrace: description 'syscall:::entry' matched 427 probes
    ^C
    ...
      pwrite                13
      read                  13
      stat64                13
      open_nocancel         14
      getuid                22
      getdirentries         26
      pread                 29
      stat                  32
      gettimeofday          34
      open                  36
      close                 37
      geteuid               65
      getattrlist          199
      munmap               328
      mmap                 338

Predictably, the details of iTunes are still obscured:

    # dtrace -n pid42896:::entry
    ...
    dtrace: error on enabled probe ID 225607 (ID 69364: pid42896:libSystem.B.dylib:pthread_mutex_unlock:entry): invalid user access in action #1
    dtrace: error on enabled probe ID 225546 (ID 69425: pid42896:libSystem.B.dylib:spin_lock:entry): invalid user access in action #1
    dtrace: 1005103 drops on CPU 1

... which is fine by me; I've got code of my own I should be investigating. While I'm loath to point it out, an astute reader and savvy DTrace user will note that Apple may have left the door open an inch wider than they had anticipated. Anyone care to post some D code that makes use of that inch? I'll post an update as a comment in a week or two if no one sees it.

Update: There were some good ideas in the comments. Here's the start of a script that can let you follow the flow of control of a thread in an "untraceable" process:

    #!/usr/sbin/dtrace -s

    #pragma D option quiet

    pid$target:libSystem.B.dylib::entry,
    pid$target:libSystem.B.dylib::return
    {
        trace("this program is already traceable\n");
        exit(0);
    }

    ERROR
    /self->level < 0 || self->level > 40/
    {
        self->level = 0;
    }

    ERROR
    {
        this->p = ((dtrace_state_t *)arg0)->dts_ecbs[arg1 - 1]->dte_probe;
        this->mod = this->p->dtpr_mod;
        this->func = this->p->dtpr_func;
        this->entry = ("entry" == stringof(this->p->dtpr_name));
    }

    ERROR
    /this->entry/
    {
        printf("%*s-> %s:%s\n", self->level * 2, "", stringof(this->mod), stringof(this->func));
        self->level++;
    }

    ERROR
    /!this->entry/
    {
        self->level--;
        printf("%*s



A glimpse into Netapp's flash future

The latest edition of Communications of the ACM includes a panel discussion between "seven world-class storage experts". The primary topic was flash memory and how it impacts the world of storage. The most interesting comment came from Steve Kleiman, Senior Vice President and Chief Scientist at Netapp:

"My theory is that whether it’s flash, phase-change memory, or something else, there is a new place in the memory hierarchy. There was a big blank space for decades that is now filled and a lot of things that need to be rethought. There are many implications to this, and we’re just beginning to see the tip of the iceberg."

The statement itself isn't earth-shattering -- it would be immodest to say so as I reached the same conclusion in my own CACM article last month -- with price trends and performance characteristics, it's obvious that flash has become relevant; those running the numbers as Steve Kleiman has will come to the same conclusion about how it might integrate into a system. What's interesting is that the person at Netapp "responsible for setting future technology directions for the company" has thrown his weight behind the idea. I look forward to seeing how this is manifested in Netapp's future offerings. Will it look something like the Hybrid Storage Pool (HSP) that we've developed with ZFS? Or might it integrate flash more explicitly into the virtual memory system in ONTAP, Netapp's embedded operating system?

Soon enough we should start seeing products in the market that validate our expectations for flash and its impact on enterprise storage.



Hybrid Storage Pools in CACM

As I mentioned in my previous post, I wrote an article about the hybrid storage pool (HSP); that article appears in the recently released July issue of Communications of the ACM. You can find it here. In the article, I talk about a novel way of augmenting the traditional storage stack with flash memory as a new level in the hierarchy between DRAM and disk, as well as the ways in which we've adapted ZFS and optimized it for use with flash.

So what's the impact of the HSP? Very simply, the article demonstrates that, considering the axes of cost, throughput, capacity, IOPS and power-efficiency, HSPs can match and exceed what's possible with either drives or flash alone. Further, an HSP can be built or modified to address specific goals independently. For example, it's common to use 15K RPM drives to get high IOPS; unfortunately, they're expensive, power-hungry, and offer only a modest improvement. It's possible to build an HSP that can match the necessary IOPS count at a much lower cost both in terms of the initial investment and the power and cooling costs. As another example, people are starting to consider all-flash solutions to get very high IOPS with low power consumption. Using flash as primary storage means that some capacity will be lost to redundancy. An HSP can provide the same IOPS, but use conventional disks to provide redundancy, yielding a significantly lower cost.

My hope -- perhaps risibly naive -- is that HSPs will mean the eventual death of the 15K RPM drive. If it also puts to bed the notion of flash as general purpose mass storage, well, I'd be happy to see that as well.



Flash, Hybrid Pools, and Future Storage

Jonathan had a terrific post yesterday that does an excellent job of presenting Sun's strategy for flash for the next few years. With my colleagues at Fishworks, an advanced product development team, I've spent more than a year working with flash and figuring out ways to integrate flash into ZFS, the storage hierarchy, and our future storage products -- a fact to which John Fowler, EVP of storage, alluded recently. Flash opens surprising new vistas; it's exciting to see Sun leading in this field, and it's frankly exciting to be part of it.

Jonathan's post sketches out some of the basic ideas on how we're going to be integrating flash into ZFS to create what we call hybrid storage pools that combine flash with conventional (cheap) disks to create an aggregate that's cost-effective, power-efficient, and high-performing by capitalizing on the strengths of the component technologies (not unlike a hybrid car). We presented some early results at IDF which have already been getting a bit of buzz. Next month I have an article in Communications of the ACM that provides many more details on what exactly a hybrid pool is and how exactly it works. I've pulled out some excerpts from that article and included them below as a teaser and will be sure to post an update when the full article is available in print and online.

While its prospects are tantalizing, the challenge is to find uses for flash that strike the right balance of cost and performance. Flash should be viewed not as a replacement for existing storage, but rather as a means to enhance it. Conventional storage systems mix dynamic memory (DRAM) and hard drives; flash is interesting because it falls in a sweet spot between those two components for both cost and performance in that flash is significantly cheaper and denser than DRAM and also significantly faster than disk. Flash accordingly can augment the system to form a new tier in the storage hierarchy – perhaps the most significant new tier since the introduction of the disk drive with RAMAC in 1956.

...

A brute force solution to improve latency is to simply spin the platters faster to reduce rotational latency, using 15k RPM drives rather than 10k RPM or 7,200 RPM drives. This will improve both read and write latency, but only by a factor of two or so. ...

...

ZFS provides for the use of a separate intent-log device, a slog in ZFS jargon, to which synchronous writes can be quickly written and acknowledged to the client before the data is written to the storage pool. The slog is used only for small transactions while large transactions use the main storage pool – it's tough to beat the raw throughput of large numbers of disks. The flash-based log device would be ideally suited for a ZFS slog. ... Using such a device with ZFS in a test system, latencies measure in the range of 80-100µs which approaches the performance of NVRAM while having many other benefits. ...

...

By combining the use of flash as an intent-log to reduce write latency with flash as a cache to reduce read latency, we can create a system that performs far better and consumes less power than other systems of similar cost. It's now possible to construct systems with a precise mix of write-optimized flash, flash for caching, DRAM, and cheap disks designed specifically to achieve the right balance of cost and performance for any given workload with data automatically handled by the appropriate level of the hierarchy. ... Most generally, this new flash tier can be thought of as a radical form of hierarchical storage management (HSM) without the need for explicit management.

Updated July 1: I've posted the link to the article in my subsequent blog post.
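
The "factor of two or so" in the excerpt about faster spindles follows from simple arithmetic: on average a request waits half a revolution for the right sector to come around. A minimal sketch, assuming that only rotational latency matters (seek time and transfer time are ignored, and the function name is made up for the example):

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: half of one revolution, in milliseconds."""
    seconds_per_revolution = 60.0 / rpm
    return 0.5 * seconds_per_revolution * 1000.0


for rpm in (7200, 10000, 15000):
    print(f"{rpm:>6} RPM: {avg_rotational_latency_ms(rpm):.2f} ms")

# Ratio between the slowest and fastest spindle speeds mentioned above.
speedup = avg_rotational_latency_ms(7200) / avg_rotational_latency_ms(15000)
print(f"7,200 -> 15k RPM improves rotational latency by {speedup:.2f}x")
```

Under these assumptions the improvement is just the ratio of the spindle speeds, which is why spinning faster can never buy much more than 2x, while the flash latencies quoted in the excerpt are measured in tens of microseconds rather than milliseconds.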



Apple updates DTrace

Back in January, I posted about a problem with Apple's port of DTrace to Mac OS X. The heart of the issue is that their port would silently drop data such that certain experiments would be quietly invalid. Unfortunately, most reactions seized on a headline paraphrasing a line of the post -- albeit with the critical negation omitted (the subject and language were, perhaps, too baroque to expect the press to read every excruciating word). The good news is that Apple has (quietly) fixed the problem in Mac OS X 10.5.3.

One issue was that timer based probes wouldn't fire if certain applications were actively executing (e.g. iTunes). This was evident both by counting periodic probe firings, and by the absence of certain applications when profiling. Apple chose to solve this problem by allowing the probes to fire while denying any inspection of untraceable processes (and generating a verbose error in that case). This script which should count 1000 firings per virtual CPU gave sporadic results on earlier revisions of Mac OS X 10.5:

    profile-1000
    {
        @ = count();
    }

    tick-1s
    {
        printa(@);
        clear(@);
    }

On 10.5.3, the output is exactly what one would expect on a 2-core CPU (1,000 executions per core):

      1  22697  :tick-1s  2000
      1  22697  :tick-1s  2000

On previous revisions, profiling to see what applications were spending the most time on CPU would silently omit certain applications. Now, while we can't actually peer into those apps, we can infer the presence of stealthy apps when we encounter an error:

    profile-199
    {
        @[execname] = count();
    }

    ERROR
    {
        @["=stealth app="] = count();
    }

Running this DTrace script will generate a lot of errors as we try to evaluate the execname variable for secret applications, but at the end we'll end up with a table like this:

      Adium                  1
      GrowlHelperApp         1
      iCal                   1
      kdcmond                1
      loginwindow            1
      Mail                   2
      Activity Monito        3
      ntpd                   3
      pmTool                 6
      mlb-nexdef-auto       12
      Terminal              14
      =stealth app=         29
      WindowServer          34
      kernel_task          307
      Safari               571

A big thank you to Apple for making progress on this issue; the situation is now much improved and considerably more palatable. That said, there are a couple of problems. The first is squarely the fault of team DTrace: we should probably have a mode where errors aren't printed, particularly if the script is already handling them explicitly using an ERROR probe as in the script above. For the Apple folks: I'd argue that revealing the name of otherwise untraceable processes is no more transparent than what Activity Monitor provides -- could I have that please? Also, I'm not sure if this has always been true, but the ustack() action doesn't seem to work from the profile action so simple profiling scripts like this one produce a bunch of errors and no output:

    profile-199
    /execname == "Safari"/
    {
        @[ustack()] = count();
    }

But to reiterate: thank you thank you thank you, Steve, James, Tom, and the rest of the DTrace folks at Apple. It's great to see these issues being addressed. The whole DTrace community appreciates it.



dtrace.conf post-post-mortem

This originally was going to be a post-mortem on dtrace.conf, but so much time has passed that I doubt it qualifies anymore. Back in March, we held the first ever DTrace (un)conference, and I hope I speak for all involved when I declare it a terrific success. And our t-shirts (logo pictured) were, frankly, bomb. Here are some fairly random impressions from the day:

Notes on the demographics at dtrace.conf: Macs were the most prevalent laptops by quite a wide margin, and a ton of demos were done under VMware for the Mac. There were a handful of dvorak users who far outnumbered the Esperanto speakers (there were none) despite apparently similar rationales. There were, by a wide margin, more live demonstrations than I'd seen during a day of technical talks; there were probably fewer individual slides than demos -- exactly what we had in mind.

My favorite session brought the authors of the three DTrace ports to the front of the room to talk about porting, and answer questions (mostly from the DTrace team). I was excited that they agreed to work together on a wiki and on a DTrace porting project. Both would be great for new ports and for building a single repository that could integrate all the ports. I just have to see if I can get them to follow through now several weeks removed from the DTrace love-in...

Also particularly interesting were a demonstration of a DTrace-enabled Adobe Air prototype and the very clever mechanism behind the Java group's plan for native Java static probes (JSDT). Essentially, they're using the same technique as normal USDT, but dynamically generating the tracing description structures and sending them down to the kernel (slick).

The most interesting discussion resulted from Keith's presentation of vprobes -- a DTrace... um... inspired facility in VMware. While it is necessary to place a unified tracing mechanism at the lowest level of software abstraction (in DTrace's case, the kernel), it may also make sense to embed collaborating tracing frameworks at other levels of the stack. For example, the JVM could include a micro-DTrace which communicated with DTrace in the kernel as needed. This would both improve enabled performance (not a primary focus of DTrace), and allow for better domain-specific instrumentation and expression. I'll be interested to see how vprobes executes on this idea.

Requests from the DTrace community:

- more providers à la the recent nfs and proposed ip providers
- consistency between providers (kudos to those sending their providers to the DTrace discussion list for review)
- better compatibility with the ports -- several people observed that while they love the port to Leopard, Apple's spurious exclusion of the -G option created tricky conflicts

Ben was kind enough to video the entire day. We should have the footage publicly available in about a week. Thanks to all who participated; several recent projects have already gotten me excited for dtrace.conf(09).



DTrace and JavaOne: The End of the Beginning

It was a good run, but Jarod and I didn't make the cut for JavaOne this year...

2005

In 2005, Jarod came up with what he described as a jacked up way to use DTrace to get inside Java. This became the basis of the Java provider (first dvm for the 1.4.2 and 1.5 JVMs and now the hotspot provider for Java 6). That year, I got to stand up on stage at the keynote with John Loiacono and present DTrace for Java for the first time (to 10,000 people -- I was nervous). John was then the EVP of software at Sun. Shortly after that, he parlayed our keynote success into a sweet gig at Adobe (I was considered for the job, but ultimately rejected, they said, because their door frames couldn't accommodate my fro -- legal action is pending).

That year we also started the DTrace challenge. The premise was that if we chained up Jarod in the exhibition hall, developers could bring him their applications and he could use DTrace to find a performance win -- or he'd fork over a free iPod. In three years Jarod has given out one iPod and that one deserves a Bondsian asterisk.

After the excitement of the keynote, and the frenetic pace of the exhibition hall (and a haircut), Jarod and I anticipated at least fair interest in our talk, but we expected the numbers to be down a bit because we were presenting in the afternoon on the last day of the conference. We got to the room 15 minutes early to set up, skirting what we thought must have been the line for lunch, or free beer, or something, but turned out to be the line for our talk. Damn. It turns out that in addition to the 1,000 in the room, there was an overflow room with another 500-1,000 people. That first DTrace for Java talk had only the most basic features like tracing method entry and return, memory allocation, and Java stack backtraces -- but we already knew we were off to a good start.

2006

No keynote, but the DTrace challenge was on again and our talk reprised its primo slot on the last day of the conference after lunch (yes, that's sarcasm). That year the Java group took the step of including DTrace support in the JVM itself. It was also possible to dynamically turn instrumentation of the JVM off and on as opposed to the start-time option of the year before. In addition to our talk, there was a DTrace hands-on lab that was quite popular and got people some DTrace experience after watching what it can do in the hands of someone like Jarod.

2007

The DTrace talk in 2007 (again, last day of the conference after lunch) was actually one of my favorite demos I've given because I had never seen the technology we were presenting before. Shortly before JavaOne started, Lev Serebryakov from the Java group had built a way of embedding static probes in a Java program. While this isn't required to trace Java code, it does mean that developers can expose the higher level semantics of their programs to users and developers through DTrace. Jarod hacked up an example in his hotel room about 20 minutes before we presented, and amazingly it all went off without a hitch. How money is that?

JSDT -- as the Java Statically Defined Tracing is called -- is in development for the next version of the JVM, and is the next step for DTrace support of dynamic languages. Java was the first dynamic language that we considered for use with DTrace, and it's quite a tough environment to support due to the incredible sophistication of the JVM. That support has led the way for other dynamic languages such as Ruby, Perl, and Python which all now have built-in DTrace providers.

2008

For DTrace and Java, this is not the end. It is not even the beginning of the end. Jarod and I are out, but Jon, Simon, Angelo, Raghavan, Amit, and others are in. At JavaOne 2008 next month there will be a talk, a BOF, and a hands-on lab about DTrace for Java and it's not even all Java: there's some PHP and JavaScript mixed in and both also have their own DTrace providers. I've enjoyed speaking at JavaOne these past three years, and while it's good to pass the torch, I'll miss doing it again this year. If I have the time, and can get past security, I'll try to sneak into Jon and Simon's talk -- though it will be a departure from tradition for a DTrace talk to fall on a day other than the last.



Expand-O-Matic RAID-Z

I was having a conversation with an OpenBSD user and developer the other day, and he mentioned some ongoing work in the community to consolidate support for RAID controllers. The problem, he was saying, was that each controller had a different administrative model and utility -- but all I could think was that the real problem was the presence of a RAID controller in the first place! As far as I'm concerned, ZFS and RAID-Z have obviated the need for hardware RAID controllers.

ZFS users seem to love RAID-Z, but a frustratingly frequent request is to be able to expand the width of a RAID-Z stripe. While the ZFS community may care about solving this problem, it's not the highest priority for Sun's customers and, therefore, for the ZFS team. It's common for a home user to want to increase his total storage capacity by a disk or two at a time, but enterprise customers typically want to grow by multiple terabytes at once so adding on a new RAID-Z stripe isn't an issue. When the request has come up on the ZFS discussion list, we have, perhaps unhelpfully, pointed out that the code is all open source and ready for that contribution. Partly, it's because we don't have time to do it ourselves, but also because it's a tricky problem and we weren't sure how to solve it.

Jeff Bonwick did a great job explaining how RAID-Z works, so I won't go into it too much here, but the structure of RAID-Z makes it a bit trickier to expand than other RAID implementations. On a typical RAID with N+M disks, N data sectors will be written with M parity sectors. Those N data sectors may contain unrelated data, so modifying data on just one disk involves reading the data off that disk and updating both those data and the parity data. Expanding a RAID stripe in such a scheme is as simple as adding a new disk and updating the parity (if necessary).
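
The read-modify-write update on a conventional RAID stripe can be sketched as follows. This is an illustration only, assuming the single-parity case (M = 1) with plain XOR parity; the helper names are invented for the example and are not taken from any RAID implementation.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Read-modify-write: fold the old data out of the parity and the new data in."""
    return xor_blocks(old_parity, old_data, new_data)


# A stripe of N = 3 unrelated data sectors plus one parity sector.
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(*stripe)

# Modify the sector on disk 1 only: read it, then update data and parity in place.
new_sector = b"\xaa\xbb"
parity = update_parity(parity, stripe[1], new_sector)
stripe[1] = new_sector

# Expanding the stripe with a zero-filled disk leaves XOR parity unchanged.
expanded = stripe + [bytes(2)]
print(parity == xor_blocks(*expanded))
```

The last step shows why expansion is easy in this scheme: an all-zero sector contributes nothing to the XOR, so the existing parity stays valid when a blank disk is added.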
With RAID-Z, blocks are never rewritten in place, and there may be multiple logical RAID stripes (and multiple parity sectors) in a given row; we therefore can't expand the stripe nearly as easily.

A couple of weeks ago, I had lunch with Matt Ahrens to come up with a mechanism for expanding RAID-Z stripes -- we were both tired of having to deflect reasonable requests from users -- and, lo and behold, we figured out a viable technique that shouldn't be very tricky to implement. While Sun still has no plans to allocate resources to the problem, this roadmap should lend credence to the suggestion that someone in the community might work on the problem.

The rest of this post will discuss the implementation of expandable RAID-Z; it's not intended for casual users of ZFS, and there are no alchemic secrets buried in the details. It would probably be useful to familiarize yourself with the basic structure of ZFS, space maps (totally cool by the way), and the code for RAID-Z.

Dynamic Geometry

ZFS uses vdevs -- virtual devices -- to store data. A vdev may correspond to a disk or a file, or it may be an aggregate such as a mirror or RAID-Z. Currently the RAID-Z vdev determines the stripe width from the number of child vdevs. To allow for RAID-Z expansion, the geometry would need to be a more dynamic property. The storage pool code that uses the vdev would need to determine the geometry for the current block and then pass that as a parameter to the various vdev functions.

There are two ways to record the geometry. The simplest is to use the GRID bits (an 8-bit field) in the DVA (Device Virtual Address) which have already been set aside, but are currently unused. In this case, the vdev would need to have a new callback to set the contents of the GRID bits, and then a parameter to several of its other functions to pass in the GRID bits to indicate the geometry of the vdev when the block was written.
An alternative approach suggested by Jeff and Bill Moore is something they call time-dependent geometry. The basic idea is that we store a record each time the geometry of a vdev is modified and then use the creation time for a block to infer the geometry to pass to the vdev. This has the advantage of conserving precious bits in the fixed-width DVA (though at 128 bits it's still quite big), but it is a bit more complex since it would require essentially new metadata hanging off each RAID-Z vdev.

Metaslab Folding

When the user requests a RAID-Z vdev be expanded (via an existing or new zpool(1M) command-line option) we'll apply a new fold operation to the space map for each metaslab. This transformation will take into account the space we're about to add with the new devices. Each range [a, b] under a fold from width n to width m will become

    [ m * (a / n) + (a % n), m * (b / n) + (b % n) ]

The alternative would have been to account for m - n free blocks at the end of every stripe, but that would have been overly onerous both in terms of processing and in terms of bookkeeping. For space maps that are resident, we can simply perform the operation on the AVL tree by iterating over each node and applying the necessary transformation. For space maps which aren't in core, we can do something rather clever: by taking advantage of the log structure, we can simply append a new type of space map entry that indicates that this operation should be applied. Today we have allocated, free, and debug; this would add fold as an additional operation. We'd apply that fold operation to each of the 200 or so space maps for the given vdev. Alternatively, using the idea of time-dependent geometry above, we could simply append a marker to the space map and access the geometry from that repository.

Normally, we only rewrite the space map if the on-disk log structure is twice as large as necessary.
I'd argue that the fold operation should always trigger a rewrite since processing it always requires an O(n) operation, but that's really an ancillary point.

vdev Update

At the same time as the previous operation, the vdev metadata will need to be updated to reflect the additional device. This is mostly just bookkeeping, and a matter of chasing down the relevant code paths to modify and augment.

Scrub

With the steps above, we're actually done for some definition since new data will be written in stripes that include the newly added device. The problem is that extant data will still be stored in the old geometry and most of the capacity of the new device will be inaccessible. The solution to this is to scrub the data, reading off every block and rewriting it to a new location. Currently this isn't possible on ZFS, but Matt and Mark Maybee have been working on something they call block pointer rewrite which is needed to solve a variety of other problems and nicely completes this solution as well.

That's It

After Matt and I had finished thinking this through, I think we were both pleased by the relative simplicity of the solution. That's not to say that implementing it is going to be easy -- there are still plenty of gaps to fill in -- but the basic algorithm is sound. A nice property that falls out is that in addition to changing the number of data disks, it would also be possible to use the same mechanism to add an additional parity disk to go from single- to double-parity RAID-Z -- another common request.

So I can now extend a slightly more welcoming invitation to the ZFS community to engage on this problem and contribute in a very concrete way. I've posted some diffs which I used to sketch out some ideas; that might be a useful place to start. If anyone would like to create a project on OpenSolaris.org to host any ongoing work, I'd be happy to help set that up.
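
The fold transformation from the Metaslab Folding section can be written down directly. The following is a minimal sketch, not ZFS code: the function names are hypothetical, offsets are assumed to be counted in units of one device-width column, and the division in the formula is taken to be integer division.

```python
from typing import Tuple


def fold_offset(x: int, n: int, m: int) -> int:
    """Map an offset laid out at stripe width n to its position at width m.

    x // n is the row the offset sits in and x % n is its column; after the
    fold each row is m wide, and the column within the row is unchanged.
    """
    return m * (x // n) + (x % n)


def fold_range(rng: Tuple[int, int], n: int, m: int) -> Tuple[int, int]:
    """Apply the fold to both endpoints of a space map range [a, b]."""
    a, b = rng
    return (fold_offset(a, n, m), fold_offset(b, n, m))


# Growing a 4-wide vdev to 5-wide: offset 6 is row 1, column 2, so it moves
# to 5 * 1 + 2.
print(fold_offset(6, n=4, m=5))
print(fold_range((6, 9), n=4, m=5))
```

Applying `fold_range` to every node of an in-core space map corresponds to the AVL tree walk described above; the on-disk variant would instead record a single fold entry carrying n and m and apply the same function when the log is replayed.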



pid2proc for DTrace

The other day, there was an interesting post on the DTrace mailing list asking how to derive a process name from a pid. This really ought to be a built-in feature of D, but it isn't (at least not yet). I hacked up a solution to the user's problem by cribbing the algorithm from mdb's ::pid2proc function whose source code you can find here. The basic idea is that you need to look up the pid in pidhash to get a chain of struct pid that you need to walk until you find the pid in question. This in turn gives you an index into procdir which is an array of pointers to proc structures. To find out more about these structures, poke around the source code or mdb -k which is what I did.

The code isn't exactly gorgeous, but it gets the job done. It's a good example of probe-local variables (also somewhat misleadingly called clause-local variables), and demonstrates how you can use them to communicate values between clauses associated with a given probe during a given firing. You can try it out by running dtrace -c <your-command> -s <this-script>.

    BEGIN
    {
            this->pidp = `pidhash[$target & (`pid_hashsz - 1)];
            this->pidname = "-error-";
    }

    /* Repeat this clause to accommodate longer hash chains. */
    BEGIN
    /this->pidp->pid_id != $target && this->pidp->pid_link != 0/
    {
            this->pidp = this->pidp->pid_link;
    }

    BEGIN
    /this->pidp->pid_id != $target && this->pidp->pid_link == 0/
    {
            this->pidname = "-no such process-";
    }

    BEGIN
    /this->pidp->pid_id != $target && this->pidp->pid_link != 0/
    {
            this->pidname = "-hash chain too long-";
    }

    BEGIN
    /this->pidp->pid_id == $target/
    {
            /* Workaround for bug 6465277 */
            this->slot = (*(uint32_t *)this->pidp) >> 8;

            /* AHA! We finally have the proc_t. */
            this->procp = `procdir[this->slot].pe_proc;

            /* For this example, we'll grab the process name to print. */
            this->pidname = this->procp->p_user.u_comm;
    }

    BEGIN
    {
            printf("%d %s", $target, this->pidname);
    }

Note that the second clause is the bit that walks the hash chain.
You can repeat this clause as many times as you think will be needed to traverse the hash chain -- I really don't have any guidance here, but I imagine that a few times should suffice. Alternatively, you could construct a tick probe that steps along the hash chain to avoid a fixed limit. DTrace attempts to keep easy things easy and make difficult things possible. As evidenced by this example, possible doesn't necessarily correlate with beautiful.
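For readers who find the lookup easier to follow outside of D, here is a toy model of the same walk in Python. It is a sketch under loud assumptions: the `Pid` class, the `pid2name` function, and the sample tables are all hypothetical stand-ins, `procdir` is reduced to a list of names, and nothing here touches real kernel structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pid:
    """Toy stand-in for a kernel struct pid (illustrative fields only)."""
    pid_id: int
    pid_prslot: int                    # index into procdir
    pid_link: Optional["Pid"] = None   # next entry on the hash chain


def pid2name(pid, pidhash, procdir, max_steps=None):
    """Return the process name for pid, or None if it is not found."""
    # Bucket selection mirrors pidhash[pid & (pid_hashsz - 1)], which
    # only works when the table size is a power of two.
    entry = pidhash[pid & (len(pidhash) - 1)]
    steps = 0
    while entry is not None and entry.pid_id != pid:
        # A bounded walk plays the role of repeating the D clause a
        # fixed number of times.
        if max_steps is not None and steps >= max_steps:
            raise LookupError("hash chain too long")
        entry = entry.pid_link
        steps += 1
    if entry is None:
        return None
    return procdir[entry.pid_prslot]


if __name__ == "__main__":
    # Four buckets; pids 5 and 9 collide in bucket 1.
    chain = Pid(9, 1, Pid(5, 0))
    pidhash = [None, chain, None, None]
    procdir = ["bash", "sshd"]
    print(pid2name(5, pidhash, procdir))  # prints bash
```

The unbounded `while` loop is exactly what D does not offer, which is why the script has to repeat its second clause once per step along the chain.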



Mac OS X and the missing probes

As has been thoroughly recorded, Apple has included DTrace in Mac OS X. I've been using it as often as I have the opportunity, and it's a joy to be able to use the fruits of our labor on another operating system. But I hit a rather surprising case recently which led me to discover a serious problem with Apple's implementation.

A common trick with DTrace is to use a tick probe to report data periodically. For example, the following script reports the ten most frequently accessed files every 10 seconds:

    io:::start
    {
            @[args[2]->fi_pathname] = count();
    }

    tick-10s
    {
            trunc(@, 10);
            printa(@);
            trunc(@, 0);
    }

This was running fine, but it seemed as though it would occasionally skip one of the ten second iterations (particularly with certain apps in the background). Odd. So I wrote the following script to see what was going on:

    profile-1000
    {
            @ = count();
    }

    tick-1s
    {
            printa(@);
            clear(@);
    }

What this will do is fire a probe at 1000 Hz on all (logical) CPUs. Running this on a dual-core machine we'd expect to see it print out 2000 each time. Instead I saw this:

    0  22369  :tick-1s  1803
    0  22369  :tick-1s  1736
    0  22369  :tick-1s  1641
    0  22369  :tick-1s  3323
    0  22369  :tick-1s  1704
    0  22369  :tick-1s  1732
    0  22369  :tick-1s  1697
    0  22369  :tick-1s  5154

Kind of bizarre. The missing tick-1s probes explain the values over 2000, but weirder were the values so far under 2000. To explore a bit more I performed another DTrace experiment to see what applications were running when the profile probe fired:

    # dtrace -n profile-997'{ @[execname] = count(); }'
    dtrace: description 'profile-997' matched 1 probe
    ^C
      Finder                1
      configd               1
      DirectoryServic       2
      GrowlHelperApp        2
      llipd                 2
      launchd               3
      mDNSResponder         3
      fseventsd             4
      mds                   4
      lsd                   5
      ntpd                  6
      kdcmond               7
      SystemUIServer        8
      dtrace                8
      loginwindow           9
      pvsnatd              21
      Dock                 41
      Activity Monito      45
      pmTool               52
      Google Notifier      60
      Terminal            153
      WindowServer        238
      Safari             1361
      kernel_task        4247

While there's nothing suspicious about the output in itself, it was strange because I was listening to music at the time.
With iTunes. Where was iTunes?

I ran the first experiment again and caused iTunes to do more work which yielded these results:

    0  22369  :tick-1s  3856
    0  22369  :tick-1s  1281
    0  22369  :tick-1s  4770
    0  22369  :tick-1s  2271

So what was iTunes doing? To answer that I again turned to DTrace and used the following enabling to see what functions were being called most frequently by iTunes (whose process ID was 332):

    # dtrace -n 'pid332:::entry{ @[probefunc] = count(); }'
    dtrace: description 'pid332:::entry' matched 264630 probes

I let it run for a while, made iTunes do some work, and the result when I stopped the script? Nothing. The expensive DTrace invocation clearly caused iTunes to do a lot more work, but DTrace was giving me no output.

Which started me thinking... did they? Surely not. They wouldn't disable DTrace for certain applications.

But that's exactly what Apple's done with their DTrace implementation. The notion of true systemic tracing was a bit too egalitarian for their classist sensibilities so they added this glob of lard into dtrace_probe() -- the heart of DTrace:

    #if defined(__APPLE__)
            /*
             * If the thread on which this probe has fired belongs to a process marked P_LNOATTACH
             * then this enabling is not permitted to observe it. Move along, nothing to see here.
             */
            if (ISSET(current_proc()->p_lflag, P_LNOATTACH)) {
                    continue;
            }
    #endif /* __APPLE__ */

Wow. So Apple is explicitly preventing DTrace from examining or recording data for processes which don't permit tracing. This is antithetical to the notion of systemic tracing, antithetical to the goals of DTrace, and antithetical to the spirit of open source. I'm sure this was inserted under pressure from ISVs, but that makes the pill no easier to swallow. To say that Apple has crippled DTrace on Mac OS X would be a bit alarmist, but they've certainly undermined its efficacy and, in doing so, unintentionally damaged some of its most basic functionality.
To users of Mac OS X and of DTrace: Apple has done a service by porting DTrace, but let's convince them to go one step further and port it properly.
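The arithmetic behind the tick-1s readings can be made explicit with a small sketch. It assumes, as the post does, two logical CPUs and a profile probe firing 1000 times per second on each; the function name `analyze` is hypothetical, and the interval count it returns is only a lower bound, not a measurement.

```python
import math


def analyze(count, hz=1000, ncpus=2):
    """Return (min_intervals, shortfall) for one printed count.

    A single one-second interval can contribute at most hz * ncpus
    firings, so a larger count must span at least
    ceil(count / (hz * ncpus)) intervals.
    """
    per_interval = hz * ncpus
    min_intervals = max(1, math.ceil(count / per_interval))
    shortfall = min_intervals * per_interval - count
    return min_intervals, shortfall


if __name__ == "__main__":
    for count in (1803, 1736, 1641, 3323, 1704, 1732, 1697, 5154):
        intervals, missing = analyze(count)
        print(f"{count:5d}: >= {intervals} interval(s), {missing} firings short")
```

Under those assumptions the reading of 3323 must cover at least two seconds and 5154 at least three, and every reading still falls short of the expected total, which is consistent with profile firings being dropped rather than merely delayed.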




It's been more than a year since I first saw DTrace on Mac OS X, and now it's at last generally available to the public. Not only did Apple port DTrace, but they've also included a bunch of USDT providers. Perl, Python, Ruby -- they all ship in Leopard with built-in DTrace probes that allow developers to observe function calls, object allocation, and other points of interest from the perspective of that dynamic language. Apple did make some odd choices (e.g. no Java provider, spurious modifications to the publicly available providers, a different build process), but on the whole it's very impressive.

Perhaps it was too much to hope for, but with Apple's obvious affection for DTrace I thought they might include USDT probes for Safari. Specifically, probes in the JavaScript interpreter would empower developers in the same way they enabled Ruby, Perl, and Python developers. Fortunately, the folks at the Mozilla Foundation have already done the heavy lifting for Firefox -- it was just a matter of compiling Firefox on Mac OS X 10.5 with DTrace enabled:

There were some minor modifications I had to make to the Firefox build process to get everything working, but it wasn't too tricky. I'll try to get a patch submitted this week, and then Firefox will have the same probes on Mac OS X that it does -- thanks to Brendan's early efforts -- on Solaris. JavaScript developers take note: this is good news.



What-If Machine: DTrace Port

What if there were a port of DTrace to Linux: could such a thing be done without violating either the GPL or CDDL? Read on before you jump right to the comments section to add your two cents.

In my last post, I discussed an attempt to create a DTrace knockoff in Linux, and suggested that a port might be possible. Naively, I hoped that comments would examine the heart of my argument, bemoan the apparent NIH in the Linux knockoff, regret the misappropriation of slideware, and maybe discuss some technical details -- anything but dwell on licensing issues.

For this post, I welcome the debate. Open source licenses are important, and the choice can have a profound impact on the success of the software and the community. But conversations comparing the excruciating minutiae of one license and another are exhausting, and usually become pointless in a hurry. Having a concrete subject might lead to a productive conversation.

DTrace Port Details

Just for the sake of discussion, let's say that Google decides to port DTrace to Linux (everyone loves Google, right?). This isn't so far-fetched: Google uses Linux internally, maybe they're using SystemTap, maybe they're not happy with it, but they definitely (probably) care about dynamic tracing (just like all good system administrators and developers should). So suppose some engineers at Google take the following (purely hypothetical) steps:

Kernel Hooks

DTrace has a little bit of functionality that lives in the core kernel. The code to deal with invalid memory accesses, some glue between the kernel's dynamic linker and some of the DTrace instrumentation providers, and some simple, low-level routines cover the bulk of it. My guess is there are about 1500 lines of code all told: not trivial, but hardly insurmountable. Google implements these facilities in a manner designed to allow the results to be licensed under the GPL.
For example, I think it would be sufficient for someone to draft a specification and for someone else to implement it so long as the person implementing it hadn't seen the CDDL version. Google then posts the patch publicly.

DTrace Kernel Modules

The other DTrace kernel components are divided into several loadable kernel modules. There's the main DTrace module and then the instrumentation provider modules that connect to the core framework through an internal interface. These constitute the vast majority of the in-kernel DTrace code. Google modifies these to use slightly different interfaces (e.g. mutex_enter() becomes mutex_lock()); the final result is a collection of kernel modules still licensed under the CDDL. Of course, Google posts any modifications to CDDL files.

DTrace Libraries and Commands

It wouldn't happen for free, but the DTrace user-land components could just be directly ported. I don't believe there are any legal issues here.

So let's say that this is Google's DTrace port: their own hacked up kernel, some kernel modules operating under a non-GPL license, and some user-land components (also under a non-GPL license, but, again, I don't think that matters). Now some questions:

1. Legal To Run?

If Google assembled such a system, would it be legal to run on a development desktop machine? It seems to violate the GPL no more than, say, the nVidia drivers (which are presumably also running on that same desktop). What if Google installed the port on a customer-facing machine? Are there any additional legal complications there? My vote: legit.

2. Legal To Distribute?

Google distributes the Linux kernel patch (so that others can construct an identical kernel), and elsewhere they distribute the Linux-ready DTrace modules (in binary or source form): would that violate either license? It seems that it would potentially violate the GPL if a full system with both components were distributed together, but distributed individually it would certainly be fine.
My vote: legit, but straying into a bit of a gray area.

3. Patch Accepted?

I'm really just putting this here for completeness. Google then submits the changes to the Linux kernel and tries to get them accepted upstream. There seems to be a precedent for the Linux kernel not accepting code that's there merely to support non-GPL kernel modules, so I doubt this would fly. My vote: not gonna happen.

4. No Source?

What if Google didn't supply the source code to either component, and didn't distribute any of it externally? My vote: legal, but morally bankrupt.

You Make The Call

So what do you think? Note that I'm not asking if it would be "good", and I'm not concluding that this would obviate the need for direct support for a native dynamic tracing framework in the Linux kernel. What I want to know is whether or not this DTrace port to Linux would be legal (and why)? If not, what would happen to poor Google (e.g. would FSF ninjas storm the Googleplex)?

If you care to comment, please include some brief statement about your legal expertise. I for one am not a lawyer, have no legal background, have read both the GPL and CDDL and have a basic understanding of both, but claim to be an authority in neither. If you don't include some information with regard to that, I may delete your comment.



DTrace Knockoffs

Update 8/6/2007: Those of you interested in this entry may also want to check out my next entry on the legality of a hypothetical port of DTrace to Linux.

Tools We Wish We Had -- OSCON 7/26/2007

Last week at OSCON someone set up a whiteboard with the heading "Tools We Wish We Had". People added entries (wiki-style); this one in particular caught my eye:

    dtrace for Linux
    or something similar
    (LIKE SYSTEMTAP?)
    - jdub
    (NO, LIKE dtrace)
    - VLAD (like systemtap, but not crap)

DTrace

So what exactly were they asking for? DTrace is the tool developers and sysadmins have always needed -- whether they knew it or not -- but weren't able to express in words let alone code. Most simply (and least humbly) DTrace lets you express a question about nearly any aspect of the system and get the answer in a simple and concise form. And -- this is important -- you can do it safely on machines running in production as well as in development. With DTrace, you can look at the highest level software such as Ruby (as was the subject of my talk at OSCON) through all the layers of the software stack down to the lowest level kernel facilities such as I/O and scheduling. This systemic scope, production focus, and arbitrary flexibility are completely new, and provide literally unprecedented observability into complex software systems. We're scientists, we're detectives -- DTrace lets us form hypotheses, and prove or disprove them in an instant until we've come to an understanding of the problem, until we've solved the crime. Of course anyone using Linux would love a tool like that -- especially because DTrace is already available on Mac OS X, Solaris, and FreeBSD.

SystemTap

So is SystemTap like DTrace? To understand SystemTap, it's worth touching on the history of DTrace: Bryan cut the first code for DTrace in October of 2001; Mike tagged in moments later, and I joined up after a bit.
In September of 2003 we integrated DTrace into Solaris 10 which first became available to customers in November of 2003 and formally shipped and was open-sourced in January of 2005. Almost instantly we started to see the impact in the field. In terms of performance, Solaris has strong points and weak points; with DTrace we were suddenly able to understand where those bottlenecks were on customer systems and beat out other vendors by improving our performance -- not in weeks or months, but literally in a few hours. Now, I'm not saying that DTrace was the silver bullet by which all enemies were slain -- that's clearly not the case -- but it was turning some heads and winning some deals.

Now, this bit involves some hearsay and conjecture[1], but apparently some managers of significance at Red Hat, IBM, and Intel started to take note. "We've got to do something about this DTrace," one of them undoubtedly said with a snarl (as an underling dragged off the fresh corpse of an unlucky messenger). SystemTap was a direct reaction to the results we were achieving with DTrace -- not to DTrace as an innovative technology.

When the project started in January of 2005, early discussion by the SystemTap team referred to "inspiration" that they derived from DTrace. They had a mandate to come up with an equivalent, so I assumed that they had spent the time to truly understand DTrace: to come up with an equivalent for DTrace -- or really to duplicate any technology -- the first step is to understand what it is completely. From day one, DTrace was designed to be used on mission critical systems, to always be safe, to induce no overhead when not in use, to allow for arbitrary data gathering, and to have systemic scope from the kernel to user-land and on up the stack into higher level languages. Those fundamental constraints led to some important and non-obvious design decisions (e.g. 
our own language "D", a micro virtual machine, conservative probe point selection).

SystemTap -- the "Sorny" of dynamic tracing

Instead of taking the time to understand DTrace, and instead of using it and scouring the documentation, SystemTap charged ahead, completely missing the boat on safety with an architecture which is nearly impossible to secure (e.g. running a SystemTap script drops in a generated kernel module). Truly systemic scope remains an elusive goal as they're only toe-deep in user-land (forget about Ruby, Java, Python, etc). And innovations in DTrace such as scalable data aggregation and speculative tracing are replicated poorly if at all. By failing to examine DTrace, and by rushing to have some sort of response, SystemTap isn't like DTrace: it's a knockoff.

Amusingly, in an apparent attempt to salvage their self-respect, the SystemTap team later renounced their inspiration. Despite frequent mentions of DTrace in their early meetings and email, it turns out, DTrace didn't actually inspire them much at all:

    CVSROOT:        /cvs/systemtap
    Module name:    src
    Changes by:     kenistoj@sourceware.org 2006-11-02 23:03:09

    Modified files:
            . : stap.1.in

    Log message:
            Removed refs to dtrace, to which we were giving undue credit in terms of
            "inspiration."

you're not my real dad! <slam>

Bad Artists Copy...

So uninspired was the SystemTap team by DTrace, that they don't even advocate its use according to a presentation on profiling applications ("Tools that we avoid - dtrace [sic]"). In that same presentation there's an example of a SystemTap-based tool called udpstat.stp:

    $ udpstat.stp
    UDP_out  UDP_outErr  UDP_in  UDP_inErr  UDP_noPort
          0           0       0          0           0
          0           0       0          0           0
          4           0       0          0           0
          5           0       0          0           0
          5           0       0          0           0

...
whose output was likely "inspired" by udpstat.d -- part of the DTraceToolkit by Brendan Gregg:

    # udpstat.d
    UDP_out  UDP_outErr  UDP_in  UDP_inErr  UDP_noPort
          0           0       0          0           1
          0           0       0          0           2
          0           0       0          0           0
       1165           0       2          0           0

In another act of imitation reminiscent of liberal teenage borrowing from Wikipedia, take a look at Eugene Teo's slides from Red Hat Summit 2007 as compared with Brendan's DTrace Topics Intro wiki (the former apparently being generated by applying a sed script to the latter). For example:

What isn’t SystemTap

    SystemTap isn’t sentient; requires user thinking process
    SystemTap isn’t a replacement for any existing tools

What isn't DTrace

    DTrace isn't a replacement for kstat or SMNP
    kstat already provides inexpensive long term monitoring.
    DTrace isn't sentient, it needs to borrow your brain to do the thinking
    DTrace isn't “dTrace”

... Great Artists Steal

While some have chosen the knockoff route, others have taken the time to analyze what DTrace does, understood the need, and decided that the best DTrace equivalent would be... DTrace. As with the rest of Solaris, DTrace is open source so developers and customers are excited about porting. Just a few days ago there were a couple of interesting blog posts (here and here) by users of ONTAP, NetApp's appliance OS, not for a DTrace equivalent, but for a port of DTrace itself.

DTrace is already available in the developer builds of Mac OS X 10.5, and there's a functional port for FreeBSD. I don't think it's a stretch to say that DTrace itself is becoming the measuring stick -- if not the standard. Why reinvent the wheel when you can port it?

Time For Standards

At the end of my talk last week someone asked if there was a port of DTrace to Linux (not entirely surprising since OSCON has a big Linux user contingent). I told him to ask the Linux bigwigs (several of them were also at the conference); after all, we didn't do the port to Mac OS X, and we didn't do the port to FreeBSD.
We did extend our help to those developers, and they, in turn, helped DTrace by growing the community and through direct contributions[2].

We love to see DTrace on other operating systems, and we're happy to help.

So to the pretenders: enough already with the knockoffs. Your users want DTrace, you obviously want what DTrace offers, and the entire DTrace team and community are eager to help. I'm sure there's been some FUD about license incompatibilities, but it's certainly Sun's position (as stated by Sun's CEO Jonathan Schwartz at OSCON 2005) that such a port wouldn't violate the OpenSolaris license. And even closed-source kernel components are tolerated from the likes of Symantec (nee Veritas) and nVidia. Linux has been a champion of standards, eschewing proprietary solutions for free and open standards. DTrace might not yet be a standard, but a DTrace knockoff never will be.

[1] ... those are kinds of evidence

[2] including posts on the DTrace discussion forum comprehensible only to me and James



iSCSI DTrace provider and more to come

People often ask about the future direction of DTrace, and while we have some stuff planned for the core infrastructure, the future is really about extending DTrace's scope into every language, protocol, and application with new providers -- and this development is being done by many different members of the DTrace community. An important goal of this new work is to have consistent providers that work predictably. To that end, Brendan and I have started to sketch out an array of providers so that we can build a consistent model.

In that vein, I recently integrated a provider for our iSCSI target into Solaris Nevada (build 69, and it should be in a Solaris 10 update, but don't ask me which one). It's a USDT provider so the process ID is appended to the name; you can use * to avoid typing the PID of the iSCSI target daemon. Here are the probes with their arguments (some of the names are obvious; for others you might need to refer to the iSCSI spec):

    probe name                   args[0]       args[1]        args[2]
    iscsi*:::async-send          conninfo_t *  iscsiinfo_t *  -
    iscsi*:::login-command       conninfo_t *  iscsiinfo_t *  -
    iscsi*:::login-response      conninfo_t *  iscsiinfo_t *  -
    iscsi*:::logout-command      conninfo_t *  iscsiinfo_t *  -
    iscsi*:::logout-response     conninfo_t *  iscsiinfo_t *  -
    iscsi*:::data-receive        conninfo_t *  iscsiinfo_t *  -
    iscsi*:::data-request        conninfo_t *  iscsiinfo_t *  -
    iscsi*:::data-send           conninfo_t *  iscsiinfo_t *  -
    iscsi*:::nop-receive         conninfo_t *  iscsiinfo_t *  -
    iscsi*:::nop-send            conninfo_t *  iscsiinfo_t *  -
    iscsi*:::scsi-command        conninfo_t *  iscsiinfo_t *  iscsicmd_t *
    iscsi*:::scsi-response       conninfo_t *  iscsiinfo_t *  -
    iscsi*:::task-command        conninfo_t *  iscsiinfo_t *  -
    iscsi*:::task-response       conninfo_t *  iscsiinfo_t *  -
    iscsi*:::text-command        conninfo_t *  iscsiinfo_t *  -
    iscsi*:::text-response       conninfo_t *  iscsiinfo_t *  -

The argument structures are defined as follows:

    typedef struct conninfo {
            string ci_local;        /* local host address */
            string ci_remote;       /* remote host address */
            string ci_protocol;     /* protocol (ipv4, ipv6, etc) */
    } conninfo_t;

    typedef struct iscsiinfo {
            string ii_target;       /* target iqn */
            string ii_initiator;    /* initiator iqn */
            uint64_t ii_lun;        /* target logical unit number */
            uint32_t ii_itt;        /* initiator task tag */
            uint32_t ii_ttt;        /* target transfer tag */
            uint32_t ii_cmdsn;      /* command sequence number */
            uint32_t ii_statsn;     /* status sequence number */
            uint32_t ii_datasn;     /* data sequence number */
            uint32_t ii_datalen;    /* length of data payload */
            uint32_t ii_flags;      /* probe-specific flags */
    } iscsiinfo_t;

    typedef struct iscsicmd {
            uint64_t ic_len;        /* CDB length */
            uint8_t *ic_cdb;        /* CDB data */
    } iscsicmd_t;

Note that the arguments go from most generic (the connection for the application protocol) to most specific. As an aside, we'd like future protocol providers to make use of the conninfo_t so that one could write a simple script to see a table of frequent consumers for all protocols:

    iscsi*:::,
    http*:::,
    cifs:::
    {
            @[args[0]->ci_remote] = count();
    }

With the iSCSI provider you can quickly see which LUNs are most active:

    iscsi*:::scsi-command
    {
            @[args[1]->ii_target] = count();
    }

or the volume of data transmitted:

    iscsi*:::data-send
    {
            @ = sum(args[1]->ii_datalen);
    }

Brendan has been working on a bunch of iSCSI scripts -- those are great for getting started examining iSCSI



DTrace @ JavaOne 2007

This year, Jarod Jenson and I gave an updated version of our DTrace for Java (technology-based applications) talk:

The biggest new feature that we demonstrated is the forthcoming Java Statically-Defined Tracing (JSDT) which will allow developers to embed stable probes in their code as we can do today in the kernel with SDT probes and in C and C++ applications with USDT probes. While you can already trace Java applications (and C and C++ applications), static probes let the developer embed stable and semantically rich points of instrumentation that allow the user to examine the application without needing to understand its implementation. The Java version of this is so new I had literally never seen it until Jarod gave a demonstration during our talk. The basic idea is that you can define a new probe by constructing a USDTProbe instance specifying the provider, function, probe name, and argument signature:

    sun.dtrace.USDTProbe myprobe = new sun.dtrace.USDTProbe("myprovider", "myfunc", "myprobe", "ssl");

To fire the probe, you invoke the Call() method on the instance, and pass in the appropriate arguments.

Attendance was great, and we talked to a lot of people who had attended last year and had been getting mileage out of DTrace for Java. Next year, we're hoping to give the updated version of this talk on Tuesday (rather than Friday for once) and invite people to bring in their applications for a tune-up; we'll present the results in a case study-focussed talk on Friday.



gzip for ZFS update

The other day I posted about a prototype I had created that adds a gzip compression algorithm to ZFS. ZFS already allows administrators to choose to compress filesystems using the LZJB compression algorithm. This prototype introduced a more effective -- albeit more computationally expensive -- alternative based on zlib.

As an arbitrary measure, I used tar(1) to create and expand archives of an ON (Solaris kernel) source tree on ZFS filesystems compressed with lzjb and gzip algorithms as well as on an uncompressed ZFS filesystem for reference:

Thanks for the feedback. I was curious if people would find this interesting and they do. As a result, I've decided to polish this wad up and integrate it into Solaris. I like Robert Milkowski's recommendation of options for different gzip levels, so I'll be implementing that. I'll also upgrade the kernel's version of zlib from 1.1.4 to 1.2.3 (the latest) for some compression performance improvements. I've decided (with some hand-wringing) to succumb to the requests for me to make these code modifications available. This is not production quality. If anything goes wrong it's completely your problem/fault -- don't make me regret this. Without further disclaimer: pdf, patch

In reply to some of the comments:

UX-admin:

One could choose between lzjb for day-to-day use, or bzip2 for heavily compressed, "archival" file systems (as we all know, bzip2 beats the living daylights out of gzip in terms of compression about 95-98% of the time).

It may be that bzip2 is a better algorithm, but we already have (and need) zlib in the kernel, and I'm loath to add another algorithm.

ivanvdb25:

Hi, I was just wondering if the gzip compression has been enabled, does it give problems when an ZFS volume is created on an X86 system and afterwards imported on a Sun Sparc?

That isn't a problem.
Data can be moved from one architecture to another (and I'll be verifying that before I putback).

dennis:

Are there any documents somewhere explaining the hooks of zfs and how to add features like this to zfs? Would be useful for developers who want to add features like filesystem-based encryption to it. Thanks for your great work!

There aren't any documents exactly like that, but there's plenty of documentation in the code itself -- that's how I figured it out, and it wasn't too bad. The ZFS source tour will probably be helpful for figuring out the big picture.

Update 3/22/2007: This work was integrated into build 62 of onnv.



a small ZFS hack

I've been dabbling a bit in ZFS recently, and what's amazing is not just how well it solved the well-understood filesystem problem, but how its design opens the door to novel ways to manage data. Compression is a great example. An almost accidental by-product of the design is that your data can be stored compressed on disk. This is especially interesting in an era when we have CPU cycles to spare, far too few available IOPS, and disk latencies that you can measure with a stop watch (well, not really, but you get the idea). With ZFS you can trade in some of those spare CPU cycles for IOPS by turning on compression, and the additional latency introduced by decompression is dwarfed by the time we spend twiddling our thumbs waiting for the platter to complete another revolution.

smaller and smaller

Turning on compression in zfs (zfs set compression=on <dataset>) enables the so-called LZJB compression algorithm -- a variation on Lempel-Ziv tagged by its humble author. LZJB is fast, reasonably effective, and quite simple (compress and decompress are implemented in about a hundred lines of code). But the ZFS architecture can support many compression algorithms. Just as users can choose from several different checksum algorithms (fletcher2, fletcher4, or sha256), ZFS lets you pick your compression routine -- it's just that there's only the one so far.

putting the z(lib) in ZFS

I thought it might be interesting to add a gzip compression algorithm based on zlib. I was able to hack this up pretty quickly because the Solaris kernel already contains a complete copy of zlib (albeit scattered around a little) for decompressing CTF data for DTrace, and apparently for some sort of compressed PPP streams module (or whatever... I don't care).
Here's what the ZFS/zlib mash-up looks like (for the curious, this is with the default compression level -- 6 on a scale from 1 to 9):

    # zfs create pool/gzip
    # zfs set compression=gzip pool/gzip
    # cp -r /pool/lzjb/* /pool/gzip
    # zfs list
    NAME        USED   AVAIL  REFER  MOUNTPOINT
    pool/gzip   64.9M  33.2G  64.9M  /pool/gzip
    pool/lzjb   128M   33.2G  128M   /pool/lzjb

That's with a 1.2G crash dump (pretty much the most compressible file imaginable). Here are the compression ratios with a pile of ELF binaries (/usr/bin and /usr/lib):

    # zfs get compressratio
    NAME        PROPERTY       VALUE  SOURCE
    pool/gzip   compressratio  3.27x  -
    pool/lzjb   compressratio  1.89x  -

Pretty cool. Actually compressing these files with gzip(1) yields a slightly smaller result, but it's very close, and the convenience of getting the same compression transparently from the filesystem is awfully compelling. It's just a prototype at the moment. I have no idea how well it will perform in terms of speed, but early testing suggests that it will be lousy compared to LZJB. I'd be very interested in any feedback: Would this be a useful feature? Is there an ideal trade-off between CPU time and compression ratio? I'd like to see if this is worth integrating into OpenSolaris.
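One plausible reason gzip(1) on whole files comes out slightly ahead is that a filesystem compresses each block independently, so every block starts with an empty dictionary. The Python sketch below illustrates that effect with the standard zlib module. It is not the ZFS code path: the 128K record size, the function names, and the idea of summing per-record sizes are assumptions made for illustration, and it ignores on-disk details such as sector rounding.

```python
import zlib

# Assumption for illustration: data is compressed in independent 128K records.
RECORD_SIZE = 128 * 1024


def compressed_size_per_record(data, level=6, record_size=RECORD_SIZE):
    """Sum of compressed sizes when each record is compressed on its own."""
    total = 0
    for offset in range(0, len(data), record_size):
        record = data[offset:offset + record_size]
        total += len(zlib.compress(record, level))
    return total


def compressed_size_whole(data, level=6):
    """Compressed size when the whole buffer is compressed in one stream."""
    return len(zlib.compress(data, level))


def compress_ratio(original_size, compressed_size):
    """Ratio in the same sense as the compressratio property (e.g. 3.27x)."""
    return original_size / compressed_size


if __name__ == "__main__":
    sample = b"the quick brown fox jumps over the lazy dog\n" * 50000
    per_record = compressed_size_per_record(sample)
    whole = compressed_size_whole(sample)
    print(f"per-record: {compress_ratio(len(sample), per_record):.2f}x")
    print(f"whole file: {compress_ratio(len(sample), whole):.2f}x")
```

Changing the level argument between 1 and 9 gives a rough feel for the CPU-time versus ratio trade-off asked about above, though timings from user-level Python say little about kernel performance.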



It's tested or it's broken

It's amazing how lousy software is. That we as a society have come to accept buggy software as an inevitability is either a testament to our collective tolerance, or -- much more likely -- the near ubiquity of crappy software. So we are guilty of accepting low standards for software, but the smaller "we" of software writers are guilty of setting those low expectations. And I mean we: all of us. Every programmer has at some time written buggy software (or has never written any software of any real complexity), and while we're absolutely at fault, it's not for lack of exertion. From time immemorial PhD candidates have scratched their whiteboard markers dry in attempts to eliminate bugs with new languages, analysis, programming techniques, and styles. The simplest method for finding bugs before they're released into the wild remains the most generally effective: testing.

Of course, programmers perform at least nominal checks before integrating new code, but there's only so much a person can test by hand. So we've invented test suites -- collections of tests that require no interaction. Testing rigor is regarded by university computer science departments a bit like ditch-digging is by civil engineering departments: a bit pedestrian. So people tend to sort it out for themselves. Here are a couple of tips for software tests that have come out of my experience using and developing test suites (and the DTrace test suite in particular):

It has to be easy to run

A favorite mantra of a colleague of mine is that software is only as good as its test suite. While slightly less pithy, I'd add that a test suite is only as good as one's ability to run it. At Sun we have test suites for all kinds of crazy things. Many of them require elaborate configurations and complex installations.
Even when you manage to get everything set up (or, as often as not, find someone else to get it set up) and run, comprehending the results can require a visit from the high priestess of QA to scrutinize the pigeon entrails of the output logs.

Installing and executing a test suite needs to be so simple that it can be done by any moron who might have the wherewithal to be able to modify the software it tests (hint: that's usually a lower bar than you'd like). The same goes for understanding the results. Building the DTrace test suite creates a package which you then install wherever you want to perform the testing. Running it (by executing a single command) produces output indicating how many tests passed and how many failed. A single failure represents a bug. I've used test suites where there are expected failures (things are no more broken than they were), and unexpected failures (you broke something), but differentiating the two can be nearly impossible for a novice. Keep it simple and easy to understand, or don't bother at all -- no one will run tests they can't figure out.

Complete and up-to-date

Now that people are executing the test suite because it's such a breeze, it actually needs to test the software. I think it's productive to write tests both from the perspective of the implementation and the documented behavior, but there just needs to be adequate coverage -- and the extent of the coverage is often something you can test for with some accuracy. As the software is evolving, the test suite needs to evolve with it. Every enhancement or bug fix should be accompanied by new tests to verify the change, and to ensure that it's not regressed in the future. On projects I've worked on, the tests for certain features have required much more thought and effort than the feature itself, but skipping the test is absolutely unacceptable.
In short: a test suite should completely test the target software at any given moment.

With the code

Originally we developed the DTrace test suite as a separate code base. This caused some unanticipated problems. Since they were in different places, we would often integrate a change to DTrace and forget about the test for a couple of days -- violating the constraint noted above. Also, projects that lagged behind the main repository would run the test suite and encounter a bunch of spurious failures because they were effectively testing out-of-date software. We had similar problems when back-porting new DTrace features and fixes to Solaris 10.

The solution -- in a rare split decision among the DTrace team -- was to integrate the test suite into the same repository as the code. This has absolutely been the right move. Now we can update the code and the test suite literally at the same time, and we're forced to think about testing sooner and more rigorously. It's also proved beneficial for the back-porting effort since a given snapshot of the source base contains the correct tests for that code.

Run automatically

Ideally it shouldn't be necessary, but automatically running tests is a great way to ensure that errors don't creep in because of sloppy engineering or seemingly unrelated changes. This is actually an area where DTrace is a less compelling role model. If we had put this procedure in place, it would have helped us to catch at least one bug quite a bit earlier. Solaris Nevada -- the code name for the next Solaris release -- recently changed compiler versions, which resulted in a DTrace bug due to a newly aggressive optimizer on SPARC. The DTrace test suite picked this up immediately, but it wasn't run for at least a week after the compiler switch was made.
We're working to have it run nightly, and our new project has been running nightly tests for a few weeks now.

Go forth and test

I've spent too many hours trying to figure out how to run arcane test suites -- just so I can't be accused of unduly contributing to the crappy state of software. I hope some of these (admittedly less-than-brilliant) lessons learned from testing DTrace have been helpful. If you want to check out the DTrace test suite, you can see the code here and find the documentation for it here.
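To make the "easy to run" tip concrete, here is a minimal Python sketch of a runner that needs no interaction, prints how many tests passed and failed, and returns a non-zero status on any failure. It is not how the DTrace test suite works; run_suite and the (name, callable) test format are hypothetical names chosen for this illustration.

```python
def run_suite(tests):
    """Run (name, callable) pairs; a test fails if it raises any exception.

    Returns 0 when everything passes and 1 otherwise, so the result can be
    handed straight to sys.exit() by a nightly job.
    """
    passed = 0
    failures = []
    for name, test in tests:
        try:
            test()
            passed += 1
        except Exception as exc:  # includes AssertionError
            failures.append((name, exc))

    # Name each failure, then give the one-line summary a novice can read.
    for name, exc in failures:
        print(f"FAIL {name}: {exc!r}")
    print(f"passed: {passed}  failed: {len(failures)}")
    return 0 if not failures else 1
```

There is deliberately no notion of an "expected failure" here: every failure counts, which keeps the summary line unambiguous.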



DTrace: a history

An unsurprisingly common request on the DTrace discussion forum has been for updated documentation. People have been -- on the whole -- very pleased with the Solaris Dynamic Tracing Guide that we worked hard to produce, but I readily admit that we haven't been nearly as diligent in updating it. OK: we haven't updated it at all.

But we have been updating DTrace itself, adding new variables and functions, tacking on new features, adding new providers, and fixing bugs. But unless you've been scraping our putback logs, or reading between the lines on the discussion forum, these features haven't necessarily been obvious. To that end, I've scraped the putback logs, and tried to tease out some of the bigger features, and put them all on the DTrace Change Log. We'll try to keep this up to date so you can see what features are in the build of Solaris Nevada you're running or the Solaris 10 release.

This is probably going to be handy in its own right and ameliorate the documentation gap, but we do still need to update the documentation. I'm torn between keeping it in SGML (or converting it to XML), and converting it to a wiki. The former has the disadvantage of being overly onerous to update (witness the complete lack of updates), while the latter prevents us from releasing it in printed form (unless someone knows of a tool that can turn a wiki into a book). If anyone from the community is interested in working on this project, it would be a tremendous help to every DTrace user and developer.



DTrace user number one

Some people think DTrace was built for developers; others think it was for system administrators; some even think it was a tool designed just for Solaris kernel hackers but was so useful we decided to unleash it on the world. All wrong. The user we always had in mind was Solaris user extraordinaire Jarod Jenson. DTrace lets you explore virtually any element of the system -- its biggest limitation is the user's own knowledge of the system. Jarod has the most diverse and expansive knowledge of enterprise computing bar none; in his hands DTrace seemingly has no limit.

Here's how Jarod works. He gets on a plane and arrives at some gigantic (and usually petulant) corporation. Having never seen the application, he then says something like: I'll get you a 20% win or don't pay me. He sits down with DTrace and gets between 20% and 20,000% (no joke). And from those experiences, he's provided input that's improved DTrace immeasurably (in fact, Jarod hacked up the first version of Java support in DTrace "right quick").

So how does he do it? I only have a very vague idea. Luckily, DTrace user number one is also the latest member of the blogosphere. Check out Jarod's blog to get the latest DTrace war stories. I know I've been awaiting this one for a while, and it's about damned time.



Double-Parity RAID-Z

When ZFS first started, it was just Jeff trying to pair old problems with new solutions in margins too small to contain either. Then Matt joined up to bring some young blood to the project. By the time the project putback, the team had grown to more than a dozen. And now I've been pulled in -- if only for a cameo.

When ZFS first hit the streets, Jeff wrote about RAID-Z, an implementation of RAID designed for ZFS. RAID-Z improves upon previous RAID schemes primarily in that it eliminates the so-called "write hole" by using a full (and variable-sized) stripe for all write operations. It's worth noting that RAID-Z exploits the fact that ZFS is an end-to-end solution such that metadata (traditionally associated with the filesystem layer) is used to interpret the RAID layout on disk (an operation usually ascribed to a volume manager). In that post, Jeff mentioned that a double-parity version of RAID-Z was in the works. What he actually meant is that he had read a paper, and thought it might work out -- you'd be forgiven for inferring that actual code had been written.

Over lunch, Bill -- yet another elite ZFS hacker -- mentioned double-parity RAID-Z and their plans for implementing it. I pressed for details, read the paper, got interested in the math, and started yakking about it enough for Bill to tell me to put up or shut up.

RAID-6

The basic notion behind double-parity RAID or RAID-6 is that a stripe can survive two failures without losing data where RAID-5 can survive only a single failure. There are a number of different ways of implementing double-parity RAID; the way Jeff and Bill had chosen (due to its computational simplicity and lack of legal encumbrance) was one described by H. Peter Anvin in this paper. It's a nice read, but I'll attempt to summarize some of the math (warning: this summary is going to be boring and largely unsatisfying so feel free to skip it).

For a given stripe of $n$ data blocks, $D_0, \ldots, D_{n-1}$, RAID-5 computes the contents of the parity disk $P$ by taking the bitwise XOR of those data blocks. If any $D_i$ is corrupted or missing, we can recover it by taking the XOR of all other data blocks with $P$. With RAID-6, we need to compute another parity disk $Q$ using a different technique such that $Q$ alone can reconstruct any $D_i$, and $P$ and $Q$ together can reconstruct any two data blocks.

To talk about this, it's easier -- believe it or not -- to define a Galois field (or a finite field as I learned it) over the integers [0..255] -- the values that can be stored in a single byte. The addition field operation ($+$) is just bitwise XOR. Multiplication ($\times$) by 2 is given by this bitwise operation for $x \times 2 = y$:

    $y_7 = x_6$
    $y_6 = x_5$
    $y_5 = x_4$
    $y_4 = x_3 + x_7$
    $y_3 = x_2 + x_7$
    $y_2 = x_1 + x_7$
    $y_1 = x_0$
    $y_0 = x_7$

A couple of simple things worth noting: addition ($+$) is the same as subtraction ($-$), 0 is the additive identity and the multiplicative annihilator, 1 is the multiplicative identity. Slightly more subtle: each element of the field except for 0 (i.e. [1..255]) can be represented as $2^k$ for some $k$. And importantly: $x^{-1} = x^{254}$. Also note that $x \times y$ can be rewritten as $2^{\log x} \times 2^{\log y}$ or $2^{\log x + \log y}$ (where $+$ in that case is normal integer addition).

We compute $Q$ as

    $Q = 2^{n-1} D_0 + 2^{n-2} D_1 + \cdots + D_{n-1}$

or equivalently

    $Q = ((\cdots((D_0 \times 2 + D_1) \times 2 + \cdots) \times 2 + D_{n-2}) \times 2 + D_{n-1}$.

Computing $Q$ isn't much slower than computing $P$ since we're just dealing with a few simple bitwise operations.

With $P$ and $Q$ we can recover from any two failures. If $D_x$ fails, we can repair it with $P$. If $P$ also fails, we can recover $D_x$ by computing $Q_x$ where $Q_x = Q + 2^{n-1-x} \times D_x$ (easily done by performing the same computation as for generating $Q$ but with $D_x$ set to 0); $D_x$ is then $(Q_x + Q) / 2^{n-1-x} = (Q_x + Q) \times 2^{x+1-n}$. Once we solve for $D_x$, then we recompute $P$ as we had initially.

When two data disks are missing, $D_x$ and $D_y$, that's when the rubber really meets the road. We compute $P_{xy}$ and $Q_{xy}$ such that $P_{xy} + D_x + D_y = P$ and $Q_{xy} + 2^{n-1-x} \times D_x + 2^{n-1-y} \times D_y = Q$ (as before). Using those two expressions and some basic algebra, we can solve for $D_x$ and then plug that in to solve for $D_y$. The actual expressions are a little too hairy for HTML, but you can check out equation 16 in the paper or the code for the gory details.

Double-Parity RAID-Z

As of build 42 of
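The RAID-6 arithmetic summarized in the previous section can be sketched in a few lines of Python. This is an illustration, not the ZFS implementation: it works on one byte per data disk, the function names are invented for the example, and the two-disk solution is just the "basic algebra" applied directly. With $a = 2^{n-1-x}$ and $b = 2^{n-1-y}$, adding $b \times (P + P_{xy})$ to $(Q + Q_{xy})$ leaves $(a + b) \times D_x$. The reduction constant 0x11d is the one implied by the multiply-by-2 bit table in that section.

```python
def mul2(x):
    """Multiply by 2 in the field: shift left, fold the carry back in."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11d  # clears bit 8 and XORs x7 into bits 4, 3, 2 and 0
    return x


def gf_mul(a, b):
    """General multiplication built from XOR (addition) and mul2."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = mul2(a)
        b >>= 1
    return result


def gf_pow(a, exponent):
    result = 1
    for _ in range(exponent):
        result = gf_mul(result, a)
    return result


def gf_inv(a):
    """x^-1 = x^254 for any non-zero x."""
    return gf_pow(a, 254)


def parity(data):
    """Return (P, Q) for a stripe given as one byte per data disk."""
    p = q = 0
    for d in data:
        p ^= d            # P is the plain XOR of the data
        q = mul2(q) ^ d   # Horner form of 2^(n-1) D0 + ... + D(n-1)
    return p, q


def recover_one_from_q(data, x, q):
    """Recover D_x using only Q (the value stored in data[x] is ignored)."""
    n = len(data)
    survivors = [0 if i == x else d for i, d in enumerate(data)]
    _, qx = parity(survivors)
    return gf_mul(q ^ qx, gf_inv(gf_pow(2, n - 1 - x)))


def recover_two(data, x, y, p, q):
    """Recover (D_x, D_y) from P and Q; data[x] and data[y] are ignored."""
    n = len(data)
    survivors = [0 if i in (x, y) else d for i, d in enumerate(data)]
    pxy, qxy = parity(survivors)
    a = gf_pow(2, n - 1 - x)
    b = gf_pow(2, n - 1 - y)
    dp = p ^ pxy  # equals D_x + D_y
    dq = q ^ qxy  # equals a*D_x + b*D_y
    dx = gf_mul(dq ^ gf_mul(b, dp), gf_inv(a ^ b))
    return dx, dp ^ dx
```

A real implementation would apply the same operations to whole blocks and would typically use log and antilog tables rather than the loop-based gf_mul and gf_pow shown here.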



DTrace on Geek Muse

DTrace was recently featured on episode 35 of Geek Muse. DTrace was brought to their attention because of John Birrell's recent work to port it to FreeBSD (nice work, John!). The plug was nice, but I did want to respond to a few things:

DTrace was referred to as "a scripting language for debugging". While I can understand why one might get that impression, it's kind of missing the point. DTrace, concisely, is a systemic observability framework that's designed explicitly for use on mission-critical systems. It lets users and system administrators get concise answers to arbitrary questions. The scripting language aspect to DTrace lets you express those questions, but that's really just a component. James Dickens took a stab at an all-encompassing definition of DTrace....

One of the podcasters said something to the effect of "I'm just a web developer..." One of the great things about DTrace is that it has uses for developers at almost any layer of the stack. Initially DTrace could only view the kernel, and C and C++ code, but since its release in Solaris 10 well over a year ago, DTrace has been extended to Java, Ruby, php, python, perl, and a handful of other dynamic languages that folks who are "just web developers" tend to use. In addition to being able to understand how your own code works, you'll be able to see how it interacts with every level of the system all the way down to things like disk I/O and the CPU scheduler.

Shortly after that, someone opined "I could use it for looking at XML-RPC communication". For sure! DTrace is crazy useful for understanding communication between processes, and in particular for XML-RPC for viewing calls and replies quickly and easily.

At one point they also identified the need to make sure users can't use DTrace to spy on each other. By default, DTrace is only executable by the root user. System administrators can dole out various levels of DTrace privilege to users as desired. Check out the manual -- and the security chapter in particular.



DTrace at JavaOne 2006

At last year's JavaOne, DTrace enjoyed some modicum of celebrity, being featured in a keynote, a session, and the DTrace challenge. The session was attended by about 1,000 people (out of about 10,000 at the conference), and the DTrace Challenge -- which promised to find a performance win in Java applications brought to us or fork over an iPod -- went an impressive 15 for 16 (and that one outlier was a trivial 200 line program -- we've added a 2,000 line minimum this year).

Building on that success, Jarod Jenson and I are giving an updated session on DTrace for Java Thursday 5/18 2:45-3:45 in the Moscone Center Gateway 102/103. We'll be giving a similar introduction to DTrace and to using DTrace on Java as at last year's talk, and showing off some of the improvements we've made in the past year. Jarod's also manning the DTrace Challenge again this year so get your Java application onto a USB key, CD or whatever, and bring it by booth 739; if he can't make it go faster you'll get an iPod, but if you were hoping for an iPod rather than a performance win prepare to be disappointed. Angelo Rajadurai and Peter Karlsson are also giving a hands-on lab on using DTrace to examine Java applications (Friday 5/19 1:15-2:15 Moscone Center Hall E 130/131) so be sure to sign up for that if you want to get your hands dirty with the stuff we talk about at our session.



User-land tracing gets better and better

As I've mentioned in the past, developers can add their own DTrace probes using the user-land statically defined tracing (USDT) mechanism. It's been used to instrument Postgres and Apache, and to add observability into dynamic languages such as Java, Ruby, and php. I recently made a couple of improvements to USDT that I mentioned here and here, but I think deserve a little more discussion.

Adding USDT probes (as described in the DTrace manual) requires creating a file defining the probes, modifying the source code to identify those probe sites, and modifying the build process to invoke dtrace(1M) with the -G option which causes it to emit an object file which is then linked into the final binary. Bart wrote up a nice example of how to do this. The mechanisms are mostly the same, but have been tweaked a bit.

USDT in C++

One of the biggest impediments to using USDT was its (entirely understandable) enmity toward C++. Briefly, the problem was that the modifications to the source code used a structure that was incompatible with C++ (it turns out you can only extern "C" symbols at the file scope -- go figure). To address this, I added a new -h option that creates a header file based on the probe definitions. Here's what the new way looks like:

provider.d

    provider database {
            probe query__start(char *);
            probe query__done(char *);
    };

src.c or src.cxx

    ...
    #include "provider.h"
    ...
    static int
    db_query(char *query, char *result, size_t size)
    {
            ...
            DATABASE_QUERY_START(query);
            ...
            DATABASE_QUERY_DONE(result);
            ...
    }

Here's how you compile it:

    $ dtrace -h -s provider.d
    $ gcc -c src.c
    $ dtrace -G -s provider.d src.o
    $ gcc -o db provider.o src.o ...

If you've looked at the old USDT usage, the big differences are the creation and use of provider.h, and that we use the PROVIDER_PROBE() macro rather than the generic DTRACE_PROBE1() macro.
In addition to working with C++, this has the added benefit that it engages the compiler's type checking since the macros in the generated header file require the types specified in the provider definition.

Is-Enabled Probes

One of the tenets of DTrace is that the mere presence of probes can never slow down the system. We achieve this for USDT probes by only adding the overhead of a few no-op instructions. And while it's mostly true that USDT probes induce no overhead, there are some cases where the overhead can actually be substantial. The actual probe site is as cheap as a no-op, but setting up the arguments to the probe can be expensive. This is especially true for dynamic languages where probe arguments such as the class or method name are often expensive to compute. As a result, some providers -- the one for Ruby, for example -- couldn't be used in production due to the disabled probe effect.

To address this problem, Bryan and I came up with the idea of what -- for lack of a better term -- I call is-enabled probes. Every probe specified in the provider definition has an associated is-enabled macro (in addition to the actual probe macro). That macro is used to check if the DTrace probe is currently enabled so the program can then only do the work of computing the requisite arguments if they're needed.

For comparison, Rich Lowe's prototype Ruby provider basically looked like this:

    rb_call(...
    {
            ...
            RUBY_ENTRY(rb_class2name(klass), rb_id2name(method));
            ...
            RUBY_RETURN(rb_class2name(klass), rb_id2name(method));
            ...
    }

where rb_class2name() and rb_id2name() perform quite expensive operations.

With is-enabled probes, Bryan was able to greatly reduce the overhead of the Ruby provider to essentially zero:

    rb_call(...
    {
            ...
            if (RUBY_ENTRY_ENABLED())
                    RUBY_ENTRY(rb_class2name(klass), rb_id2name(method));
            ...
            if (RUBY_RETURN_ENABLED())
                    RUBY_RETURN(rb_class2name(klass), rb_id2name(method));
            ...
    }

When the source objects are post-processed by dtrace -G, each is-enabled site is turned into a simple move of 0 into the return value register (%eax, %rax, or %o0 depending on your ISA and bitness). When probes are disabled, we get to skip all the expensive argument setup; when a probe is enabled, the is-enabled site changes so that the return value register will have a 1. (It's also worth noting that you can pull some compiler tricks to make sure that the program text for the uncommon case -- probes enabled -- is placed out of line.)

The obvious question is then "When should is-enabled probes be used?" As with so many performance questions the only answer is to measure both. If you can eke by without is-enabled probes, do that: is-enabled probes are incompatible with versions of Solaris earlier than Nevada build 38 and they incur a greater enabled probe effect. But if acceptable performance can only be attained by using is-enabled probes, that's exactly where they were designed to be used.
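The effect of the is-enabled guard can be simulated outside of DTrace. The Python sketch below is only an analogy: the Probe class, class_name() and the call counter are hypothetical stand-ins (there is no DTrace binding here), and the real mechanism patches instructions in the binary rather than testing a flag. What it does show is the point of the pattern: the expensive argument is only computed when something is listening.

```python
class Probe:
    """Hypothetical stand-in for a USDT probe; not a real DTrace interface."""

    def __init__(self, name):
        self.name = name
        self.enabled = False
        self.fired = []

    def is_enabled(self):
        return self.enabled

    def fire(self, *args):
        # A disabled probe drops its arguments, but by the time we get
        # here the caller has already paid to compute them.
        if self.enabled:
            self.fired.append(args)


expensive_calls = {"count": 0}


def class_name(obj):
    """Plays the role of rb_class2name(): costly, so count each call."""
    expensive_calls["count"] += 1
    return type(obj).__name__


entry = Probe("entry")


def call_unguarded(obj):
    # The argument is computed on every call, enabled or not.
    entry.fire(class_name(obj))


def call_guarded(obj):
    # The argument is computed only when the probe is enabled.
    if entry.is_enabled():
        entry.fire(class_name(obj))
```

With the probe disabled, call_guarded() never touches class_name(), while call_unguarded() pays for it on every call and then throws the result away.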



DTrace for Linux

With BrandZ, it's now possible to use DTrace on Linux applications. For the uninitiated, DTrace is the dynamic tracing facility in OpenSolaris; it allows for systemic analysis of a scope and precision unequalled in the industry. With DTrace, administrators and developers can trace low-level services like I/O and scheduling, up the system stack through kernel function calls, system calls, and system library calls, and into applications written in C and C++ or any of a host of dynamic languages like Java, Ruby, Perl or php. One of my contributions to BrandZ was to extend DTrace support for Linux binaries executed in a branded Zone.

DTrace has several different instrumentation providers that know how to instrument a particular part of the system and provide relevant probes for that component. The io provider lets you trace disk I/O, the fbt (function boundary tracing) provider lets you trace any kernel function call, etc. A typical system will start with more than 30,000 probes, but providers can create probes dynamically to trace new kernel modules or user-land processes. When strictly focused on a user-land application, the most useful providers are typically the syscall provider to examine system calls and the pid provider that can trace any instruction in any process executing on the system.

For Linux processes, the pid provider just worked (well, once Russ built a library to understand the Linux run-time linker), and we introduced a new provider -- the lx-syscall provider -- to trace entry and return for emulated Linux system calls. With these providers it's possible to understand every facet of a Linux application's behavior and with the other DTrace probes it's possible to reason about an application's use of system resources.
In other words, you can take that sluggish Linux application, stick it in a branded Zone, dissect it using Solaris tools, and then bring it back to a native Linux system with the fruits of your DTrace investigation[1].

To give an example of using DTrace on Linux applications, I needed an application to examine. I wanted a well known program that either didn't run on Solaris or operated sufficiently differently that examining the Linux version rather than the Solaris port made sense. I decided on /usr/bin/top partly because of the dramatic differences between how it operates on Linux vs. Solaris (due to the differences in /proc), but mostly because of what I've heard my colleague, Bryan, refer to as the "top problem": your system is slow, so you run top. What's the top process? Top!

Running top in the Linux branded zone, I opened a shell in the global (Solaris) zone to use DTrace. I started as I do on Solaris applications: I looked at system calls. I was interested to see which system calls were being executed most frequently, which is easily expressed in DTrace:

    bash-3.00# dtrace -n lx-syscall:::entry'/execname == "top"/{ @[probefunc] = count(); }'
    dtrace: description 'lx-syscall:::entry' matched 272 probes
    ^C

      fstat64              322
      access               323
      gettimeofday         323
      gtime                323
      llseek               323
      mmap2                323
      munmap               323
      select               323
      getdents64          1289
      lseek               1291
      stat64              3545
      rt_sigaction        5805
      write               6459
      fcntl64             6772
      alarm               8708
      close              11282
      open               14827
      read               14830

Note the use of the aggregation denoted with the '@'. Aggregations are the mechanism by which DTrace allows users to examine patterns of system behavior rather than examining each individual datum -- each system call for example.

(In case you also noticed the strange discrepancy between the number of open and close system calls, many of those opens are failing so it makes sense that they would have no corresponding close.
I used the lx-syscall provider to suss this out, but I omitted that investigation in a vain appeal for brevity.)

There may well be something fishy about this output, but nothing struck me as so compellingly fishy as to explore immediately. Instead, I fired up vi and wrote a short D script to see which system calls were taking the most time:

lx-sys.d

    #!/usr/sbin/dtrace -s

    lx-syscall:::entry
    /execname == "top"/
    {
            self->ts = vtimestamp;
    }

    lx-syscall:::return
    /self->ts/
    {
            @[probefunc] = sum(vtimestamp - self->ts);
            self->ts = 0;
    }

This script creates a table of system calls and the time spent in them (in nanoseconds). The results were fairly interesting.

    bash-3.00# ./lx-sys.d
    dtrace: script './lx-sys.d' matched 550 probes
    ^C

      llseek                4940978
      gtime                 5993454
      gettimeofday          6603844
      fstat64              14217312
      select               26594875
      lseek                30956093
      mmap2                43463946
      access               49033498
      alarm                72216971
      fcntl64             188281402
      rt_sigaction        197646176
      stat64              268188885
      close               417574118
      getdents64          781844851
      open               1314209040
      read               1862007391
      write              2030173630
      munmap             2195846497

That seems like a lot of time spent in munmap for top.
In fact, I'm rather surprised that there's any mapping and unmapping going on at all (I guess that should have raised an eyebrow after my initial system call count). Unmapping memory is a pretty expensive operation that gets more expensive on bigger systems as it requires the kernel to do some work on every CPU to completely wipe out the mapping.

I then modified lx-sys.d to record the total amount of time top spent on the CPU and the total amount of time spent in system calls to see how large a chunk of time these seemingly expensive unmap operations were taking:

lx-sys2.d

    #!/usr/sbin/dtrace -s

    lx-syscall:::entry
    /execname == "top"/
    {
            self->ts = vtimestamp;
    }

    lx-syscall:::return
    /self->ts/
    {
            @[probefunc] = sum(vtimestamp - self->ts);
            @["- all syscalls -"] = sum(vtimestamp - self->ts);
            self->ts = 0;
    }

    sched:::on-cpu
    /execname == "top"/
    {
            self->on = timestamp;
    }

    sched:::off-cpu
    /self->on/
    {
            @["- total -"] = sum(timestamp - self->on);
            self->on = 0;
    }

I used the sched provider to see when top was going on and off of the CPU, and I added a row to record the total time spent in all system calls. Here were the results (keep in mind I was just hitting ^C to end the experiment after a few seconds so it's expected that these numbers would be different from those above; there are ways to have more accurately timed experiments):

    bash-3.00# ./lx-sys2.d
    dtrace: script './lx-sys2.d' matched 550 probes
    ^C

      llseek                 939771
      gtime                 1088745
      gettimeofday          1090020
      fstat64               2494614
      select                4566569
      lseek                 5186943
      mmap2                 7300830
      access                8587484
      alarm                11671436
      fcntl64              31147636
      rt_sigaction         33207341
      stat64               45223200
      close                69338595
      getdents64          131196732
      open                220188139
      read                309764996
      write               340413183
      munmap              365830103
      - all syscalls -   1589236337
      - total -          3258101690

So system calls are consuming roughly half of top's time on the CPU and the munmap syscall is consuming roughly a quarter of that.
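The "roughly half" and "roughly a quarter" figures follow directly from the last three rows of that output. As a quick check, here is the arithmetic in Python; the variable names are mine, and like the post it treats the vtimestamp-based and timestamp-based sums as directly comparable.

```python
# Nanosecond totals copied from the lx-sys2.d output above.
MUNMAP_NS = 365830103
ALL_SYSCALLS_NS = 1589236337
TOTAL_ON_CPU_NS = 3258101690

syscall_share = ALL_SYSCALLS_NS / TOTAL_ON_CPU_NS
munmap_share_of_syscalls = MUNMAP_NS / ALL_SYSCALLS_NS
munmap_share_of_total = MUNMAP_NS / TOTAL_ON_CPU_NS

print(f"system calls / on-CPU time: {syscall_share:.1%}")
print(f"munmap / system call time:  {munmap_share_of_syscalls:.1%}")
print(f"munmap / on-CPU time:       {munmap_share_of_total:.1%}")
```

That works out to just under 49% of on-CPU time in system calls, about 23% of system call time in munmap, and so a bit over 11% of top's on-CPU time spent unmapping memory.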
This was enough to convince me that there was probably room for improvement and further investigation might bear fruit.

Next, I wanted to understand what this mapped memory was being used for so I wrote a little script that traces all the functions called in the process between when memory is mapped using the mmap2(2) system call and when it's unmapped and returned to the system through the munmap(2) system call:

map.d

    #!/usr/sbin/dtrace -s

    #pragma D option quiet

    lx-syscall::mmap2:return
    /pid == $target/
    {
            self->ptr = arg1;
            self->depth = 10;
            printf("%*.s [...] %s`%s\n", self->depth, "", probemod, probefunc);
    }

    pid$target:::return
    /self->ptr/
    {
            printf("%*.s [...] %s syscall\n", self->depth, "", probefunc);
            self->ptr = 0;
            self->depth = 0;
            exit(0);
    }

This script uses the $target variable which means that we need to run it with the -p <pid> option where <pid> is the process ID of top. When mmap2 returns, we set a thread local variable, 'ptr', which stores the address at the base of the mapped region; for every function entry and return in the process we call printf() if self->ptr is set; finally, we exit DTrace when munmap is called with that same address. Here are the results:

    bash-3.00# ./map.d -p `pgrep top`
    [...]follow = arg0;
    }

    lx-syscall::mmap2:entry
    /self->follow/
    {
            @["mapped"] = count();
            self->follow = 0;
    }

    pid$target::malloc:return
    /self->follow/
    {
            @["no map"] = count();
            self->follow = 0;
    }

Here are the results:

    bash-3.00# ./malloc.d -p `pgrep top`
    dtrace: script './malloc.d' matched 11 probes
    ^C

      mapped         275
      no map        3024

So a bunch of allocations result in a mmap, but not a huge number.
Next I decided to explore if there might be a correlation between the size of the allocation and whether or not it resulted in a call to mmap using the following script:

malloc2.d

    #!/usr/sbin/dtrace -s

    pid$target::malloc:entry
    {
            self->size = arg0;
    }

    lx-syscall::mmap2:entry
    /self->size/
    {
            @["mapped"] = quantize(self->size);
            self->size = 0;
    }

    pid$target::malloc:return
    /self->size/
    {
            @["no map"] = quantize(self->size);
            self->size = 0;
    }

Rather than just counting the frequency, I used the quantize aggregating action to build a power-of-two histogram on the number of bytes being allocated (self->size). The output was quite illustrative:

    bash-3.00# ./malloc2.d -p `pgrep top`
    dtrace: script './malloc2.d' matched 11 probes
    ^C

      no map
               value  ------------- Distribution ------------- count
                   2 |                                         0
                   4 |@@@@@@@                                  426
                   8 |@@@@@@@@@@@@@@@                          852
                  16 |@@@@@@@@@@@                              639
                  32 |@@@@                                     213
                  64 |                                         0
                 128 |                                         0
                 256 |                                         0
                 512 |@@@@                                     213
                1024 |                                         0

      mapped
               value  ------------- Distribution ------------- count
              131072 |                                         0
              262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 213
              524288 |                                         0

All the allocations that required a mmap were huge -- between 256k and 512k. Now it makes sense why the Linux libc allocator would treat these allocations a little differently than reasonably sized allocations. And this is clearly a smoking gun for top performance: it would do much better to preallocate a huge buffer and grow it as needed (assuming it actually needs it at all) than to malloc it each time. Tracking down the offending line of code would just require a non-stripped binary and a little DTrace invocation like this:

    # dtrace -n pid`pgrep top`::malloc:entry'/arg0 >= 262144/{@[ustack()] = count()}'

From symptoms to root cause on a Linux application in a few DTrace scripts -- and it took me approximately 1000 times longer to cobble together some vaguely coherent prose describing the scripts than it did for me to actually perform the investigation. BrandZ opens up some pretty interesting new vistas for DTrace.
I look forward to seeing Linux applications being brought in for tune-ups on BrandZ and then reaping those benefits either back on their mother Linux or sticking around to enjoy the fault management, ZFS, scalability, and, of course, continued access to DTrace in BrandZ.

[1] Of course, results may vary since the guts of the Linux kernel differ significantly from those of the Solaris kernel, but they're often fairly similar. I/O or scheduling problems will be slightly different, but often not so different that the conclusions lack applicability.

[2] Actually, we can still trace function calls -- in fact, we can trace any instruction -- but it takes something of a heroic effort. We could disassemble parts of top to identify call sites and then use the esoteric pid123::-:address probe format to trace the stripped function. I said we could do it; I never said it would be pretty.

Technorati Tags: BrandZ, DTrace, Solaris, OpenSolaris
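
The fix suggested above for top (keep one big buffer around and grow it only when needed, rather than calling malloc for a 256k+ region on every pass) can be sketched in C roughly as follows. This is an illustration under assumptions, not top's actual code: the `scratch` structure, the `scratch_reserve()` name and the 4096-byte starting size are all hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reusable buffer; not taken from top's source. */
struct scratch {
    char   *buf;   /* current allocation, or NULL before first use */
    size_t  cap;   /* number of bytes available at buf */
};

/*
 * Make sure at least `need` bytes are available and return the buffer.
 * The allocator is only called when the buffer has to grow, so a caller
 * asking for the same size on every pass triggers a single allocation.
 * Returns NULL if growing fails; the old buffer is left untouched.
 */
char *
scratch_reserve(struct scratch *s, size_t need)
{
    size_t ncap;
    char *nbuf;

    if (need <= s->cap)
        return (s->buf);

    /* Double the capacity until it fits (assumed starting size: 4096). */
    ncap = (s->cap != 0) ? s->cap : 4096;
    while (ncap < need) {
        if (ncap > SIZE_MAX / 2) {
            ncap = need;    /* avoid overflow when doubling */
            break;
        }
        ncap *= 2;
    }

    nbuf = realloc(s->buf, ncap);
    if (nbuf == NULL)
        return (NULL);

    s->buf = nbuf;
    s->cap = ncap;
    return (nbuf);
}
```

A caller would keep one `struct scratch` (initialized to `{ NULL, 0 }`) for the life of the process and call `scratch_reserve()` at the top of each refresh; after the first pass no further large allocations, and therefore no further mmap/munmap pairs, should be needed for it.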



OpenSolaris on LugRadio

The good folks over at LugRadio (that LUG; not the other LUG of course) invited me on their show to answer some of their questions about OpenSolaris. You can find it in the latest episode. Check your volume before you start playing the episode: the first reference to masturbation is approximately 6 seconds in.

Understandably, the LugRadio guys didn't have much exposure to OpenSolaris, but they were certainly interested and recognized that it represents a pretty exciting development in the world of open source. Mostly I addressed some of the differences between Solaris and Linux. They seemed quite keen on Nexenta and I recommend that Solaris and Debian users alike try it out. There's one small correction I'd like to make regarding my comments on the show: I erroneously claimed that Solaris 7 -- the first version of Solaris to support 64-bit processors -- came out in 1996. In fact, Solaris 7 came out in November of 1998 (64-bit support having been first integrated in October of 1997). That cuts our 64-bit experience to 7 or 8 years from the 10 that I claimed.

On a vaguely related note, I'll be in London speaking at the OpenSolaris user group meeting on 12/19/2005. That's after a tromp across Europe which includes a DTrace cum Java talk and an OpenSolaris BoF at JavaPolis in Belgium.

Technorati tags: OpenSolaris, LugRadio, JavaPolis



Fall of Code?

Actually, Fall, Winter and Spring of code. Sun just announced the Solaris 10 University Challenge Contest. That's right, it's a challenge and a contest, and with three times the staying power of a single season of code. Apparently in a modern day treaty of Brest-Litovsk, Google ceded the rest of the year to Sun. Perhaps this was the real beef of the recent Sun/Google agreement.

This is actually pretty cool: be a college student, do something cool on OpenSolaris, take a shot at winning the grand prize of $5k and a sweet box (I imagine there might be prizes for other interesting entries). There are some project ideas on the contest/challenge/seasonal coding page ranging from good (MythTV), to mundane (support for inkjet printers from Epson), to confusing (internet gaming -- I thought online gambling was its own reward), to inane ("A joystick driver - for gaming", oh for gaming? I've been using it for system administration). Here's my list off the top of my head -- if you want more ideas, feel free to drop me a line.

Work on an existing open source project

- pearpc runs ppc applications on x86. I started working on porting it over and was able to boot Darwin.
- valgrind is very cool (I've only just seen it). It would be great to port it or to use pieces like KCacheGrind and plug in DTrace as a backend.
- Port over your favorite application, or get it running in some emulation environment of your own devising.
- Make something go faster. MySQL, Gnome, mozilla, some random system call, whatever; there's a lot of inefficiency out there.

Write something new (using cool stuff in Solaris)

- I'd love to see more dynamic languages with native DTrace support. We've already got support for Java, php, Ruby, and Perl in some form; make it better or add support for some other language you know and love (TCL, python, scheme, LISP, ML, etc.).
- Build another kind of analysis tool on top of DTrace. We're working on a Java binding which is going to make this easier.
- Write a device driver for your favorite crazy device (which I assume is your new iPod nano or something; you're such a hipster Apple fanboy).
- Build a tool to simulate a distributed environment on Zones and use DTrace to monitor the communication. WARNING: your distributed systems professor will be your new best friend.

That's what I'd do, but if you have a better idea, go do that instead.

Technorati tags: OpenSolaris, Summer o' Code



OpenSolaris and svk

Today at EuroOSCON, I attended an introductory talk on svk by Chia-liang Kao. I was hopeful that svk might address some of the issues that I thought would prevent us from adopting Subversion for OpenSolaris. In particular, Subversion requires a centralized repository whereas svk, which is built on top of Subversion, provides the distributed revision control system that we'd need. After the talk, my overall impression was that svk seemed to lack a certain polish, but after struggling to phrase that in less subjective terms, I'm coming around a bit.

I got a little nervous when the first example of svk's use was for keeping the contents of /etc under revision control. The big problem that svk solved was having random directories (.svn, SCCS, whatever) in, for example, /etc/skel. Talk about trivia (almost as relevant as a demo implementing a weblog in Ruby on Rails). I guess it's nice that svk solves a problem for that particularly esoteric scenario, but wasn't there some mention that this might be used to, you know, hold onto source code? Was that actually the design center for svk?

Fortunately we did get to the details of using svk for a source code repository. I think this is just my bias coming from teamware, but some of the mechanisms seem a bit strange. In particular, you do svk mirror to make one kind of copy of the main repository (the kind that's a "local repository"), and svk checkout to make a different kind of copy (the kind that's the "working area"). In other words, you have a tree structure, but the branches and leaves are different entities entirely and editing can only be done on the leaves. I guess that's not such a huge pain, but I think this reflects the origins: taking a local revision control system and making it distributed. Consequently, there's a bunch of stuff left over from Subversion (like branches) that seems rather redundant in a distributed revision control system (don't take a branch, make another local repository, right?); it's not that these actually hurt anything, it just means that there's a bunch of complexity for what's essentially ancillary functionality.

Another not-a-big-deal-that-rubs-me-the-wrong-way is that svk is a pile of perl modules (of course, there's probably a specific perlism for that; "epocalyptus" or something I'm sure). I suppose we'll only have to wrestle that bear to the ground once, and stuff it in a tar ball (yes, Alan, or a package). To assuage my nervousness, I'd at least like to be confident that we could maintain this ourselves, but I don't think we have the collective perl expertise (except for the aforementioned Alan) to be able to fix a bug or add a feature we desperately need.

I want this thing to work, because svk seems to be the best option currently available, but I admit that I was a bit disappointed. If we were going to use this for OpenSolaris, we'd certainly need to distribute it in a self-contained form, and really take it through the paces to make sure it could do all the things we need in the strangest edge cases. As I mentioned, we currently use teamware which is an old Sun product that's been in constant use despite being end-of-lifed several years ago. While I like its overall design, there's a bunch of work that would need to happen for it to be suitable for OpenSolaris. In particular, it currently requires a shared NFS filesystem, and we require you to be on the host machine when committing a change. Adding networked access capabilities to it would probably be a nightmare. It also relies on SCCS (an ancient version control system) for managing individual files; not a problem per se, but a little crufty. Teamware is great and has withstood the test of time, but svk is probably closer to satisfying our requirements.

Update: But there are quite a few other options I hadn't looked into. Svk no longer looks like a front runner. If there are other systems you think are worth considering, let me know.

I'll play with svk some more and try to psych myself up for this brave new world. I'd appreciate any testimonials or feedback, and, of course, please correct all the factual errors I'm sure I committed.

Technorati tags: EuroOSCON, OpenSolaris, svk



OpenSolaris and Subversion

I just attended Brian W. Fitzpatrick's talk on Subversion at EuroOSCON. Brian did a great job and Subversion looks like a really complete replacement for cvs -- the stated goal of the project. What I was particularly interested in was the feasibility of using Subversion as the revision control system for OpenSolaris; according to the road map we still have a few months to figure it out, but, as my grandmother always said while working away at her mechanical Turing machine, time flies when you're debating the merits of various revision control systems.

While Subversion seems like some polished software, I don't think it's a solution -- or at least a complete solution -- to the problem we have with OpenSolaris. In particular, it's not a distributed revision control system, meaning that there's one master repository that manages everything including branches and sub-branches. This means that if you have a development team at the distal point on the globe from the main repository (we pretty much do), all that team's work has to traverse the globe. Now the Subversion folks have ensured that the over-the-wire protocol is lean, but that doesn't really address the core of the problem -- the concern isn't potentially slow communication, it's that it happens at all. Let's say a bunch of folks -- inside or outside of Sun -- start working on a project; under Subversion there's a single point of failure -- if the one server in Menlo Park goes down (or the connection to it goes down), the project can't accept any more integrations. I'm also not clear if branches can have their own policies for integrations. There are a couple other issues we'd need to solve (e.g. comments are made per-integration rather than per-file), but this is by far the biggest.

Brian recommended a talk on svk later this week; svk is a distributed revision control and source management system that's built on Subversion. I hope svk solves the problems OpenSolaris would have with Subversion, but it would be great if Subversion could eventually migrate to a distributed model. I'd also like to attend this BoF on version control systems, but I'll be busy at the OpenSolaris User Group meeting -- where I'm sure you'll be as well.

Technorati tags: EuroOSCON, OpenSolaris



Too much pid provider

Perhaps it's a bit Machiavellian, but I just love code that in some way tricks another piece of code. For example, in college I wrote some code that trolled through the address space of my favorite game to afford me certain advantages. Most recently, I've been working on some code that tricks other code into believing a complete fiction[1] about what operating system it's executing on. While working on that, I discovered an interesting problem with the pid provider -- code that's all about deception and sleight of hand. Before you read further, be warned: I've already written two completely numbing accounts of the details of the pid provider here and here, and this is going to follow much in that pattern. If you skip this one for fear of being bored to death[2], I won't be offended.

The problem arose because the traced process tried to execute an x86 instruction like this:

    call *0x10(%gs)

This instruction is supposed to perform a call to the address loaded from 0x10 bytes beyond the base of the segment described by the %gs selector. The neat thing about the pid provider (in case you've skipped those other posts) is that most instructions are executed natively, but some -- and call is one of them -- have to be emulated in the kernel. This instruction's somewhat unusual behavior needed to be emulated precisely; the pid provider, however, didn't know from selector prefixes and blithely tried to load from the absolute virtual address 0x10. Whoops.

To correct this, I needed to add some additional logic to parse the instruction and then augment the emulation code to know how to deal with these selectors. The first part was trivial, but the second half involved some digging into the x86 architecture manual. There are two kinds of descriptor tables, the LDT (local) and GDT (global). The value of %gs, in this case, tells us which table to look in, the index into that table, and the permissions associated with that selector.

Below is the code I added to usr/src/uts/intel/dtrace/fasttrap_isa.c to handle this case. You can find the context here.

    if (tp->ftt_code == 1) {

        /*
         * If there's a segment prefix for this
         * instruction, first grab the appropriate
         * segment selector, then pull the base value
         * out of the appropriate descriptor table
         * and add it to the computed address.
         */
        if (tp->ftt_segment != FASTTRAP_SEG_NONE) {
            uint16_t sel, ndx;
            user_desc_t *desc;

            switch (tp->ftt_segment) {
            case FASTTRAP_SEG_CS:
                sel = rp->r_cs;
                break;
            case FASTTRAP_SEG_DS:
                sel = rp->r_ds;
                break;
            case FASTTRAP_SEG_ES:
                sel = rp->r_es;
                break;
            case FASTTRAP_SEG_FS:
                sel = rp->r_fs;
                break;
            case FASTTRAP_SEG_GS:
                sel = rp->r_gs;
                break;
            case FASTTRAP_SEG_SS:
                sel = rp->r_ss;
                break;
            }

            /*
             * Make sure the given segment register
             * specifies a user priority selector
             * rather than a kernel selector.
             */
            if (!SELISUPL(sel)) {
                fasttrap_sigsegv(p, curthread,
                    addr);
                new_pc = pc;
                break;
            }

            ndx = SELTOIDX(sel);

            if (SELISLDT(sel)) {
                if (ndx > p->p_ldtlimit) {
                    fasttrap_sigsegv(p,
                        curthread, addr);
                    new_pc = pc;
                    break;
                }

                desc = p->p_ldt + ndx;

            } else {
                if (ndx >= NGDT) {
                    fasttrap_sigsegv(p,
                        curthread, addr);
                    new_pc = pc;
                    break;
                }

                desc = cpu_get_gdt() + ndx;
            }

            addr += USEGD_GETBASE(desc);
        }

The thing I learned by writing this is how to find the base address for those segment selectors, which is something I've been meaning to figure out. We (and most other operating systems) get to the thread pointer through a segment selector, so when debugging in mdb(1) I've often wondered how to perform the mapping from the value of %gs to the thread pointer that I care about. I haven't put that code back yet, so feel free to point out any problems you see. Anyway, if you made it here, congratulations and thanks.

[1] Such is my love of the elaborate ruse that I once took months setting up a friend of mine for a very minor gag. Lucas and I were playing scrabble and he was disappointed to hear that the putative word "fearslut" wasn't good. Later I conspired with a friend at his company to have a third party send mail to an etymology mailing list claiming that he had found the word "fearslut" in an old manuscript of an obscure Shakespeare play. Three months later Lucas triumphantly announced to me that, lo and behold, "fearslut" was a word. I think I passed out I was laughing so hard.

[2] My parents are fond of recounting my response when they asked what I was doing in my operating systems class during college: "If I told you, you wouldn't understand, and if I explained it, you'd be bored."
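
For anyone who wants the selector arithmetic spelled out, here is a small user-level sketch in C of what a 16-bit x86 segment selector and a legacy 8-byte segment descriptor encode. It follows the layout in the x86 architecture manual; the names (`sel_decode`, `desc_base` and friends) are made up for illustration and are not the Solaris macros, though SELTOIDX, SELISLDT, SELISUPL and USEGD_GETBASE in the code above presumably extract these same fields.

```c
#include <stdint.h>

/* Which descriptor table a selector points into (hypothetical names). */
enum sel_table { SEL_GDT = 0, SEL_LDT = 1 };

/* The three fields packed into a 16-bit segment selector. */
struct sel_fields {
    unsigned        rpl;    /* bits 0-1: requested privilege level */
    enum sel_table  table;  /* bit 2: table indicator (0 GDT, 1 LDT) */
    unsigned        index;  /* bits 3-15: index into that table */
};

/* Split a selector value (e.g. the contents of %gs) into its fields. */
struct sel_fields
sel_decode(uint16_t sel)
{
    struct sel_fields f;

    f.rpl = sel & 0x3;
    f.table = (sel & 0x4) ? SEL_LDT : SEL_GDT;
    f.index = sel >> 3;
    return (f);
}

/*
 * Pull the 32-bit base address out of a legacy segment descriptor,
 * passed here as one little-endian 64-bit value. Base bits 0-23 live
 * in descriptor bits 16-39 and base bits 24-31 in descriptor bits 56-63.
 */
uint32_t
desc_base(uint64_t desc)
{
    uint32_t low = (uint32_t)((desc >> 16) & 0xffffff);
    uint32_t high = (uint32_t)((desc >> 56) & 0xff);

    return (low | (high << 24));
}
```

With these, emulating a load through a segment prefix amounts to decoding the selector, picking the GDT or LDT entry at `index`, and adding `desc_base()` of that entry to the computed address, which is the shape of the kernel code above.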



The mysteries of _init

I hadn't been fully aware that I felt this way, but I recently had a realization: I love the linker. It's a technology that's amazing in both its simplicity and its complexity. I'm sure my feelings are influenced in no small way by the caliber of the engineers working on it -- Rod and Mike are always eager to explain how some facet of the linker works or to add something new and whizzy if it can't quite do what I need.

Over the course of developing user-level statically defined tracing (USDT), I've worked (and continue to work) with the linker guys to figure out the best way to slot the two technologies together. Recently, some users of USDT have run into a problem where binaries compiled with USDT probes weren't actually making them available to the system. We eventually tracked it down to incorrect use of the linker. I thought it would be helpful to describe the problem and the solution in case other people bump into something similar.

First a little bit on initialization. In a C compiler, you can specify an initialization function like this: #pragma init(my_init). The intention of this is to have the specified function (e.g. my_init) called when the binary is loaded into the program. This is a good place to do initialization like memory allocation or other set up used in the rest of the binary. What the compiler actually does when you specify this is create a ".init" section which contains a call to the specified function.

As a concrete example (and the example relevant to this specific manifestation of the problem), take a look at this code in usr/src/lib/libdtrace/common/drti.c:

    #pragma init(dtrace_dof_init)
    static void
    dtrace_dof_init(void)
    {

When we compile this into an object file (which we then deliver in /usr/lib/dtrace/drti.o), the compiler generates a .init ELF section that contains a call to dtrace_dof_init() (actually it contains a call with a relocation that gets filled in to be the address of dtrace_dof_init(), but that's a detail for another blog entry).

The linker doesn't really do anything special with .init ELF sections -- it just concatenates them like it does all other sections with the same name. So when you compile a bunch of object files with .init sections, they just get crammed together -- there's still nothing special that causes them to get executed when the binary is loaded.

Here's the clever part: when a compiler invokes the linker, it provides two special object files: crti.o at the beginning, and crtn.o at the end. You can find those binaries on your system in /usr/lib/ or in /usr/sfw/lib/gcc/... for the gcc version. Those binaries are where the clever part happens; crti.o's .init section contains effectively an open brace and crtn.o contains the close brace (the function prologue and epilogue respectively):

    $ dis -t .init /usr/lib/crti.o
    section .init
    _init()
        _init:  55                  pushl  %ebp
        1:      8b ec               movl   %esp,%ebp
        3:      53                  pushl  %ebx
        4:      e8 00 00 00 00      call   +0x5
        9:      5b                  popl   %ebx
        a:      81 c3 03 00 00 00   addl   $0x3,%ebx

    $ dis -t .init /usr/lib/crtn.o
    section .init
        0:      5b                  popl   %ebx
        1:      c9                  leave
        2:      c3                  ret

By now you may see the punch-line: by bracketing the user-generated object files with crti.o and crtn.o, the resulting .init section is the concatenation of the function prologue, all the calls in the user's object files, and finally the function epilogue. All of this is contained in the symbol called _init.

The linker then has some magic that identifies the _init function as special and includes a dynamic entry (DT_INIT) that causes _init to be called by the run-time linker (ld.so.1) when the binary is loaded. In the binary that was built with USDT but wasn't working properly, there was a .init section with the call to dtrace_dof_init(), but no _init symbol. The problem was, of course, that crti.o and crtn.o weren't being specified in the linker invocation, resulting in a .init section but no _init symbol, so no DT_INIT entry, so no initialization and no USDT.
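
To see the load-time initialization idea in isolation, here is a minimal C sketch. It uses `__attribute__((constructor))`, the GCC/Clang spelling of the same request that #pragma init makes to the Sun compiler; that substitution is mine, and depending on the toolchain the call may be recorded in a section other than .init (modern ones tend to use .init_array), so treat this as an illustration of the behavior rather than of the exact mechanism described above. The names `my_init`, `probes_registered` and `init_has_run` are hypothetical.

```c
/* Hypothetical flag standing in for whatever set-up the initializer does. */
static int probes_registered = 0;

/*
 * Ask the toolchain to run this function when the object is loaded,
 * before main() gets control. With the Sun compiler the equivalent
 * request would be written: #pragma init(my_init)
 */
__attribute__((constructor))
static void
my_init(void)
{
    probes_registered = 1;
}

/* Returns 1 once the load-time initializer has run, 0 otherwise. */
int
init_has_run(void)
{
    return (probes_registered);
}
```

If the program is linked through the compiler driver, `init_has_run()` should already return 1 at the top of main(). Linking by hand without the C runtime start-up objects is the kind of mistake that leaves the initializer in the binary but never called.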



DTrace User-Land

One of the primary motivations for DTrace was the absence of a framework that united observability into all aspects of the system. There were certainly tools for looking at the individual components (iostat(1) for I/O; mpstat(1) and prstat(1) for some basic system monitoring; truss(1), gdb(1), mdb(1) and dbx(1) for examining processes), but correlating the data from the disparate sources was difficult or impossible (tell that to a room of system administrators and they start nodding like a bunch of Barry Bonds bobble-heads). DTrace was designed to fill that hole -- to be a framework for systemic analysis and to provide a single interface for observing and combining data from any corner of the system.

I joined the DTrace team because I had some ideas for how to implement user-land tracing (which I've discussed ad nauseam here, here, here, and pretty much the rest of my blog). At that point, most of the work on DTrace had been around the framework and the kernel function boundary tracing provider (fbt). The kernel tracing providers were -- and continue to be -- immensely useful for us working on the Solaris kernel, but when I talk to customers and folks from Sun working to get performance wins on business applications, they're using the user-level tracing components for a large majority of the D-scripts they write. But that's not to say that the kernel tracing components aren't vital -- without facilities like the io provider or the sched provider many of those performance wins would be impossible to attain -- rather most of the problems manifest themselves in user-land and that's where most investigations begin.

When I (or others) start using DTrace on an application, the first D invocations we use are intended to gather information about the application -- where is it spending its time, what are the hot functions, etc. As the investigation narrows in scope further and further, it becomes increasingly important to understand the underlying activity in the kernel, be it the many kernel function calls, or its interaction with the system's physical resources. The user-level and kernel providers are both required to form a complete whole. Without either half of the equation, you aren't left with a DTrace that's half as useful; you're left with what is effectively another specialized tool for a specific area of the system that doesn't actually address the problem of systemic comprehension.

We've continued working to improve DTrace since we first integrated it into Solaris 10 -- there have been over 200 changes to add features or address problems. It's worth noting that much of our recent work and the bulk of our future work involves improving support for user-land. DTrace took its first steps into the JVM just a few months ago, and there's a huge amount of work left to be done for truly elegant integration. User-level statically defined tracing (USDT) has proved to be a great tool, and we're working to make it easier to support, maintain and use.

There's still some work to be done in terms of coverage for kernel tracing -- a more general networking provider would be great -- but in many ways, tracing the kernel was the easy part. In kernel-mode we have nearly complete control of the hardware and the software; there's only one kernel to worry about and it can't go away (or if it does DTrace goes with it). In user-land, processes can come and go as they please; they can be as ill-constructed, ill-compiled, or ill-behaved as they like; they can have complex interactions between each other; and depend on wildly unpredictable behaviors of the system. In some ways user-land is simpler in that you can't destroy the box, but as long as they obey a few rules, applications can be as crazy as they like. Taming user-land continues to be an ongoing challenge for DTrace.

What has made DTrace so successful is not its ability to observe any one aspect of the system (be it I/O, Java, or C applications), but its ability to observe the whole system. Since the framework lives in the kernel, the kernel was the first obvious source of data, but without user-level tracing DTrace would be an incomplete solution to understanding the system as a whole.

Technorati Tags: DTrace, Solaris, OpenSolaris



DTrace Presentation at JavaOne

Thanks to everyone who attended Jarod's and my talk this afternoon at JavaOne and especially to those who had to go to the overflow room. The rest of the DTrace team and I were thrilled by the turnout -- we would never have thought that 900+ Java developers would be so interested in DTrace. We spent the next few hours hashing out some ways to get better support for Java in DTrace; we look forward to giving an update at next year's conference.

As promised, here are the slides from the talk:

To get started with the dvm provider on your Solaris 10 system, download the agent here. You may need to set your LD_LIBRARY_PATH to point to wherever you install it. Then invoke java with the -Xrundvmti:all option.

I wasn't able to capture the command history as was requested, but Bryan wrote up a nice post which can be used for the first part of the talk, and here are some of the java-specific commands from today.

    # dtrace -n dvm`pgrep java`:::method-entry'{ @[copyinstr(arg0), copyinstr(arg1)] = count() }'
    # dtrace -n dvm`pgrep java`:::object-alloc'{ @[jstack(20, 8000)] = count() }'
    # dtrace -n dvm`pgrep java`:::object-alloc'/copyinstr(arg0) == "java/awt/Rectangle"/{}'

Try mixing and matching. Check out the Solaris Dynamic Tracing Guide for the exhaustive reference guide. If you have questions or want to figure out some more scripts, feel free to post a question here or -- even better -- on the DTrace forum on opensolaris.org.

Technorati Tags: DTrace, JavaOne



Open-Sourcing the JavaOne Keynote

This morning I gave a demo of DTrace with the Java agents during the keynote at JavaOne. In the past few hours I've had a lot of great feedback from Java developers -- we've found a bunch of big performance wins already, and I expect we'll find more this week (remember the DTrace challenge). For the demo, I ran /usr/java/demo/jfc/Java2D/Java2Demo.jar with the Java DTrace agents enabled and ran a couple of scripts on it.

The first script just gathered a frequency count for each method invoked -- nothing too subtle:

jmethods.d

    #!/usr/sbin/dtrace -s

    dvm$1:::method-entry
    {
            @[copyinstr(arg0), copyinstr(arg1)] = count();
    }

    END
    {
            printa("%-10@u %s.%s()\n", @);
    }

    bash-3.00# dtrace -s jmethods.d `pgrep java`
    ...
    574        sun/java2d/SunGraphics2D.getCompClip()
    608        sun/java2d/pipe/Region.dimAdd()
    648        java/lang/ref/Reference.get()
    671        java/awt/geom/AffineTransform.transform()
    685        java/awt/Component.getParent_NoClientCode()
    685        java/awt/Component.getParent()
    702        sun/misc/VM.addFinalRefCount()
    798        java/lang/ref/ReferenceQueue.remove()
    809        java/lang/ref/Reference.access$200()
    923        java/awt/geom/RectIterator.isDone()
    1228       sun/dc/pr/Rasterizer.nextTile()
    1657       sun/dc/pr/Rasterizer.getTileState()
    1692       sun/java2d/pipe/AlphaColorPipe.renderPathTile()
    1692       sun/java2d/pipe/AlphaColorPipe.needTile()
    1702       sun/java2d/SunGraphics2D.getSurfaceData()
    3457       java/lang/Math.min()
    ^C

The second demo was a little more exciting: this guy followed a thread of control all the way from Java code through the native library code, the system calls, and all the kernel function calls to the lowest levels of the system. Each different layer of the stack is annotated with color -- the first use of color in a DTrace script as far as I know.

follow.d

    #!/usr/sbin/dtrace -s

    /*
     * This script was used for the DTrace demo during the JavaOne keynote.
     *
     * VT100 escape sequences are used to produce multi-colored output from
     * dtrace(1M). Pink is Java code, red is library code, blue is system calls,
     * and green is kernel function calls.
     */

    #pragma D option quiet

    dvm$1:::method-entry
    /copyinstr(arg0) == "sun/java2d/pipe/AlphaColorPipe" &&
        copyinstr(arg1) == "renderPathTile"/
    {
            self->interested = 1;
            self->depth = 8;
    }

    dvm$1:::method-entry
    /self->interested/
    {
            printf("\033[01;35m%*.*s -> %s.%s\033[0m\n", self->depth,
                self->depth, "", copyinstr(arg0), copyinstr(arg1));
            self->depth += 2;
    }

    dvm$1:::method-return
    /self->interested/
    {
            self->depth -= 2;
            printf("\033[01;35m%*.*s <- %s.%s\033[0m\n", self->depth,
                self->depth, "", copyinstr(arg0), copyinstr(arg1));
    }

    pid$1:::entry
    /self->interested && probemod != "libdvmti.so"/
    {
            printf("\033[01;31m%*.*s -> %s`%s\033[0m\n", self->depth,
                self->depth, "", probemod, probefunc);
            self->depth += 2;
    }

    pid$1:::return
    /self->interested && probemod != "libdvmti.so"/
    {
            self->depth -= 2;
            printf("\033[01;31m%*.*s <- %s`%s\033[0m\n", self->depth,
                self->depth, "", probemod, probefunc);
    }

    syscall:::entry
    /self->interested/
    {
            printf("\033[01;34m%*.*s -> %s\033[0m\n", self->depth,
                self->depth, "", probefunc);
            self->depth += 2;
    }

    syscall:::return
    /self->interested/
    {
            self->depth -= 2;
            printf("\033[01;34m%*.*s <- %s\033[0m\n", self->depth,
                self->depth, "", probefunc);
    }

    fbt:::entry
    /self->interested/
    {
            printf("\033[32m%*.*s -> %s\033[0m\n", self->depth,
                self->depth, "", probefunc);
            self->depth += 2;
    }

    fbt:::return
    /self->interested/
    {
            self->depth -= 2;
            printf("\033[32m%*.*s



Comment out of context

The OpenSolaris launch has been pretty fun -- I've already had some discussions with customers over the source code. Of course, the first thing people seemed to do with the source code is look for references to "shit" and "fuck". This was titillating to be sure. Unsatisfied with the cheap laugh, ZDNet wanted to draw some conclusions from the profanity:

    The much-vaunted dynamic tracing (dtrace) feature of Sun's system may not be as safe to use as most people think.

    "This bit me in the ass a couple of times, so lets toss this in as a cursory sanity check," wrote one careful developer in the dtrace section.

I wrote that code in October of 2002. For those of you keeping score at home, that's almost a year before DTrace integrated into Solaris 10 and more than two years before Solaris 10 hit the streets. Here's the larger context of that comment:

    /*
     * This bit me in the ass a couple of times, so lets toss this
     * in as a cursory sanity check.
     */
    ASSERT(pc != rp->r_g7 + 4);
    ASSERT(pc != rp->r_g7 + 8);

This gets pretty deep into the bowels of the pid provider, but the code preceding these ASSERTs does the work of modifying the registers appropriately to emulate the effects of the traced instruction. For most instructions, we relocate the instruction bits to some per-thread scratch space and execute them there. We keep this scratch space in the user-land per-thread structure which, on SPARC, is always pointed to by the %g7 register (rp->r_g7 in the code above). The tricky thing is that while we change the program counter (%pc) to point to the scratch space, we leave the next program counter (%npc) where it was.

A bug I ran into very early in development was winding up executing the wrong code because I had incorrectly emulated an instruction. One way this manifested itself was that the program counter was set to %g7 + 4 or %g7 + 8. I added these ASSERTs after tracking down the problem -- not because it was a condition that I thought should be handled, but because I wanted everything to stop immediately if it did.

In the nearly three years this code has existed, those ASSERTs have never been tripped. Of course, I didn't expect them to be -- they were a cursory sanity check so I could be sure this aberrant condition wasn't occurring. Of course, if I had omitted the curse this might not have inspired such a puerile thrill.



Debugging cross calls on OpenSolaris

I think the thing I love most about debugging software is that each tough bug can seem like an insurmountable challenge -- until you figure it out. But until you do, each tough bug is the hardest problem you've ever had to solve. There are weeks when every morning presents me with a seemingly impossible challenge, and each afternoon I get to spike my keyboard and do a little victory dance before running for the train.

For my first OpenSolaris blog post, I thought I'd talk about one of my favorite such bugs. This particularly nasty bug had to do with a tricky interaction between VMware and DTrace (pre-production versions of each to be clear). My buddy Keith -- a fellow Brown CS grad -- gave me a call and told me about some strange behavior he was seeing running DTrace inside of a VMware VM. Keith is a big fan of DTrace, but an intermittent but reproducible problem was putting a damper on his DTrace enthusiasm. Every once in a while, running DTrace would cause the system to freeze. Because Solaris was running in the virtual environment, he could see that both virtual CPUs were spinning away, but they weren't making any forward progress. After a couple of back and forths over email, I made the trip down to Palo Alto so we could work on the problem together.

Using some custom debugging tools, we were able to figure out where the two virtual CPUs were spinning. One CPU was in xc_common() and the other was in xc_serv() -- code to handle cross calls. So what was going on?

cross calls

Before I can really delve into the problem, I want to give just a brief overview of cross calls. In general terms, a cross call (xcall) is used in a multi-processor (MP) system when one CPU needs to get another CPU to do some work. It works by sending a special type of interrupt which the remote CPU handles. You may have heard the term interprocessor interrupt (IPI) -- same thing. One example of when xcalls are used is when unmapping a memory region. To unmap a region, a process will typically call the munmap(2) system call.
Remember that in an MP system, any processor may have run threads in this process so those mappings may be present in any CPU's TLB. The unmap operation executes on one CPU, but the other CPUs need to remove the relevant mappings from their own TLBs. To accomplish this communication, the kernel uses a xcall.

DTrace uses xcalls to synchronize data used by all CPUs by ensuring that all CPUs have reached a certain point in the code. DTrace executes actions with interrupts disabled (an explanation of why this must be so is well beyond the bounds of this discussion) so we can tell that a CPU isn't in probe context if it's able to handle our xcall. When DTrace is stopping tracing activity, for example, it will update some data that affects all CPUs and then use a xcall to make sure that every CPU has seen its effects before proceeding:

dtrace_state_stop()

    /*
     * We'll set the activity to DTRACE_ACTIVITY_DRAINING, and issue a sync
     * to be sure that every CPU has seen it. See below for the details
     * on why this is done.
     */
    state->dts_activity = DTRACE_ACTIVITY_DRAINING;
    dtrace_sync();

dtrace_sync() sends a xcall to all other CPUs and has them spin in a holding pattern until all CPUs have reached that point, at which time the CPU which sent the xcall releases them all (and they go back to whatever they had been doing when they received the interrupt). That's the high level overview; let's go into a little more detail on how xcalls work (well, actually a lot more detail).

xcall implementation

If you follow the sequence of functions called by dtrace_sync() (and I encourage you to do so), you'll find that this eventually calls xc_common() to do the heavy lifting. It's important to note that this call to xc_common() will have the sync argument set to 1. What's that mean? In a textbook example of good software engineering, someone did a good job documenting what this value means:

xc_common()

    /*
     * Common code to call a specified function on a set of processors.
     * sync specifies what kind of waiting is done.
     *    -1 - no waiting, don't release remotes
     *    0 - no waiting, release remotes immediately
     *    1 - run service locally w/o waiting for remotes.
     *    2 - wait for remotes before running locally
     */
    static void
    xc_common(
        xc_func_t func,
        xc_arg_t arg1,
        xc_arg_t arg2,
        xc_arg_t arg3,
        int pri,
        cpuset_t set,
        int sync)

Before you start beating your brain out trying to figure out what you're missing here, in this particular case, the distinction between sync having the value of 1 and 2 is nil: the service (function pointer specified by the func argument) that we're running is dtrace_sync_func() which does literally nothing.

Let's start picking apart xc_common():

xc_common()

    /*
     * Request service on all remote processors.
     */
    for (cix = 0; cix < NCPU; cix++) {
        if ((cpup = cpu[cix]) == NULL ||
            (cpup->cpu_flags & CPU_READY) == 0) {
            /*
             * In case CPU wasn't ready, but becomes ready later,
             * take the CPU out of the set now.
             */
            CPUSET_DEL(set, cix);
        } else if (cix != lcx && CPU_IN_SET(set, cix)) {
            CPU_STATS_ADDQ(CPU, sys, xcalls, 1);
            cpup->cpu_m.xc_ack[pri] = 0;
            cpup->cpu_m.xc_wait[pri] = sync;
            if (sync > 0)
                cpup->cpu_m.xc_state[pri] = XC_SYNC_OP;
            else
                cpup->cpu_m.xc_state[pri] = XC_CALL_OP;
            cpup->cpu_m.xc_pend[pri] = 1;
            send_dirint(cix, xc_xlat_xcptoipl[pri]);
        }
    }

We take a first pass through all the processors; if the processor is ready to go and is in the set of processors we care about (they all are in the case of dtrace_sync()) we set the remote CPU's ack flag to 0, its wait flag to sync (remember, 1 in this case), and its state flag to XC_SYNC_OP and then actually send the interrupt with the call to send_dirint().

Next we wait for the remote CPUs to acknowledge that they've executed the requested service which they do by setting the ack flag to 1:

xc_common()

    /*
     * Wait here until all remote calls complete.
     */
    for (cix = 0; cix < NCPU; cix++) {
        if (lcx != cix && CPU_IN_SET(set, cix)) {
            cpup = cpu[cix];
            while (cpup->cpu_m.xc_ack[pri] == 0) {
                ht_pause();
                return_instr();
            }
            cpup->cpu_m.xc_ack[pri] = 0;
        }
    }

That while loop spins waiting for ack to become 1. If you look at the definition of return_instr() its name is actually more descriptive than you might imagine: it's just a return instruction -- the most trivial function possible. I'm not absolutely certain, but I think it was put there so the compiler wouldn't "optimize" the loop away. The call to the inline function ht_pause() is so that the thread spins in a way that's considerate on a hyper-threaded CPU. The call to ht_pause() is probably sufficient to prevent the compiler from being overly clever, but the legacy call to return_instr() remains.

Now let's look at the other side of this conversation: what happens on a remote CPU as a result of this interrupt?
This code is in xc_serv():

xc_serv()

    /*
     * Acknowledge that we have completed the x-call operation.
     */
    cpup->cpu_m.xc_ack[pri] = 1;

I'm sure it comes as no surprise that after executing the given function, it just sets the ack flag.

Since in this case we're dealing with a synchronous xcall, the remote CPU then needs to just chill out until the CPU that initiated the xcall discovers that all remote CPUs have executed the function and are ready to be released:

xc_serv()

    /*
     * for (op == XC_SYNC_OP)
     * Wait for the initiator of the x-call to indicate
     * that all CPUs involved can proceed.
     */
    while (cpup->cpu_m.xc_wait[pri]) {
        ht_pause();
        return_instr();
    }

    while (cpup->cpu_m.xc_state[pri] != XC_DONE) {
        ht_pause();
        return_instr();
    }

And here's the code on the initiating side that releases all the remote CPUs by setting the wait and state flags to the values that the remote CPUs are waiting to see:

xc_common()

    /*
     * Release any waiting CPUs
     */
    for (cix = 0; cix < NCPU; cix++) {
        if (lcx != cix && CPU_IN_SET(set, cix)) {
            cpup = cpu[cix];
            if (cpup != NULL && (cpup->cpu_flags & CPU_READY)) {
                cpup->cpu_m.xc_wait[pri] = 0;
                cpup->cpu_m.xc_state[pri] = XC_DONE;
            }
        }
    }

there's a problem

Wait! Without reading ahead in the code, does anyone see the problem?

Back at VMware, Keith hacked up a version of the virtual machine monitor which allowed us to trace certain points in the code and figure out the precise sequence in which they occurred. We traced the entry and return to xc_common() and xc_serv().
Almost every time we'd see something like this:

    enter xc_common() on CPU 0
    enter xc_serv() on CPU 1
    exit xc_serv() on CPU 1
    exit xc_common() on CPU 0

or:

    enter xc_common() on CPU 0
    enter xc_serv() on CPU 1
    exit xc_common() on CPU 0
    exit xc_serv() on CPU 1

But the problem happened when we saw a sequence like this:

    enter xc_common() on CPU 0
    enter xc_serv() on CPU 1
    exit xc_common() on CPU 0
    enter xc_common() on CPU 0

And nothing further. What was happening was that after releasing remote CPUs, CPU 0 was exiting from the call to xc_common() and calling it again before the remote invocation of xc_serv() on the other CPU had a chance to exit.

Recall that one of the first things that xc_common() does is set the state flag. The first call to xc_common() sets the state flag to release the remote CPU from xc_serv(), but when things went wrong, the second call to xc_common() was overwriting that flag before the remote CPU got a chance to see it.

the problem

We were seeing this repeatably under VMware, but no one had seen this at all on real hardware (yet). The machine Keith and I were using was a 2-way box running Linux. On VMware, each virtual CPU is represented by a thread on the native OS so rather than having absolute control of the CPU, the execution was more or less at the whim of the Linux scheduler.

When this code is running unadulterated on physical CPUs, we won't see this problem. It's just a matter of timing -- the remote CPU has many many fewer instructions to execute before the state flag gets overwritten by a second xcall so there's no problem. On VMware, the Linux scheduler might decide that your second virtual CPU's right to the physical CPU is trumped by moving the hands on your xclock (why not?) so there are no guarantees about how long these operations can take.

the fix

There are actually quite a few ways to fix this problem -- I'm sure you can think of at least one or two off the top of your head. We just need to make sure that subsequent xcalls can't interfere with each other.
When we found this, Solaris 10 was wrapping up -- we were still making changes, but only those deemed of the absolute highest importance. Making changes to the xcall code (which is rather delicate and risky to change) for a bug that only manifests itself on virtual hardware (and which VMware could work around using some clever trickery[1]) didn't seem worthy of being designated a show-stopper.

Keith predicted a few possible situations where this same bug could manifest itself on physical CPUs: on hyper-threaded CPUs, or in the presence of service management interrupts. And that prediction turned out to be spot on: a few weeks after root causing the bug under VMware, we hit the same problem on a system with four hyper-threaded chips (8 logical CPUs).

Since at that time we were even closer to shipping Solaris 10, I chose the fix I thought was the safest and least likely to have nasty side effects. After releasing remote CPUs, the code in xc_common() would now wait for remote CPUs to check in -- wait for them to acknowledge receipt of the directive to proceed.

xc_common()

    /*
     * Wait for all CPUs to acknowledge completion before we continue.
     * Without this check it's possible (on a VM or hyper-threaded CPUs
     * or in the presence of Service Management Interrupts which can all
     * cause delays) for the remote processor to still be waiting by
     * the time xc_common() is next invoked with the sync flag set
     * resulting in a deadlock.
     */
    for (cix = 0; cix < NCPU; cix++) {
        if (lcx != cix && CPU_IN_SET(set, cix)) {
            cpup = cpu[cix];
            if (cpup != NULL && (cpup->cpu_flags & CPU_READY)) {
                while (cpup->cpu_m.xc_ack[pri] == 0) {
                    ht_pause();
                    return_instr();
                }
                cpup->cpu_m.xc_ack[pri] = 0;
            }
        }
    }

In that comment, I tried to summarize in 6 lines what has just taken me several pages to describe. And maybe I should have said "livelock" -- oh well.
Here's the complementary code in xc_serv():

xc_serv()

    /*
     * Acknowledge that we have received the directive to continue.
     */
    ASSERT(cpup->cpu_m.xc_ack[pri] == 0);
    cpup->cpu_m.xc_ack[pri] = 1;

conclusions

That was one of my favorite bugs to work on, and it's actually fairly typical of a lot of the bugs I investigate: something's going wrong; figure out why. I think the folks who work on Solaris tend to love that kind of stuff as a rule. We spend tons of time building facilities like DTrace, mdb(1), kmdb, CTF, fancy core files, and libdis so that the hard part of investigating mysterious problems isn't gathering data or testing hypotheses, it's thinking of the questions to answer and inventing new hypotheses.

It's my hope that OpenSolaris will attract those types of inquisitive minds that thrive on the (seemingly) unsolvable problem.

[1] This sort of problem is hardly unique to DTrace or to Solaris. Apparently (and not surprisingly) there are problems like this in nearly every operating system where the code implicitly or explicitly relies on the relative timing of certain operations. In these cases, VMware has hacks to do things like execute the shifty code in lock step.
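To make the race easier to see, here is a minimal single-threaded sketch in C that replays the interleavings from the traces against one remote CPU. It is a toy model and not the kernel code: struct xc_model, remote_step(), xc_call() and back_to_back_xcalls() are hypothetical names invented for illustration, and the "stalled" flag simply stands in for the remote CPU being descheduled right after the initiator releases it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one remote CPU's xcall flags (not the kernel structs). */
enum xc_state { XC_DONE, XC_SYNC_OP };
enum remote_phase { REMOTE_IDLE, REMOTE_IN_SERV };

struct xc_model {
    int ack;                    /* set by the remote, cleared by the initiator */
    int wait;                   /* remote spins while this is non-zero */
    int pend;                   /* an interrupt has been posted */
    enum xc_state state;
    enum remote_phase phase;    /* is the remote still inside xc_serv()? */
};

/* Initiator side: post a synchronous request (sync == 1). */
static void
initiator_request(struct xc_model *m)
{
    m->ack = 0;
    m->wait = 1;
    m->state = XC_SYNC_OP;
    m->pend = 1;
}

/* Initiator side: release the waiting remote. */
static void
initiator_release(struct xc_model *m)
{
    m->wait = 0;
    m->state = XC_DONE;
}

/* Remote side: whatever the remote CPU can do when it next gets to run. */
static void
remote_step(struct xc_model *m, bool with_fix)
{
    if (m->phase == REMOTE_IDLE && m->pend) {
        /* Take the interrupt, run the (empty) service, acknowledge. */
        m->pend = 0;
        m->ack = 1;
        m->phase = REMOTE_IN_SERV;
    } else if (m->phase == REMOTE_IN_SERV &&
        m->wait == 0 && m->state == XC_DONE) {
        /* Released: with the fix, acknowledge the release as well. */
        if (with_fix) {
            assert(m->ack == 0);
            m->ack = 1;
        }
        m->phase = REMOTE_IDLE;
    }
    /* Otherwise the remote keeps spinning and nothing changes. */
}

/* One xcall; returns false if the initiator would spin forever. */
static bool
xc_call(struct xc_model *m, bool with_fix, bool remote_stalled)
{
    initiator_request(m);
    remote_step(m, with_fix);
    if (m->ack == 0)
        return (false);         /* the remote never acknowledged */
    m->ack = 0;
    initiator_release(m);

    if (with_fix) {
        /* The initiator waits, so the remote must run before we return. */
        remote_step(m, with_fix);
        if (m->ack == 0)
            return (false);
        m->ack = 0;
    } else if (!remote_stalled) {
        remote_step(m, with_fix);
    }
    return (true);
}

/* Two xcalls in a row; the remote may be stalled after the first release. */
static bool
back_to_back_xcalls(bool with_fix, bool remote_stalled)
{
    struct xc_model m = { 0 };

    return (xc_call(&m, with_fix, remote_stalled) &&
        xc_call(&m, with_fix, false));
}
```

In this model, back_to_back_xcalls(false, true) reports a hang: the second request overwrites the wait and state flags while the remote is still inside its first xc_serv(), so it never sees the release. With the extra acknowledgement, back_to_back_xcalls(true, true) completes because the initiator cannot start the second call until the remote has checked in.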



Real Java debugging w/ DTrace

When I was in college one of the rites of passage in the computer science department was the software engineering class which involved a large group project. Fresh from completing that class, my brother turned up the other day in San Francisco (where I live); naturally I wanted to try out the game he and his friends had written. Hogs is a 3-D tank game written in Java -- when it failed to run on my Solaris 10 laptop I decided to use the new DTrace agents for the JVM that I blogged about recently.

After downloading the game and the requisite libraries (jogl, OGL, etc.) I tried running it and got this:

    java.net.UnknownHostException: epizooty: epizooty
        at java.net.InetAddress.getLocalHost(InetAddress.java:1308)
        at hogs.net.client.RemoteEngine.<init>(RemoteEngine.java:79)
        at hogs.net.client.NetEngine.forHost(NetEngine.java:93)
        at hogs.common.Controller.<init>(Controller.java:226)
        at hogs.common.Controller.main(Controller.java:118)

Without understanding much about Java or anything about how my brother's game worked, I guessed that this code was trying to figure out the hostname of my laptop. The strange thing was that it seemed to find the name -- epizooty -- but then get confused and throw some exception. The stack backtrace didn't give me much to go on so I decided to put this new Java DTrace agent to the test.

Using the dvm provider was, initially, a bit of a pain (through no fault of its own). The dvm provider is very easy to use for long running Java programs: you just fire up the JVM and enable the probes at some later time. Because of the failure during start up, the game wasn't sticking around long enough for me to enable the probes. And while dtrace(1M) has a -c option that lets you specify a command to examine with DTrace, the dvm probes don't show up until a little later when the JVM has initialized.
It's worth mentioning that on the next version of Solaris (available via Solaris Express) we've added a feature that lets you specify probes that don't yet exist that will be enabled when they show up; that feature will be in an early Solaris 10 update. Since this was a stock Solaris 10 system though, I had to get creative.

Using some knowledge of how DTrace user-level statically defined tracing (USDT) providers load, I wrote stop.d, which waits until the dvm provider loads and stops the process. After the process is stopped, another invocation of DTrace can then use the dvm provider.

    #!/usr/sbin/dtrace -s

    #pragma D option destructive

    syscall::close:entry
    /pid == $target &&
        basename(curthread->t_procp->p_user.u_finfo.fi_list[arg0].uf_file->f_vnode->v_path) == "dtrace@0:helper"/
    {
        self->interested = 1;
    }

    syscall::close:entry
    /self->interested/
    {
        cnt++;
    }

    syscall::close:entry
    /self->interested && cnt == 2/
    {
        stop();
        printf("stopped %d\n", pid);
        exit(0);
    }

    syscall::close:return
    /self->interested/
    {
        self->interested = 0;
    }

DTrace USDT providers and helpers open a special helper pseudo device to register with the DTrace framework. When they're done, they use the close(2) system call to close the file descriptor to the device. What this script does is look for calls to close(2) where the file descriptor corresponds to that pseudo device. It's worth mentioning here that in the next version of Solaris there's a fds[] array that gives you the file name and other information for an open file descriptor so this will be a little cleaner in the future. The script looks for the second such close(2) because the JVM itself has a DTrace helper which enables the magic of the jstack() action.
To be clear: I'm not particularly proud of this script, but it got the job done.

Once I had the game stopped at the right spot, I traced the dvm method-entry probes; amid the noise this snippet looked interesting:

    0  34481 _method_entry:method-entry -> java/net/InetAddress$1.lookupAllHostAddr()
    0  34481 _method_entry:method-entry -> java/net/UnknownHostException.<init>()

So this lookupAllHostAddr() method was throwing the exception that was causing me so much heartache. I wanted to understand the actual interaction between this method and lower level address resolution. It turned out that the native library calls were in a shared object that the JVM was lazily loading so I needed to stop the process after the native library had been loaded but before the method had completed. I wrote the following as a sort of conditional breakpoint:

    #!/usr/sbin/dtrace -s

    #pragma D option destructive

    dvm$target:::method-entry
    /copyinstr(arg1) == "getLocalHost"/
    {
        self->state = 1;
    }

    dvm$target:::method-entry
    /copyinstr(arg1) == "lookupAllHostAddr" && self->state == 1/
    {
        self->state = 2;
        stop();
        exit(0);
    }

    dvm$target:::method-return
    /copyinstr(arg1) == "lookupAllHostAddr" && self->state == 2/
    {
        self->state = 1;
    }

    dvm$target:::method-return
    /copyinstr(arg1) == "getLocalHost" && self->state == 1/
    {
        self->state = 0;
    }

Sifting through some more data, I figured out the name of the native function that was being used to implement lookupAllHostAddr() and wrote this script to follow the program flow from there:

    #!/usr/sbin/dtrace -s

    #pragma D option flowindent

    pid$target::Java_java_net_Inet4AddressImpl_lookupAllHostAddr:entry
    {
        self->interested = 1;
    }

    pid$target:::entry
    /self->interested/
    {
    }

    pid$target:::return
    /self->interested/
    {
        printf("+%x %x (%d)", arg0, arg1, errno);
    }

    pid$target::gethostbyname_r:entry
    /self->interested/
    {
        printf("hostname = %s", copyinstr(arg0));
    }

    pid$target::Java_java_net_Inet4AddressImpl_lookupAllHostAddr:return
    /self->interested/
    {
        self->interested = 0;
    }

In the output I found a smoking gun: gethostbyname_r(3NSL) was returning NULL.
A little more investigation confirmed that the argument to gethostbyname_r(3NSL) was "epizooty"; a little test program showed the same problem. Now well away from Java and in more familiar waters, I quickly realized that adding an entry into /etc/hosts was all I needed to do to clear up the problem.

This was a great experience: not only was I able to use this dvm stuff to great effect (for which my excitement had been largely theoretical), but I got to prove to my brother how amazingly cool this DTrace thing really is. As I haven't done any serious Java debugging for quite a while I'd like to pose this question to anyone who's managed to stay with me so far: How would anyone debug this without DTrace? Are there other tools that let you observe Java and the native calls and the library routines? And, though I didn't need it here, are there tools that let you correlate Java calls to low level kernel facilities? I welcome your feedback.
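For readers who want to try the same check outside of Java, here is a minimal sketch in C of what such a "little test program" could look like. It is not the program from this post: it swaps in the portable getaddrinfo(3) call in place of gethostbyname_r(3NSL), whose signature differs between platforms, and host_resolves() is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200112L

#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/*
 * Hypothetical helper: returns 1 if `name` resolves to at least one
 * address through the system resolver (hosts file, DNS, ...), 0 if not.
 */
static int
host_resolves(const char *name)
{
    struct addrinfo hints = { 0 };
    struct addrinfo *res = NULL;

    hints.ai_family = AF_UNSPEC;        /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(name, NULL, &hints, &res) != 0)
        return (0);

    freeaddrinfo(res);
    return (1);
}
```

Calling host_resolves() with the machine's own hostname before and after editing /etc/hosts is one way to confirm that the failure lives in name resolution and not in the Java code.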



DTracing Java

DTrace has cast light on parts of the system that were previously only dimly illuminated by previous tools, but there have been some parts of the system frustratingly left in the dark. The prevalent example is Java. Java has been relatively unobservable with DTrace; the jstack() action has offered a narrow beam of light into interactions between Java code and the rest of the system, but what we really need is Java probes in the DTrace framework.

DTrace users really want to be able to trace Java methods in the same way they can trace C function calls in the application (or in the kernel). We haven't quite reached that Xanadu yet, but Kelly O'Hair (with the inspiration and prodding of Jarod Jenson) has created JVMPI and JVMTI agents that export the instrumentation provided by those frameworks into DTrace.

For example, examining the size of Java allocations is now a snap. The object-alloc probe fires every time an object gets allocated, and one of the arguments is the size.

    # dtrace -n 'djvm$target:::object-alloc{ @ = quantize(arg1) }' -p `pgrep -n java`
    dtrace: description 'djvm$target:::object-alloc' matched 1 probe
    ^C

               value  ------------- Distribution ------------- count
                   4 |                                         0
                   8 |                                         43
                  16 |@@@@@@@@@@@@@@@@@                        18771
                  32 |@@@@@@@@@@@@@@@@                         17482
                  64 |@@@@@                                    5292
                 128 |@                                        1486
                 256 |                                         106
                 512 |                                         165
                1024 |                                         319
                2048 |                                         149
                4096 |                                         48
                8192 |                                         0
               16384 |                                         1
               32768 |                                         1
               65536 |                                         0

One of the most troublesome areas when dealing with production Java code seems to be around garbage collection.
There are two probes -- that fire at the start and end of a GC run -- that can be used, for example, to look for latency spikes in garbage collection:

    bash-3.00# dtrace -s /dev/stdin -p `pgrep -n java`
    djvm$target:::gc-start
    {
        self->ts = vtimestamp;
    }

    djvm$target:::gc-finish
    /self->ts/
    {
        @ = quantize(vtimestamp - self->ts);
        self->ts = 0;
    }
    dtrace: script '/dev/stdin' matched 2 probes
    ^C

               value  ------------- Distribution ------------- count
            16777216 |                                         0
            33554432 |@@                                       1
            67108864 |@@@@@@                                   3
           134217728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         16
           268435456 |                                         0

Let's say there's some intermittent problem where GC takes a long time. This sort of script can help you identify the standard behavior and the outliers, and other DTrace facilities -- notably speculative tracing -- will let you drill down on the source of the problem ("I care about these events only when the GC run takes more than 10ms").

One of the most exciting moments early on with DTrace was observing the flow of control from a user-land function through a system call and into the kernel -- as far as I know we were seeing something that hadn't been done before (Bryan's last example in this blog post demonstrates this).
Here's a script that instruments a particular Java method (java.io.InputStreamReader.read()) and follows the thread of control from that method call through Java, libc, the system call table, and the kernel -- and back out:

    #pragma D option quiet

    djvm$target:::method-entry
    /copyinstr(arg0) == "java/io/InputStreamReader" && copyinstr(arg1) == "read"/
    {
        self->interested = 1;
        self->indent = 0;
    }

    djvm$target:::method-entry
    /self->interested/
    {
        self->indent += 2;
        printf("%*s -> %s:%s\n", self->indent, "", copyinstr(arg0), copyinstr(arg1));
    }

    djvm$target:::method-return
    /self->interested/
    {
        printf("%*s %s\n", self->indent, "", probefunc);
    }

    syscall:::return
    /self->interested/
    {
        printf("%*s %s:%s\n", self->indent, "", probemod, probefunc);
    }

    pid$target:libc.so.1::return
    /self->interested/
    {
        printf("%*s %s:%s\n", self->indent, "", probemod, probefunc);
    }

    fbt:::return
    /self->interested/
    {
        printf("%*s



the pid provider and > 10 arguments

A long-time DTrace user was recently examining an ugly C++ application, and tried this obvious DTrace invocation to trace the 15th argument (zero-indexed) to a particularly ugly function:

    # dtrace -n pid123::foobar:entry'{ trace(arg15); }'
    dtrace: invalid probe specifier pid380863:::entry{ trace(arg15); }: in action list: failed to resolve arg15: Unknown variable name

As described in the Solaris Dynamic Tracing Guide we actually only provide access to arguments 0-9. I suppose you could call this a design oversight, but really it reflects our bias about software -- no one's going to want to call your function if it has a bazillion arguments.

But -- as with most things pertaining to C++ -- sometimes you just have to hold your nose and get it working. If you need to trace arguments past arg9 in functions you're observing with the pid provider, here's how you can do it:

x86

    this->argN = *(uint32_t *)copyin(uregs[R_SP] + sizeof (uint32_t) * (this->N + 1), sizeof (uint32_t));

x86-64/AMD64

    this->argN = *(uint64_t *)copyin(uregs[R_SP] + sizeof (uint64_t) * (this->N - 5), sizeof (uint64_t));

SPARC 32-bit

    this->argN = *(uint32_t *)copyin(uregs[R_SP] + sizeof (uint32_t) * (this->N + 17), sizeof (uint32_t));

SPARC 64-bit

    this->argN = *(uint64_t *)copyin(uregs[R_SP] + 0x7ff + sizeof (uint64_t) * (this->N + 16), sizeof (uint64_t));

Note that for SPARC (32-bit and 64-bit) as well as AMD64, these formulas only work for arguments past the sixth -- but then you should probably be using arg0 .. arg9 when you can.

UPDATE

The methods above only apply for integer arguments; while I think it will work for 32-bit x86, the other architectures can pass floating-point arguments in registers as well as on the stack. Perhaps a future entry will discuss floating-point arguments if anyone cares.

There are a couple of gotchas I neglected to mention. On AMD64 if the argument is less than 64 bits (e.g. an int), the compiler can leave garbage in the upper bits meaning that you have to cast the variable to the appropriate type in DTrace (e.g. trace((int)this->argN)). On both 32-bit architectures, 64-bit arguments are passed in 2 of these 32-bit arguments; to get the full 64-bit value, just shift and or the two arguments together (e.g. ((this->arg13 << 32) | this->arg14) or ((this->arg14 << 32) | this->arg13) for SPARC and x86 respectively). Even for arguments that you can get with the built-in variables, you will need to mash together 64-bit arguments on 32-bit architectures (except on the SPARCv8+ ABI which can pass the first 6 arguments in 64-bit registers).
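The four formulas above differ only in word size and a per-ABI constant, which a short sketch in C can make explicit. This is only arithmetic on the numbers given in this post, not DTrace code: enum abi, arg_stack_offset() and join_halves() are hypothetical names, and the function computes the byte offset from the stack pointer that the copyin() call would read at function entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tags for the four cases listed in the post. */
enum abi { ABI_X86, ABI_AMD64, ABI_SPARC32, ABI_SPARC64 };

/*
 * Byte offset from the stack pointer of integer argument n (zero-indexed)
 * at function entry. Except on 32-bit x86, the formulas only hold for
 * arguments past the sixth, so n >= 6 is required there.
 */
static size_t
arg_stack_offset(enum abi abi, unsigned int n)
{
    assert(abi == ABI_X86 || n >= 6);

    switch (abi) {
    case ABI_X86:
        /* skip the return address, then 4 bytes per argument */
        return (sizeof (uint32_t) * (n + 1));
    case ABI_AMD64:
        /* return address, then arguments 6 and up */
        return (sizeof (uint64_t) * (n - 5));
    case ABI_SPARC32:
        return (sizeof (uint32_t) * (n + 17));
    case ABI_SPARC64:
        /* 0x7ff is the SPARC V9 stack bias */
        return (0x7ff + sizeof (uint64_t) * (n + 16));
    }
    return (0);
}

/* Join the two 32-bit halves of a 64-bit argument on a 32-bit ABI. */
static uint64_t
join_halves(uint32_t hi, uint32_t lo)
{
    return (((uint64_t)hi << 32) | lo);
}
```

For the arg15 example that started this post, the sketch gives offsets of 64 bytes on x86, 80 on AMD64, 128 on 32-bit SPARC and 2295 on 64-bit SPARC. When joining halves, the lower-numbered argument is the high word on SPARC and the low word on x86.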



DTrace is open

It's a pretty exciting day for the DTrace team as our code is the first part of Solaris to be released under the CDDL. I thought I'd take the opportunity to end my blogging hiatus and to draw attention to some of my favorite corners of the code. Bryan has written an exhaustive overview of the code structure as well as some of his favorite parts of the code.

fasttrap

The biggest component of DTrace that I was wholly responsible for was the user-level tracing component. The pid provider (implemented as the 'fasttrap' kernel module for largely historical reasons) lets DTrace consumers trace function entry and return (as the fbt provider does for the kernel) as well as any individual instruction. It does this all losslessly, without stopping other threads and -- in fact -- without inducing any lock contention or serialization. (Check out the comments at the top of fasttrap.c, and the two fasttrap_isa.c's for extensive discussion.)

Here's the general technique employed by the pid provider: each traced instruction is first replaced with a trapping instruction. On sparc we use a ta (trap always) and on x86 (by which I mean i386 and amd64) we use an int3 (0xcc) (see the fasttrap_tracepoint_install() function in usr/src/uts/sparc/dtrace/fasttrap_isa.c and usr/src/uts/intel/dtrace/fasttrap_isa.c). Now any time a user-level thread executes this instruction it will bounce into the fasttrap module (on x86 this requires a little trickery because the int3 instruction is also used by debuggers to set breakpoints) and into the fasttrap_pid_probe() function (in both instances of fasttrap_isa.c). In fasttrap_pid_probe(), we look up the original instruction in fasttrap_tpoints -- a global hash table of tracepoints -- and call dtrace_probe() to invoke the DTrace framework. Here's what it looks like on i386 (fasttrap_isa.c):

    uintptr_t s0, s1, s2, s3, s4, s5;
    uint32_t *stack = (uint32_t *)rp->r_sp;

    /*
     * In 32-bit mode, all arguments are passed on the
     * stack. If this is a function entry probe, we need
     * to skip the first entry on the stack as it
     * represents the return address rather than a
     * parameter to the function.
     */
    s0 = fasttrap_fuword32_noerr(&stack[0]);
    s1 = fasttrap_fuword32_noerr(&stack[1]);
    s2 = fasttrap_fuword32_noerr(&stack[2]);
    s3 = fasttrap_fuword32_noerr(&stack[3]);
    s4 = fasttrap_fuword32_noerr(&stack[4]);
    s5 = fasttrap_fuword32_noerr(&stack[5]);

    for (id = tp->ftt_ids; id != NULL; id = id->fti_next) {
        fasttrap_probe_t *probe = id->fti_probe;

        if (probe->ftp_type == DTFTP_ENTRY) {
            /*
             * We note that this was an entry
             * probe to help ustack() find the
             * first caller.
             */
            cookie = dtrace_interrupt_disable();
            DTRACE_CPUFLAG_SET(CPU_DTRACE_ENTRY);
            dtrace_probe(probe->ftp_id, s1, s2,
                s3, s4, s5);
            DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_ENTRY);
            dtrace_interrupt_enable(cookie);
        } else if (probe->ftp_argmap == NULL) {
            dtrace_probe(probe->ftp_id, s0, s1,
                s2, s3, s4);
        } else {
            uint32_t t[5];

            fasttrap_usdt_args32(probe, rp,
                sizeof (t) / sizeof (t[0]), t);

            dtrace_probe(probe->ftp_id, t[0], t[1],
                t[2], t[3], t[4]);
        }
    }

Now that we've properly invoked the DTrace framework, we have to make sure the program does the right thing -- rather than executing the instruction it needed, we forced it to execute a trap instruction; obviously we can't just return control to the user-level thread without doing something. Of course, all we have to do at this point is emulate the original instruction -- but if you feel like writing a complete x86 emulator and running it in the kernel, be my guest. Instead, I did something much lazier. There are some instructions we do emulate -- the ones that are position-dependent like relative jumps, calls, etc. (see fasttrap_tracepoint_init() on both sparc and intel) -- but there are really only a few of these.
For the rest we use a technique called displaced execution.

As the name suggests, rather than executing the original instruction at its original address, we relocate it into some reserved scratch space in the user-level thread structure (obviously a per-thread entity). We then arrange to continue execution with what would have normally been the subsequent instruction. On x86 we just have a jmp to the next instruction; on sparc we make clever use of the delay slot and the %npc. I'd love to stick the code here, but there's really a lot of it; I suggest you open your favorite fasttrap_isa.c file and search for 'case FASTTRAP_T_COMMON', which is where I handle generic instructions using displaced execution (the other cases deal with the instructions that need to be emulated).

Just a quick time out to compare this to other tracing techniques. Every other technique that I'm aware of either has the potential for losing data or can serialize the process or induce lock contention. A technique that loses data is to just replace the trap with the original instruction, single-step and reinstall the trap; if another thread comes along in the meantime, it doesn't see the trap. Truss is a good example of a tracer that induces serialization; to avoid the lossiness problem, it stops all other threads in the process when it single-steps. By serializing the process's execution, you can't gather meaningful data about data races or lock contention (see usr/src/cmd/plockstat/), but, of course, with a lossy tracer, you can't really gather any meaningful data at all.

amd64 trickiness

While you're looking at the displaced execution code, I'd appreciate it if you'd spend some time looking at the code to deal with %rip-relative (program counter-relative) instructions on amd64.
The basic premise of displaced execution is that the number of instructions that we need to execute is relatively small, but I got a big surprise when it turned out that pretty much any instruction on amd64 could potentially depend on its address for correct execution.

With a little help from the in-kernel disassembler, we detect if the instruction is %rip-relative:

    469 if (p->p_model == DATAMODEL_LP64 && tp->ftt_type == FASTTRAP_T_COMMON) {
    470     /*
    471      * If the process is 64-bit and the instruction type is still
    472      * FASTTRAP_T_COMMON -- meaning we're going to copy it out an
    473      * execute it -- we need to watch for %rip-relative
    474      * addressing mode. See the portion of fasttrap_pid_probe()
    475      * below where we handle tracepoints with type
    476      * FASTTRAP_T_COMMON for how we emulate instructions that
    477      * employ %rip-relative addressing.
    478      */
    479     if (rmindex != -1) {
    480         uint_t mod = FASTTRAP_MODRM_MOD(instr[rmindex]);
    481         uint_t reg = FASTTRAP_MODRM_REG(instr[rmindex]);
    482         uint_t rm = FASTTRAP_MODRM_RM(instr[rmindex]);
    483
    484         ASSERT(rmindex > start);
    485
    486         if (mod == 0 && rm == 5) {
    487             /*
    488              * We need to be sure to avoid other
    489              * registers used by this instruction. While
    490              * the reg field may determine the op code
    491              * rather than denoting a register, assuming
    492              * that it denotes a register is always safe.
    493              * We leave the REX field intact and use
    494              * whatever value's there for simplicity.
    495              */
    496             if (reg != 0) {
    497                 tp->ftt_ripmode = FASTTRAP_RIP_1 |
    498                     (FASTTRAP_RIP_X *
    499                     FASTTRAP_REX_B(rex));
    500                 rm = 0;
    501             } else {
    502                 tp->ftt_ripmode = FASTTRAP_RIP_2 |
    503                     (FASTTRAP_RIP_X *
    504                     FASTTRAP_REX_B(rex));
    505                 rm = 1;
    506             }
    507
    508             tp->ftt_modrm = tp->ftt_instr[rmindex];
    509             tp->ftt_instr[rmindex] =
    510                 FASTTRAP_MODRM(2, reg, rm);
    511         }
    512     }
    513 }

Note that we've changed the instruction at line 509 to depend on %rax (or %r8) rather than %rip.
When we hit that tracepoint, we move what would have normally been the %rip value into %rax (or %r8), and make sure to reset the value of %rax (or %r8) when we're done. Actually, as you might have noticed from the code above, it's a little more complicated because we want to avoid using %rax if the instruction already uses that register, but I'm sure you can figure it out from the code :-).

more

I was going to write more about the pid provider, user-level statically defined tracing (USDT), the plockstat provider and command, and some (I think) clever parts about user/kernel interactions, but this is already much longer than what I assume is the average attention span. More later.
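
For readers who want to poke at the %rip-relative detection without opening fasttrap_isa.c, here is a minimal sketch of the ModRM test and rewrite discussed above. The macro and function names (MODRM_MOD, modrm_rewrite_rip and so on) are made up for this illustration and are not the FASTTRAP_* definitions; the sketch also ignores the REX prefix and the ftt_ripmode bookkeeping entirely. It relies only on the standard ModRM layout (2 bits of mod, 3 of reg, 3 of rm), where mod 0 with rm 5 means "disp32 relative to %rip" in 64-bit mode, and rm 0 and rm 1 select %rax and %rcx when REX.B is clear.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for the ModRM byte: mod (2 bits), reg (3), rm (3). */
#define MODRM_MOD(b)    (((b) >> 6) & 0x3)
#define MODRM_REG(b)    (((b) >> 3) & 0x7)
#define MODRM_RM(b)     ((b) & 0x7)
#define MODRM(mod, reg, rm) \
    ((uint8_t)((((mod) & 0x3) << 6) | (((reg) & 0x7) << 3) | ((rm) & 0x7)))

/* In 64-bit mode, mod == 0 with rm == 5 selects %rip + disp32 addressing. */
static int
modrm_is_rip_relative(uint8_t modrm)
{
    return (MODRM_MOD(modrm) == 0 && MODRM_RM(modrm) == 5);
}

/*
 * Rewrite a %rip-relative ModRM byte into "scratch register + disp32"
 * form (mod == 2). The scratch register is rm 0 unless the reg field is
 * already 0, in which case rm 1 is used so the two never collide.
 * Bytes that are not %rip-relative are returned unchanged.
 */
static uint8_t
modrm_rewrite_rip(uint8_t modrm)
{
    uint8_t reg, rm, out;

    if (!modrm_is_rip_relative(modrm))
        return (modrm);

    reg = MODRM_REG(modrm);
    rm = (reg != 0) ? 0 : 1;
    out = MODRM(2, reg, rm);

    /* The reg field must survive the rewrite untouched. */
    assert(MODRM_REG(out) == reg);
    return (out);
}
```

Before the displaced copy runs, the scratch register would have to be loaded with the address the original instruction expected to see in %rip, and restored afterwards; that part lives in the kernel code and is not modeled here.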



Solaris 10 in the news: week 1

It's been just over a week since we officially launched Solaris 10 and the reactions from the press have been all over the map. Consider these two declarations:

"Not many open source aficionados will realize the impact, but by making Solaris 10 free and capable of operating on any kind of hardware, Sun is making a coup in the server market.... As a result Linux will probably not grow much beyond its current market share of about 10% leaving Red Hat and especially Novell with a big problem." (link)

"Sun's announcement of the launch of free open source Solaris 10 has garnered a mild positive response from the investment community. However it also has raised a lot of skepticism from the technical community. The current plans are not good enough." (link)

In case you didn't notice the URLs: those are from the same guy, posted about 20ns apart. Why the sudden reversal? slashdot! Yes, the Linus-loyal quickly rose to Linux's defence citing the multitude of reasons Open Solaris will fail.

It will be a while before anyone knows the relative success or failure of Open Solaris -- we're not even sure of the terms of the license -- so this is the last challenge I'm going to rise to. But look: the Linux that exists today is a direct result, literally and philosophically, of its community. And while Solaris has so far been developed by a closed community, it also reflects the timbre of that community. For example: Linux adheres to a spartan philosophy of debugging help -- ergo no crash dumps; Solaris takes a different tack -- we require projects to come with debugger commands to examine crash dumps. It's the community that defines the project and, as is evident, it's the community that defends it, but don't discount the very real Solaris community.

The dumbest slight is against Solaris's hardware compatibility: do you think Linux always ran on as much stuff as it does now? Of course not. A month ago, did Solaris run on the (sick) new laptop I'm typing this on? Nope. The hardware these two operating systems support doesn't define their communities, but rather the reverse. When people can hack up Solaris to work on their gear, they will, just as they've been doing on Linux for years. I can't wait for the license to be announced, for opensolaris.org to open for bidness, and to have people start contributing code from outside of Sun. Does anyone really doubt that these things will happen? The only question is if Open Solaris will take off -- wait and see.



Baby's first DTrace

At the Solaris 10 launch on Monday I was talking to a sysadmin about DTrace. He was clearly very excited about it -- finally he could end a fight between the database guys and the appserver guys about whose stuff was to blame -- but he had one reservation: Where do I start? Since DTrace lets you look at almost anything on the system, it can be hard to know the first thing to look at. Here's what I told him:

start with the tools you know

You've probably used truss(1) or mpstat(1M) or prstat(1) or iostat(1M) or whatever. They give you a static view of what's happening on the system -- static in that you can't get any more, you can't get any other degree of detail, and you can't dive deeper. So start from those points, and go deeper. Each statistic in those observability tools has at least one associated probe in DTrace. If you're looking at mpstat(1M) output, maybe cross-calls (xcal) are high, or spins on mutexes (smtx) are high. You don't have to guess anymore; you can actually drill down and figure out what application or what user or what zone they correspond to by enabling their corresponding DTrace probes (sysinfo:::xcalls and lockstat:::*-spin respectively) and tracing the data you want.

figure out what functions are being called

When you're trying to optimize an application, it helps to know where the app is spending its time. A simple DTrace invocation like this:

    # dtrace -n 'pid$target:::entry{ @[probefunc] = count() }' -p <process-id>

can give you a coarse idea of where you're spending time. When you do this, a lot of it will make sense, but some of it will probably be a surprise: "Why am I calling malloc(3C) a bazillion times?" So find those aberrant cases and figure out what's going on: "OK, how much are we allocating each time?" (dtrace -n 'pid$target::malloc:entry{ @ = quantize(arg0) }' -p <process-id>).

look for lock contention

In multi-threaded apps, lock contention can be a huge performance killer. Run the new plockstat(1) command to see if your app suffers from lock contention. If it does, you'll see long spin and contention times. These are pretty easy problems to solve, but if you can't track down the source of the problem, plockstat -- of course -- lets you dig deeper by using the plockstat provider.

Those are a few places I've started from in the past, but, of course, every application is different. DTrace isn't meant to supplant your knowledge about your app and the system at large; rather, it should complement it and let you do more with what you already know.
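
The quantize(arg0) aggregation used in the malloc one-liner groups values into power-of-two buckets, which is why its output rows are labelled 1, 2, 4, 8 and so on. The following is a rough C sketch of that bucketing idea only, not DTrace's implementation: the names pow2_bucket and pow2_record are invented here, and as a simplification the sketch handles only non-negative values and folds 0 into the first bucket.

```c
#include <assert.h>
#include <stdint.h>

#define NBUCKETS 64

/*
 * Return the index i of the power-of-two bucket covering value, where
 * bucket i spans [2^i, 2^(i+1)). As a simplification for this sketch,
 * the value 0 lands in bucket 0 alongside the value 1.
 */
static unsigned
pow2_bucket(uint64_t value)
{
    unsigned i = 0;

    while (value > 1) {
        value >>= 1;
        i++;
    }
    return (i);
}

/* Count one observation, e.g. one malloc() request size. */
static void
pow2_record(uint64_t counts[NBUCKETS], uint64_t value)
{
    unsigned i = pow2_bucket(value);

    assert(i < NBUCKETS);
    counts[i]++;
}
```

Feeding every allocation size through pow2_record() and printing the non-empty buckets gives the same kind of picture as the quantize() output: a quick read on whether the app is making lots of tiny allocations or a few huge ones.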



Solaris 10 Launch

I was at the Solaris 10 launch for most of Monday, and it was a pretty fantastic day for everyone working on Solaris 10. I spent about two hours helping to answer questions in an online chat -- here's the transcript -- about Solaris 10 in what was dubbed a webchat sweatshop. There were a bunch of us from the Solaris group as well as Scott and some other execs all huddled around laptops while the HR folks beat the drum at a slow, but steady pace: "These answers have to be on the streets of Hong Kong by morning."

I spent the rest of the afternoon talking to customers and the press -- mostly about DTrace -- and the exciting thing was that, even more than before, they're getting it and they're excited about Solaris. One of the most indelible moments was when a group of us from the kernel group were talking to Jem Matzan of The Jem Report and he challenged the claim that Solaris 10 was the most innovative operating system ever. As we all painted the picture of Solaris 10, I realized that this was true. Solaris 10 isn't just a random collection of neat crap, but it comes together as a cohesive whole that innovates in each and every place an operating system can innovate. Maybe that's a stretch, but it's not far off.



DTrace time

The other day I spent some time with a customer tuning their app using DTrace. I hadn't done much work on apps outside of Sun, so I was excited to see what I could do with DTrace on other people's code. I'm certainly not the first person to use DTrace on a real-world app (I'm probably the 10,000th), so I had some expectations to live up to.

The app basically processes transactions: a message comes in, the app does some work and sends out another message. There's a lot of computation for each message, but also a lot of bookkeeping and busy work receiving, processing and generating the messages.

When sprintf(3C) attacks...

One of the first things we looked at was what functions the app calls the most. Nothing fancy; just a DTrace one-liner:

    # dtrace -n pid123:::entry'{ @[probefunc] = count(); }'

Most of it made sense, but one thing we noticed was that sprintf(3C) was getting called a ton, so the question was "what are those sprintfs doing?" Another one-liner had the answer:

    # dtrace -n pid123::sprintf:entry'{ @[copyinstr(arg0)] = count(); }'

There were about four different format strings being used, two of them fairly complex and then these two: "%ld" and "%f". In about 5 seconds, the app was calling sprintf("%ld", ...) several thousand times. It turns out that each transaction contains an identifier -- represented by a decimal integer -- so this wasn't unexpected per se, but we speculated that using lltostr(3C) or hand rolling an int->string function might yield better results.

I had just assumed that lltostr(3C) would perform better -- it's a specialized function that doesn't involve all the (extensive and messy) machinery of sprintf(3C). I wrote a little microbenchmark that just converted the number 1000000 to a string with both functions a million times and ran it on an x86 machine; the results were surprising:

    $ ./test
    sprintf(3C) took 272512920ns
    lltostr(3C) took 523507925ns

What? I checked my test, made sure I was compiling everything properly, had called the function once before I started timing (to avoid any first-call overhead from the dynamic linker), but the results were dead repeatable: lltostr(3C) was about half as fast. I looked at the implementation and while I can't post it here (yet -- I can't wait for OpenSolaris), suffice it to say that it did the obvious thing. The strange thing was that sprintf(3C) had basically the same algorithm. Just for kicks, I decided to build it amd64 native and run it on the same opteron box; here were the results:

    sprintf(3C) took 140706282ns
    lltostr(3C) took 38804963ns

Ah, much better (I love opteron). It turns out the problem was that we were doing 64-bit math in 32-bit mode -- hence ell-ell-to-str -- and that is slooooooow. Luckily the app we were looking at was compiled 64-bit native so it wouldn't have had this problem, but there are still plenty of 32-bit apps out there that shouldn't have to pay the 64-bit math tax in this case. I made a new version of lltostr(3C) that checks the top 32 bits of the 64-bit input value and does 32-bit math if those bits are clear. Here's how that performed (on 32-bit x86):

    sprintf(3C) took 251953795ns
    lltostr(3C) took 459720586ns
    new lltostr took 32907444ns

Much better. For random numbers between 2^32 and 2^63-1 the change hurt performance by about 1-2%, but it's probably worth the hit for the 1300% improvement with smaller numbers.

In any case, that was just a long way of saying that for those of you using lltostr(3C) or sprintf("%ld"), there are wins to be had.

Timing transactions

Our first use of DTrace was really to discover things that the developers of that app didn't know about. This sprintf(3C) stuff was one issue, and there were a couple of others, but, on the whole, the app worked as intended. And that's actually no small feat -- many many programs are doing lots of work that wasn't really intended by the developers or spend their CPU time in places that are completely surprising to the folks who wrote them. The next thing the customers wanted to find was the source of certain latency bubbles. So we used DTrace to time each transaction as it went through multiple processes on the system. While each step was fairly simple, the sum total was by far the largest D script I had written, and in the end we were able to record the time a transaction spent over each leg of the trip through the system.

This is something that could have been done without DTrace, but it would have involved modifying the app, and streaming the data out to some huge file to be post-processed later. Not only is that a huge pain in the ass, it can also have a huge performance impact and is inflexible in terms of the data gathered. With DTrace, the performance impact can be mitigated by instrumenting less, and you can gather arbitrary data so when you get the answer to your first question, you don't have to rebuild and rerun the app to dive deeper to your next question.

I had been telling people for a while about the virtues of DTrace on real-world applications. I was happy to see first hand that they were all true, and -- perhaps more importantly -- I convinced the customer.
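
The lltostr(3C) tweak described in this post is easy to illustrate. What follows is a sketch under assumptions, not the libc source: the function names u64_to_str and u64_to_cstr are invented, the value is treated as unsigned, and instead of a single up-front check of the top 32 bits this variant keeps doing 64-bit division only while the upper 32 bits are set and then drops to 32-bit math for the rest. Like lltostr(3C), u64_to_str() writes the digits backwards from the end of the buffer and returns a pointer to the first digit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the libc implementation): write the decimal
 * digits of value so that they end just before endptr, and return a
 * pointer to the first digit. No terminator is written here.
 */
static char *
u64_to_str(uint64_t value, char *endptr)
{
    char *p = endptr;
    uint32_t v32;

    assert(endptr != NULL);

    if (value == 0) {
        *--p = '0';
        return (p);
    }

    /* Slow path: 64-bit divide and modulo while the top 32 bits are set. */
    while ((value >> 32) != 0) {
        *--p = (char)('0' + (value % 10));
        value /= 10;
    }

    /* Fast path: what's left fits in 32 bits, so use 32-bit math. */
    v32 = (uint32_t)value;
    while (v32 != 0) {
        *--p = (char)('0' + (v32 % 10));
        v32 /= 10;
    }
    return (p);
}

/*
 * Convenience wrapper: NUL-terminated result in a caller-supplied buffer
 * of 21 bytes (20 digits for UINT64_MAX plus the terminator).
 */
static const char *
u64_to_cstr(uint64_t value, char buf[21])
{
    buf[20] = '\0';
    return (u64_to_str(value, &buf[20]));
}
```

On a 64-bit build the two loops cost about the same; the split only pays off where 64-bit division is emulated in software, which is the 32-bit x86 case measured above.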



breaking with tradition

In this weblog, I've tried to stick to the facts, talk about the things I know about DTrace, Solaris and the industry, and not stray into the excruciating minutiae of the rest of my life. But:

The Red Sox Are Going To The World Series!!

Post World Series Update:

It has been an amazing and unexpected elation I've carried with me this past week since the Sox won their first World Series in my and my father's lifetimes. My grandfather was born in 1919 -- he would have loved to have seen this. Some of my earliest memories are of watching the Red Sox on my dad's lap, and the 1986 World Series is the only time I can remember him shedding a tear. When the Sox were down 3-0 to the hated Yankees in the ALCS, I was crushed. I pledged not to watch games 4 and 5 because I was so emotionally invested that watching them lose would be too painful and watching them win would just be a reminder of the impossibly high mountain they still had left to climb. But I didn't live up to my pledge, and Ortiz won games 4 and 5 in dramatic fashion (with some help from Dave Roberts). When Lowe pitched a brilliant game 7 to seal their historic comeback, I was amazed and delighted (and that's to say nothing of the heroic efforts of Schilling and Pedro), but nothing compared to that moment when Foulke underhanded Renteria's grounder to Doug Mientkiewicz and the entirety of Red Sox nation began the celebration.



more on gcore

Trawling through b.s.c I noticed Fintan Ryan talking about gcore(1), and I realized that I hadn't sufficiently promoted this cool utility. As part of my work adding variable core file content, I rewrote gcore from scratch (it used to be a real pile) to add a few new features and to make it use libproc (i.e. make it slightly less of a pile).

You use gcore to take a core dump of a live running process without actually causing the process to crash. It's not completely uninvasive because gcore stops the process you're taking the core of to ensure a consistent snapshot, but unless the process is huge or it's really cranky about timing, the perturbation isn't noticeable. There are a lot of places where taking a snapshot with gcore is plenty useful. Let's say a process is behaving strangely, but you can't attach a debugger because you don't want to take down the service, or you want to have a core file to send to someone who can debug it when you yourself can't -- gcore is perfect. I use it to take cores of mozilla when it's chugging away on the processor, but not making any visible progress.

I mentioned that big processes can take a while to gcore -- not surprising because we have to dump that whole image out to disk. One of the cool uses of variable core file content is the ability to take faster core dumps by only dumping the sections you care about. Let's say there's some big ISM segment or a big shared memory segment: exclude it and gcore will go faster:

    hedge /home/ahl -> gcore -c default-ism 256755
    gcore: core.256755 dumped

Pretty handy, but the coolest use I've been making of gcore lately is by mixing it with DTrace and the new(ish) system() action. This script snapshots my process once every ten seconds and names the files according to the time they were produced:

    # cat gcore.d
    #pragma D option destructive
    #pragma D option quiet

    tick-10s
    {
            doit = 1;
    }

    syscall:::
    /doit && pid == $1/
    {
            stop();
            system("gcore -o core.%%t %d", pid);
            system("prun %d", pid);
            doit = 0;
    }

    # dtrace -s gcore.d 256755
    gcore: core.1097724567.256755 dumped
    gcore: core.1097724577.256755 dumped
    gcore: core.1097724600.256755 dumped
    ^C

WARNING! When you specify destructive in DTrace, it means destructive. The system() and stop() actions can be absolutely brutal (I've rendered at least one machine unusable by my indelicate use of that Ramirez-Ortiz-ian one-two combo). That said, if you screw something up, you can break into the debugger and set dtrace_destructive_disallow to 1.

OK, so be careful, but that script can give you some pretty neat results. Maybe you have some application that seems to be taking a turn for the worse around 2 a.m. -- put together a DTrace script that detects the problem and use gcore to take a snapshot so you can figure out what was going on when you get to the office in the morning. Take a couple of snapshots to see how things are changing. You do like debugging from core dumps, right?



back from the DTrace road show

A number of factors have conspired to keep me away from blogging, not the least of which being that I've been on a coast-to-coast DTrace road show. Now that I'm back, I've got some news to report from the road.

Step right up!

At times it felt a bit like a medicine road show: "Step right up to see the amazing DTraaaaace! The demystifying optimizing tantalizing reenergizing tracing frameworrrrrrrk!" I stopped in the Midwest and D.C. (Eric helped during that leg as my fellow huckster), then I went back to San Francisco and then back across the country to New York for the big Wall Street to-do.

I admit that it got a little repetitive -- at one point I ran into A/V problems and was able to run through the presentation from memory -- but the people I talked to and their questions kept it interesting.

Can DTrace do X?

I was impressed by the number of people who had not only heard of DTrace, but who had already started playing around with it. It used to be that the questions were all of the form "Can DTrace do X?" On this trip, more and more I was asked about specific things people were trying to accomplish with DTrace and told about problems they'd found using DTrace. I thought I'd repeat some of the best questions and insights from the trip:

I'm sure this won't come as a huge shock to anyone who's tried to heft a printed copy of the DTrace docs, but some people were a little daunted by the size and complexity of DTrace. To address that, I'd call on an analogy to perl -- no one picks up the O'Reilly books on perl, reads them cover to cover and declares mastery of perl. The way most people learn perl is by calling on the knowledge they have from similar languages, finding examples that make sense and then modifying them. When they need to do something just beyond their grasp, they go back to the documentation and find the nugget they need. We've tried to follow a similar model with DTrace -- find an example that makes sense and work from there; call on your knowledge of perl, java, c, or whatever, and see how it can apply to D. We've tried to design DTrace so things pretty much just work as you (a programmer, sysadmin, whatever) would expect them to, and so that a little time invested with the documentation goes a long way. Seeing too much data? Predicates are easy. Still not getting just the data you want? Spend fifteen minutes to wrap your head around speculations.

And speaking of perl, a lot of people asked about DTrace's visibility into perl. Right now the only non-natively executed language DTrace lets you observe is java, but now that we realize how much need there is for visibility into perl, we're going to be working aggressively on making DTrace work well with perl. We've got some neat ideas, but if there are things you'd like to see with DTrace and perl, we'd love to hear about it as we think about what sorts of problems we need to solve.

A lot of people in industry use tools like Intel's VTune and IBM's Purify and Quantify, so I got a lot of questions about how DTrace compares to those tools. Which led to the inevitable question of "Where's the GUI?" First, DTrace by definition can do more than those tools, even discounting its systemic scope, simply because users can customize their tracing with D. VTune, Purify, Quantify and other tools present a fairly static view, and I'm sure that users of those tools have always had just one more question, one next step that those tools couldn't answer. Because DTrace doesn't present a canned, static view, it's not so clear what kind of GUI you'd want. Clearly, it's not just a pull-down menu with 40,000 probes to choose from, so we're actively working on ways to engage the eyes and the visual cortex, but without strapping DTrace into a static framework bounded by the same constraints as those traditional tools.

Back

Whew. It was a good trip, though a bit exhausting. I think I convinced a bunch of people about the utility of DTrace in general, but also for their specific problems. But I also learned about some of DTrace's shortcomings, which we are now working to address. It's good to be back to coding -- I'm putting the finishing touches on DTrace's AMD64 support, which has been a lot of fun. In the next few weeks I'll be writing about the work going on in the kernel group as we put the final coat of polish on Solaris 10 as it gets ready for its release.



a new view into software

As Bryan has observed in the past, software has a quality unique among engineering disciplines in that you can build it, but you can't see it. DTrace changes that by opening windows into parts of the system that were previously unobservable, and it does so in a way that minimally changes what you're attempting to observe -- this software "uncertainty principle" has limited the utility of previous observability tools. One of the darkest areas of debugging in user-land has been around lock contention.

In multi-threaded programs synchronization primitives -- mutexes, R/W locks, semaphores, etc. -- are required to coordinate each thread's efforts and make sure shared data is accessed safely. If many threads are kept waiting while another thread owns a synchronization primitive, the program is said to suffer from lock contention. In the kernel, we've had lockstat(1M) for many years, but in user-land, the techniques for observing lock behavior and sorting out the cause of contention, or even its presence, have been very ad hoc.

the plockstat provider

I just finished work on the plockstat provider for DTrace as well as a new plockstat(1M) command for observing user-land synchronization objects. If you're unfamiliar with DTrace, you might want to take a quick look at the Solaris Dynamic Tracing Guide (look through it for some examples); that will help ground some of this explanation.

The plockstat provider has these probes:

    mutex-acquire   fires when a mutex is acquired
    mutex-release   fires when a mutex is released
    mutex-block     fires when a thread blocks waiting for a mutex
    mutex-spin      fires when a thread spins waiting for a mutex
    rw-acquire      fires when an R/W lock is acquired
    rw-release      fires when an R/W lock is released
    rw-block        fires when a thread blocks waiting for an R/W lock

It's possible with other tools to observe these points, but -- as anyone who's tried it can attest -- other tools can alter the effects you're trying to observe. Traditional debuggers can effectively serialize your parallel program, removing any trace of the lock contention you'd see during a normal run. DTrace and the plockstat provider eliminate this problem.

With the plockstat provider you can answer questions that were previously very difficult to answer, such as "where is my program blocked on mutexes":

    bash-2.05b# dtrace -n plockstat1173:::mutex-block'{ @[ustack()] = count() }'
    dtrace: description 'plockstat1173:::mutex-block' matched 2 probes
    ^C

                  libc.so.1`mutex_lock_queue+0xa9
                  libc.so.1`slow_lock+0x3d
                  libc.so.1`mutex_lock_impl+0xec
                  libc.so.1`mutex_lock+0x38
                  libnspr4.so`PR_Lock+0x1a
                  libnspr4.so`PR_EnterMonitor+0x35
                  libxpcom.so`__1cGnsPipePGetWriteSegment6MrpcrI_I_+0x3e
                  libxpcom.so`__1cSnsPipeOutputStreamNWriteSegments6MpFpnPnsIOutputStream_pvpcIIpI_I3I5_I_+0x4f
                  c4654d3c
                  libxpcom.so`__1cUnsThreadPoolRunnableDRun6M_I_+0xb0
                  libxpcom.so`__1cInsThreadEMain6Fpv_v_+0x32
                  c4ec1d6a
                  libc.so.1`_thr_setup+0x50
                  libc.so.1`_lwp_start
                    1

(any guesses as to what program this might be?)

Not just a new view for DTrace, but a new view for user-land.

the plockstat(1M) command

DTrace is an incredibly powerful tool, but some tasks are so common that we want to make it as easy as possible to use DTrace's facilities without knowing anything about DTrace. The plockstat(1M) command wraps up a bunch of knowledge about lock contention in a neat and easy to use package:

    # plockstat -s 10 -A -p `pgrep locker`
    ^C
    Mutex block
    -------------------------------------------------------------------------------
    Count     nsec Lock                         Caller
       13 22040260 locker`lock1                 locker`go_lock+0x47

          nsec ---- Time Distribution --- count Stack
         65536 |@@@@@@@@@@@@@@          |     8 libc.so.1`mutex_lock+0x38
        131072 |                        |     0 locker`go_lock+0x47
        262144 |@@@@@                   |     3 libc.so.1`_thr_setup+0x50
        524288 |                        |     0 libc.so.1`_lwp_start
       1048576 |                        |     0
       2097152 |                        |     0
       4194304 |                        |     0
       8388608 |                        |     0
      16777216 |@                       |     1
      33554432 |                        |     0
      67108864 |                        |     0
     134217728 |                        |     0
     268435456 |@                       |     1
    ...

This has been a bit of a teaser. I only integrated plockstat into Solaris 10 yesterday and it will be a few weeks before you can access plockstat as part of the Solaris Express program, but keep an eye on the DTrace Solaris Express Schedule.
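
To have something to point plockstat at, a deliberately contended program is handy. The source of the locker program in the output above isn't shown in this post, so the following is only a guess at its general shape: the names lock1 and go_lock are borrowed from the plockstat output, while NTHREADS, NITERS and run_lockers() are invented for this sketch. Several threads fight over a single mutex to bump a shared counter.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4
#define NITERS   100000

/* One lock shared by every thread, so contention is all but guaranteed. */
static pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;
static long counter;

/* Thread body: take the lock, bump the shared counter, drop the lock. */
static void *
go_lock(void *arg)
{
    int i;

    (void) arg;
    for (i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock1);
        counter++;
        pthread_mutex_unlock(&lock1);
    }
    return (NULL);
}

/* Start the threads, wait for them all, and return the final count. */
static long
run_lockers(void)
{
    pthread_t tids[NTHREADS];
    int i, err;

    counter = 0;
    for (i = 0; i < NTHREADS; i++) {
        err = pthread_create(&tids[i], NULL, go_lock, NULL);
        assert(err == 0);
    }
    for (i = 0; i < NTHREADS; i++) {
        err = pthread_join(tids[i], NULL);
        assert(err == 0);
    }
    return (counter);
}
```

Calling run_lockers() in a loop from main() keeps the process alive long enough to attach to; because every increment happens under lock1, the returned count is always NTHREADS * NITERS, and the time threads spend waiting for that lock is what the mutex-block and mutex-spin probes would report.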
