Recent Posts


Multiple pools in 2010.Q1

When the Sun Storage 7000 was first introduced, a key design decision was to allow only a single ZFS storage pool per host. This forced users to take full advantage of the ZFS pooled storage model, and prevented them from adopting ludicrous schemes such as "one pool per filesystem." While RAID-Z has non-trivial performance implications for IOPS-bound workloads, the hope was that by allowing logzilla and readzilla devices to be configured per-filesystem, users could adjust relative performance and implement different qualities of service within a single pool.

While this works for the majority of workloads, there are still some that benefit from mirrored performance even in the presence of cache and log devices. As the maximum size of Sun Storage 7000 systems increased, it became apparent that we needed a way to allow pools with different RAS and performance characteristics in the same system. With this in mind, we relaxed the "one pool per system" rule[1] in the 2010.Q1 release.

The storage configuration user experience is relatively unchanged. Instead of having a single pool (or two pools in a cluster) and being able to configure one or the other, you can simply click the '+' button and add pools as needed. When creating a pool, you can now specify a name for it. When importing a pool, you can either accept the existing name or give it a new one at the time you select the pool. Ownership of pools in a cluster is now managed exclusively through the Configuration -> Cluster screen, as with other shared resources.

When managing shares, there is a new dropdown menu at the top left of the navigation bar that controls which shares are shown in the UI. In the CLI, the equivalent setting is the 'pool' property at the 'shares' node.

While this gives some flexibility in storage configuration, it also allows users to create poorly constructed storage topologies. The intent is to allow the user to create pools with different RAS and performance characteristics, not to create dozens of different pools with the same properties. If you attempt the latter, the UI will present a warning summarizing the drawbacks of continuing:

- Wastes system resources that could be shared in a single pool.
- Decreases overall performance.
- Increases administrative complexity.
- Log and cache devices can be enabled on a per-share basis.

You can still commit the operation, but such configurations are discouraged. The exception is when configuring a second pool on one head in a cluster.

We hope this feature will allow users to continue to consolidate storage and expand use of the Sun Storage 7000 series in more complicated environments.

[1] Clever users figured out that this mechanism could be circumvented in a cluster to have two pools active on the same host in an active/passive configuration.
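The warning itself is simple to model. Here is a hypothetical Python sketch of the check (the names and structure are illustrative only; the appliance's actual validation code is not shown here):

```python
# Hypothetical sketch: warn when a new pool would duplicate the redundancy
# profile of an existing pool. Illustrative only, not the appliance's code.

def pool_creation_warnings(existing_profiles, new_profile):
    """Return the warnings to display before committing a new pool."""
    warnings = []
    if new_profile in existing_profiles:
        warnings += [
            "Wastes system resources that could be shared in a single pool.",
            "Decreases overall performance.",
            "Increases administrative complexity.",
            "Log and cache devices can be enabled on a per-share basis.",
        ]
    return warnings

# A second mirrored pool next to an existing mirrored pool draws the warning;
# a mirrored pool next to a RAID-Z2 pool does not.
assert len(pool_creation_warnings(["mirror"], "mirror")) == 4
assert pool_creation_warnings(["raidz2"], "mirror") == []
```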



Shadow Migration Internals

In my previous entry, I described the overall architecture of shadow migration. This post dives into the details of how it's actually implemented, and the motivation behind some of the original design decisions.

VFS interposition

A very early desire was for something that could migrate data from many different sources. And while ZFS is the primary filesystem for Solaris, we also wanted to allow for arbitrary local targets. For this reason, the shadow migration infrastructure is implemented entirely at the VFS (Virtual FileSystem) layer. At the kernel level, there is a new 'shadow' mount option, whose value is the path to another filesystem on the system. The kernel has no notion of whether a source filesystem is local or remote, and doesn't differentiate between synchronous access and background migration. Any filesystem access, whether local or over some other protocol (CIFS, NFS, etc.), uses the VFS interfaces and is therefore fed through our interposition layer.

The only other work the kernel does when mounting a shadow filesystem is to check whether the root directory is empty. If it is, we create a blank SUNWshadow extended attribute on the root directory. Once set, this will trigger all subsequent migration as long as the filesystem is always mounted with the 'shadow' option. Each VFS operation first checks whether the filesystem is shadowing another (a quick check), and then whether the file or directory has the SUNWshadow attribute set (slightly more expensive, but cached with the vnode). If the attribute isn't present, we fall through to the local filesystem. Otherwise, we migrate the file or directory and then fall through to the local filesystem.

Migration of directories

In order to migrate a directory, we have to migrate all of its entries.
When migrating an entry for a file, we don't want to migrate the complete contents until the file is accessed, but we do need to migrate enough metadata that access control can be enforced. We start by opening the directory on the remote side whose relative path is indicated by the SUNWshadow attribute. For each directory entry, we create a sparse file with the appropriate ownership, ACLs, system attributes, and extended attributes. Once an entry has been migrated, we set a SUNWshadow attribute on it that is the same as the parent's but with "/name" appended, where "name" is the directory entry name. This attribute always represents the relative path of the unmigrated entity on the source, which allows files and directories to be arbitrarily renamed without losing track of where they are located on the source. It also allows the source to change (i.e. be restored to a different host) if needed. Note that there are also types of files (symlinks, devices, etc.) that have no contents, in which case we simply migrate the entire object at once.

Once the directory has been completely migrated, we remove its SUNWshadow attribute so that future accesses all use the native filesystem. If the process is interrupted (system reset, etc.), the attribute will still be on the parent directory, so we will migrate it again when the user (or background process) next accesses it.

Migration of files

Migrating a plain file is much simpler. We use the SUNWshadow attribute to locate the file on the source, then read the source file and write the corresponding data to the local filesystem. In the current software version, this happens all at once, meaning the first access of a large file will have to wait for the entire file to be migrated. Future software versions will remove this limitation, migrating only enough data to satisfy the request and allowing concurrent accesses to the file.
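To make the control flow concrete, here is a simplified Python model of the per-access decision described above. This is a sketch of the logic only - the real code lives at the Solaris VFS layer in the kernel, and the helper names here are invented for illustration:

```python
# Simplified model of shadow migration's per-access logic. The real
# implementation is in-kernel at the VFS layer; this only mirrors its shape.

class Node:
    def __init__(self, shadow_path=None, is_dir=False):
        self.shadow = shadow_path   # relative path on the source, or None
        self.is_dir = is_dir
        self.children = {}

def access(node, read_source, migrate_entry):
    """Called on every access before falling through to the local filesystem."""
    if node.shadow is None:
        return                      # already migrated: purely local access
    if node.is_dir:
        # Migrate metadata for each entry; file contents stay on the source.
        for name in read_source(node.shadow):
            child = migrate_entry(node.shadow, name)
            # Each child records its own relative path on the source.
            child.shadow = node.shadow + "/" + name
            node.children[name] = child
    else:
        read_source(node.shadow)    # pull the full contents (current behavior)
    node.shadow = None              # attribute removed: future accesses native

# Accessing a shadowed directory migrates its entries and clears the attribute.
listing = {"export/home": ["alice"]}
root = Node("export/home", is_dir=True)
access(root, lambda p: listing.get(p, b"data"), lambda parent, name: Node())
assert root.shadow is None
assert root.children["alice"].shadow == "export/home/alice"
```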
Once all the data is migrated, we remove the SUNWshadow attribute, and future accesses go straight to the local filesystem.

Dealing with hard links

One issue we knew we'd have to deal with is hard links. Migrating a hard link requires that the same file reference appear in multiple locations within the filesystem. Obviously, we do not know every reference to a file in the source filesystem, so we need to migrate these links as we discover them. To do this, we have a special directory in the root of the filesystem where we can create files named by their source FID. A FID (file ID) is a unique identifier for a file within its filesystem. We create the file in this hard link directory with a name derived from its FID. Each time we encounter a file with a link count greater than 1, we look up the source FID in our special directory. If it exists, we create a link to that directory entry instead of migrating a new instance of the file. This way, it doesn't matter if files are moved around, removed from the local filesystem, or additional links created - we can always recreate a link to the original file. The one wrinkle is that we can migrate from nested source filesystems, so we also need to track the FSID (filesystem ID) which, while not persistent, can be stored in a table and reconstructed using source path information.

Completing migration

A key feature of the shadow migration design is that it treats all accesses the same and allows background migration of data to be driven from userland, where it's easier to control policy. The downside is that we need the ability to know when we have finished migrating every single file and directory on the source. Because the local filesystem is actively being modified while we traverse it, it's impossible to know whether you've visited every object based only on walking the directory hierarchy. To address this, we keep track of a "pending" list of files and directories with shadow attributes.
Every object with a shadow attribute must be present in this list, though the list can also contain objects without shadow attributes, or non-existent objects. This allows us to be synchronous when it counts (appending entries) and lazy when it doesn't (rewriting the list with entries removed). Most of the time, we'll find all the objects during traversal, and the pending list will contain no entries at all. In case we missed an object, we can issue an ioctl() to the kernel to do the work for us. When the list is empty, we know we are finished and can remove the shadow setting.

ZFS integration

The ability to specify the shadow mount option for arbitrary filesystems is useful, but also difficult to manage. It must be specified as an absolute path, meaning that the remote source of the mount must be tracked elsewhere and has to be mounted before the filesystem itself. To make this easier, a new 'shadow' property was added for ZFS filesystems. It can be set using an abstract URI syntax ("nfs://host/path"), and libzfs takes care of automatically mounting the shadow filesystem and passing the correct absolute path to the kernel. This way, the user can manage a semantically meaningful relationship without worrying about how the internal mechanisms are connected. It also allows us to expand the set of possible sources in the future in a compatible fashion.

Hopefully this provides a reasonable view into how exactly shadow migration works, and the design decisions behind it. The goal is to eventually have this available in Solaris, at which point all the gritty details should be available to the curious.
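For the curious, the completion-tracking invariant described above can be modeled in a few lines. This is a hypothetical sketch - the real pending list is an on-disk structure maintained by the kernel, not a Python list:

```python
# Sketch of the pending-list invariant: every object that still carries a
# shadow attribute appears in the list, while the list may also hold stale
# entries (already-migrated or deleted objects). Migration is complete
# exactly when no listed object still has the attribute.

def migration_complete(pending, has_shadow_attr):
    """pending: iterable of object ids; has_shadow_attr: id -> bool."""
    return not any(has_shadow_attr(obj) for obj in pending)

# Stale entries are harmless; only a live shadow attribute blocks completion.
shadowed = {"dirA"}
assert not migration_complete(["dirA", "gone"], lambda o: o in shadowed)
shadowed.clear()
assert migration_complete(["dirA", "gone"], lambda o: o in shadowed)
```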



User and Group Quotas in the 7000 Series

When ZFS was first developed, the engineering team had the notion that pooled storage would make filesystems cheap and plentiful, and we'd move away from the days of /export1, /export2, ad infinitum. From the ZFS perspective, they are cheap: it's very easy to create dozens or hundreds of filesystems, each of which functions as an administrative control point for various properties. However, we quickly found that other parts of the system start to break down once you get past 1,000 or 10,000 filesystems. Mounting and sharing filesystems takes longer, browsing datasets takes longer, and managing automount maps (for those without NFSv4 mirror mounts) quickly spirals out of control.

For most users this isn't a problem - a few hundred filesystems is more than enough to manage disparate projects and groups on a single machine. There was one class of users, however, for whom a few hundred filesystems wasn't enough: university and other home directory environments with 20,000 or more users, each of whom needed a quota to guarantee that they couldn't run amok on the system. The traditional ZFS solution - creating a filesystem for each user and assigning a quota - didn't scale. After thinking about it for a while, Matt developed a fairly simple architecture to provide this functionality without introducing pathological complexity into the bowels of ZFS. In build 114 of Solaris Nevada, he pushed the following:

PSARC 2009/204 ZFS user/group quotas & space accounting

This provides full support for user and group quotas on ZFS, as well as the ability to track usage on a per-user or per-group basis within a dataset. It was later integrated into the 2009.Q3 software release, with an additional UI layer. From the 'general' tab of a share, you can quickly query usage and set quotas for individual users or groups. The CLI allows for automated batch operations.
Requesting a single user or group is significantly faster than requesting all current usage, but you can also get a list of the current usage for a project or share. With integrated identity management, users and groups can be specified either by UNIX username or by Windows name.

There are some significant differences between user and group quotas and traditional ZFS quotas. The following is an excerpt from the on-line documentation on the subject:

- User and group quotas can only be applied to filesystems.
- User and group quotas are implemented using delayed enforcement. This means that users will be able to exceed their quota for a short period of time before data is written to disk. Once the data has been pushed to disk, the user will receive an error on new writes, just as with the filesystem-level quota case.
- User and group quotas are always enforced against referenced data. This means that snapshots do not affect any quotas, and a clone of a snapshot will consume the same amount of effective quota, even though the underlying blocks are shared.
- User and group reservations are not supported.
- User and group quotas, unlike data quotas, are stored with the regular filesystem data. This means that if the filesystem is out of space, you will not be able to make changes to user and group quotas. You must first make additional space available before modifying user and group quotas.
- User and group quotas are sent as part of any remote replication. It is up to the administrator to ensure that the name service environments are identical on the source and destination.
- NDMP backup and restore of an entire share will include any user or group quotas. Restores into an existing share will not affect any current quotas. (There is currently a bug preventing this from working in the initial release, which will be fixed in a subsequent minor release.)

This feature will hopefully allow the Sun Storage 7000 series to function in environments where it was previously impractical to do so.
Of course, the real people to thank are Matt and the ZFS team - it was a very small amount of work to provide an interface on top of the underlying ZFS infrastructure.
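The delayed-enforcement behavior described in the excerpt can be modeled roughly as follows. This is a simplified sketch under assumed semantics - in real ZFS the accounting happens at transaction-group sync time, and none of these class or method names exist in the actual code:

```python
# Rough model of delayed user-quota enforcement: writes are accepted until
# usage has been synced to disk; once a sync shows the user over quota,
# new writes fail. Illustrative only.

class UserQuota:
    def __init__(self, quota_bytes):
        self.quota = quota_bytes
        self.pending = 0        # dirty data not yet pushed to disk
        self.on_disk = 0        # referenced data, as of the last sync

    def write(self, nbytes):
        if self.on_disk > self.quota:
            raise OSError("EDQUOT: quota exceeded")  # enforced after sync
        self.pending += nbytes  # may briefly exceed the quota

    def sync(self):
        self.on_disk += self.pending
        self.pending = 0

q = UserQuota(quota_bytes=100)
q.write(150)        # allowed: enforcement is delayed until data hits disk
q.sync()
try:
    q.write(1)      # now rejected, as with a filesystem-level quota
    assert False, "expected EDQUOT"
except OSError:
    pass
```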



What is Shadow Migration?

In the Sun Storage 7000 2009.Q3 software release, one of the major new features I worked on was the addition of what we termed "shadow migration." When we launched the product, there was no integrated way to migrate data from existing systems to the new systems. This resulted in customers rolling their own solutions by hand (rsync, tar, etc.) or paying for professional services to do the work for them. We felt we could present a superior model that would provide a more integrated experience as well as let the customer leverage the investment in the system even before the migration was complete.

The idea in and of itself is not new, and various prototypes of it have been kicking around inside of Sun under various monikers ("brain slug", "hoover", etc.) without ever becoming a complete product. When Adam and I sat down shortly before the initial launch of the product, we decided we could do this without too much work by integrating the functionality directly into the kernel. The basic design requirements were:

- We must be able to migrate over standard data protocols (NFS) from arbitrary data sources, without the need for special software running on the source system.
- Migrated data must be available before the entire migration is complete, and must be accessible with native performance.
- All the data needed to migrate the filesystem must be stored within the filesystem itself, and must not rely on an external database to ensure consistency.

With these requirements in hand, our key insight was that we could create a "shadow" filesystem that pulls data from the original source if necessary, but falls through to the native filesystem for reads and writes once a file has been migrated. What's more, we could leverage the NFS client on Solaris and do this entirely at the VFS (virtual filesystem) layer, allowing us to migrate data between shares locally or (eventually) over other protocols as well, without changing the interposition layer.
The other nice thing about this architecture is that the kernel remains ignorant of the larger migration process. Both synchronous requests (from clients) and background requests (from the management daemon) appear the same. This allows us to control policy within the userland software stack without pushing that complexity into the kernel. It also allows us to write a very comprehensive automated test suite that runs entirely on local filesystems, without needing a complex multi-system environment.

So what's better (and worse) about shadow migration compared to other migration strategies? For that, I'll defer to the documentation, which I've reproduced here for those who don't have a (virtual) system available to run the 2009.Q3 release:

Migration via synchronization

This method works by taking an active host X and migrating data to the new host Y while X remains active. Clients still read and write to the original host while this migration is underway. Once the data is initially migrated, incremental changes are repeatedly sent until the delta is small enough to be sent within a single downtime window. At that point the original share is made read-only, the final delta is sent to the new host, and all clients are updated to point to the new location. The most common way of accomplishing this is through the rsync tool, though other integrated tools exist. This mechanism has several drawbacks:

- The anticipated downtime, while small, is not easily quantified. If a user commits a large amount of change immediately before the scheduled downtime, the downtime window can grow.
- During migration, the new server is idle. Since new servers typically come with new features or performance improvements, this represents a waste of resources during a potentially long migration period.
- Coordinating across multiple filesystems is burdensome. When migrating dozens or hundreds of filesystems, each migration will take a different amount of time, and downtime will have to be scheduled across the union of all filesystems.

Migration via external interposition

This method works by taking an active host X and inserting a new appliance M that migrates data to a new host Y. All clients are updated at once to point to M, and data is automatically migrated in the background. This provides more flexibility in migration options (for example, being able to migrate to a new server in the future without downtime), and leverages the new server for already-migrated data, but it also has significant drawbacks:

- The migration appliance represents a new physical machine, with associated costs (initial investment, support costs, power and cooling) and additional management overhead.
- The migration appliance represents a new point of failure within the system.
- The migration appliance interposes on already-migrated data, incurring extra latency, often permanently. These appliances are typically left in place, though it would be possible to schedule another downtime window and decommission the migration appliance.

Shadow migration

Shadow migration uses interposition, but is integrated into the appliance and doesn't require a separate physical machine. When shares are created, they can optionally "shadow" an existing directory, either locally (see below) or over NFS. In this scenario, downtime is scheduled once: the source appliance X is placed into read-only mode, a share is created with the shadow property set, and clients are updated to point to the new share on the Sun Storage 7000 appliance. Clients can then access the appliance in read-write mode.

Once the shadow property is set, data is transparently migrated in the background from the source appliance.
If a request comes from a client for a file that has not yet been migrated, the appliance will automatically migrate the file to the local server before responding to the request. This may incur some initial latency for some client requests, but once a file has been migrated, all accesses are local to the appliance and have native performance. It is often the case that the current working set of a filesystem is much smaller than the total size, so once this working set has been migrated there will be no perceived impact on performance, regardless of the total size of the data on the source.

The downside to shadow migration is that it requires a commitment before the data has finished migrating, though this is the case with any interposition method. During the migration, portions of the data exist in two locations, which means that backups are more complicated, and snapshots may be incomplete and/or exist only on one host. Because of this, it is extremely important that any migration between two hosts first be tested thoroughly to make sure that identity management and access controls are set up correctly. This need not test the entire data migration, but it should verify that files or directories that are not world-readable are migrated correctly, that ACLs (if any) are preserved, and that identities are properly represented on the new system.

Shadow migration is implemented using on-disk data within the filesystem, so there is no external database and no data stored locally outside the storage pool. If a pool is failed over in a cluster, or both system disks fail and a new head node is required, all the data necessary to continue shadow migration without interruption is kept with the storage pool.

In a subsequent post, I'll discuss some of the thorny implementation details we had to solve, as well as provide some screenshots of migration in progress. In the meantime, I suggest folks download the simulator and upgrade to the latest software to give it a try.



ZFS, FMA, hotplug, and Fishworks

In the past, I've discussed the evolution of disk FMA. Much has been accomplished in the past year, but there are still several gaps when it comes to ZFS and disk faults. In Solaris today, a fault diagnosed by ZFS (a device failing to open, too many I/O errors, etc.) is reported as a pool name and a 64-bit vdev GUID. This description leaves something to be desired, referring the user to run zpool status to determine exactly what went wrong. And the user still has to know how to go from a cXtYdZ Solaris device name to a physical device; once they do locate the physical device, they need to manually issue a zpool replace command to initiate the replacement.

While this is annoying in the Solaris world, it's completely untenable in an appliance environment, where everything needs to "just work." With that in mind, I set about plugging the last few holes in the unified plan:

- ZFS faults must be associated with a physical disk, including the human-readable label.
- A disk fault (ZFS or SMART failure) must turn on the associated fault LED.
- Removing a disk (faulted or otherwise) and replacing it with a new disk must automatically trigger a replacement.

While these seem like straightforward tasks, as usual they are quite difficult to get right in a truly generic fashion. And for an appliance, there can be no Solaris commands or additional steps for the user. To start with, I needed to push the FRU information (expressed as an FMRI in the hc libtopo scheme) into the kernel (and onto the ZFS disk label), where it would be available with each vdev. While it is possible to do this correlation after the fact, doing it up front simplifies the diagnosis engine and is required for automatic device replacement. There are some edge conditions around moving and swapping disks, but it was relatively straightforward. Once the FMRI was there, I could include the FRU in the fault suspect list and, using Mike's enhancements to libfmd_msg, dynamically insert the FRU label into the fault message.
Traditional FMA libtopo labels do not include the chassis label, so in the Fishworks stack we go one step further and rewrite the label on receipt of a fault event to include the user-defined chassis name as well as the physical slot. This message is then used when posting alerts and on the problems page. We can also link to the physical device from the problems page and highlight the faulty disk in the hardware view.

With the FMA plumbing now straightened out, I needed a way to light the fault LED for a disk, regardless of whether it was in the system chassis or an external enclosure. Thanks to Rob's sensor work, libtopo already presents an FMRI-centric view of indicators in a platform-agnostic manner. So I rewrote the disk-monitor module (or really, deleted everything and created a new fru-monitor module) to both poll for FRU hotplug events and manage the fault LEDs for components. When a fault is generated, the FRU monitor looks through the suspect list and turns on the fault LED for any component that has a supported indicator. The LED is turned off when the corresponding repair event is generated. This also had the side benefit of generating hotplug events phrased in terms of physical devices, which the appliance kit can use to easily present informative messages to the user.

Finally, I needed to get disk replacement to work the way everyone expects it to: remove a faulted disk, put in a new one, and walk away. The genesis of this functionality was putback to ON long ago as the autoreplace pool property. In Solaris, this functionality only works with disks that have static device paths (namely SATA). In the world of multipathed SAS devices, the device path is really a scsi_vhci node identified by the device WWN. If we remove a disk and insert a new one, it will appear as a new device with no way to correlate it to the previous instance, preventing us from replacing the correct vdev.
What we need is physical slot information, which happens to be provided by the FMRI we are already storing with the vdev for FMA purposes. When we receive a sysevent for a new device addition, we look at the latest libtopo snapshot and take the FMRI of the newly inserted device. By looking at the current vdev FRU information, we can associate the new device with the vdev that was previously in that slot, and automatically trigger the replacement.

This process took a lot longer than I would have hoped, and has many more subtleties too boring even for a technical blog entry, but it is nice to sit back and see a user experience that is intuitive, informative, and straightforward - the hallmarks of an integrated appliance solution.
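For the curious, the slot-matching step can be summarized with a small sketch. The data structures and FMRI strings below are hypothetical stand-ins; the real code walks libtopo snapshots and ZFS vdev labels:

```python
# Sketch of automatic replacement: when a new disk appears, find the vdev
# whose stored FRU (physical-slot FMRI) matches the new disk's slot, and
# trigger a replacement of that vdev. Names and FMRIs are illustrative.

def find_vdev_to_replace(vdevs, new_disk_fmri):
    """vdevs: list of dicts with 'guid' and 'fru' (slot FMRI from the label)."""
    for vdev in vdevs:
        if vdev["fru"] == new_disk_fmri:
            return vdev["guid"]   # same physical slot: replace this vdev
    return None                   # unknown slot: leave it to the administrator

vdevs = [
    {"guid": 0x1111, "fru": "hc://:chassis=0/bay=4/disk=0"},
    {"guid": 0x2222, "fru": "hc://:chassis=0/bay=5/disk=0"},
]
assert find_vdev_to_replace(vdevs, "hc://:chassis=0/bay=5/disk=0") == 0x2222
assert find_vdev_to_replace(vdevs, "hc://:chassis=1/bay=0/disk=0") is None
```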



Fishworks Storage Configuration

Since our initial product was going to be a NAS appliance, we knew early on that storage configuration would be a critical part of the initial Fishworks experience. Thanks to the power of ZFS storage pools, we have the ability to present a radically simplified interface, where the storage "just works" and the administrator doesn't need to worry about choosing RAID stripe widths or statically provisioning volumes. The first decision was to create a single storage pool (or really one per head in a cluster)[1], which means that the administrator only needs to make this decision once and doesn't have to worry about it every time they create a filesystem or LUN.

Within a storage pool, we didn't want the user to be in charge of making decisions about RAID stripe widths, hot spares, or allocation of devices. This was primarily to avoid complexity, but it also reflects the fact that we (as designers of the system) know more about its characteristics than you do:

- RAID stripe width affects performance in ways that are not immediately obvious.
- Allowing for JBOD failure requires careful selection of stripe widths.
- Allocation of devices can take into account environmental factors (balancing HBAs, fan groups, backplane distribution) that are unknown to the user.

To make this easy for the user, we provide several different profiles that define parameters which are then applied to the current configuration to figure out how the ZFS pool should be laid out.

Before selecting a profile, we ask the user to verify the storage they want to configure. On a standalone system, this is just a check to make sure nothing is broken. If there is a broken or missing disk, we don't let you proceed without explicit confirmation. The reason we do this is that once the storage pool is configured, there is no way to add those disks to the pool without changing the RAS and performance characteristics you specified during configuration.
On a 7410 with multiple JBODs, this verification step is slightly more complicated, as we allow adding whole or half JBODs. This step is where you can choose to allocate half or all of a JBOD to a pool, allowing you to split storage in a cluster or reserve unused storage for future clustering options.

Fundamentally, the choice of redundancy is a business decision. There is a set of tradeoffs that express your tolerance of risk and relative cost. As Jarod told us very early on in the project: "fast, cheap, or reliable - pick two." We took this to heart, and our profiles are displayed in a table with qualitative ratings on performance, capacity, and availability. To further help with the decision, we provide a human-readable description of the layout, as well as a pie chart showing the way raw storage will be used (data, parity, spares, or reserved). The last profile parameter is called "NSPF," for "no single point of failure." If you are on a 7410 with multiple JBODs, some profiles can be applied across JBODs such that the loss of any one JBOD cannot cause data loss[2]. This often forces arbitrary stripe widths (with 6 JBODs your only choice is 10+2) and can result in less capacity, but with superior RAS characteristics.

This configuration takes just two quick steps, and for the common case (where all the hardware is working and the user wants double-parity RAID), it just requires clicking the "DONE" button twice. We also support adding additional storage (on the 7410), as well as unconfiguring and importing storage. I'll leave a complete description of the storage configuration screen for a future entry.

[1] A common question we get is "why allow only one storage pool?" The actual implementation clearly allows it (as in the failed-over active-active cluster), so it's purely an issue of complexity. There is never a reason to create multiple pools that share the same redundancy profile - this provides no additional value at the cost of significant complexity.
We do acknowledge that mirroring and RAID-Z provide different performance characteristics, but we hope that with the ability to turn readzilla and (eventually) logzilla usage on and off on a per-share basis, this will be less of an issue. In the future, you may see support for multiple pools, but only in a limited fashion (i.e. enforcing different redundancy profiles).

[2] It's worth noting that all supported configurations of the 7410 have multiple paths to all JBODs across multiple HBAs. So even without NSPF, we have the ability to survive HBA, cable, and JBOD controller failure.
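The capacity breakdown shown in the pie chart is straightforward arithmetic once a profile's parameters are fixed. Here is a hedged sketch - the parameters and the leftover-disk policy below are illustrative, not the appliance's exact layout rules:

```python
# Sketch: given a number of equal-size disks and profile parameters, estimate
# how raw storage divides into data, parity, spares, and reserved disks.
# Illustrative arithmetic only; the real profiles encode more constraints
# (JBOD balancing, NSPF, HBA distribution).

def capacity_breakdown(ndisks, stripe_width, parity_per_stripe, spares):
    usable = ndisks - spares
    stripes = usable // stripe_width
    data = stripes * (stripe_width - parity_per_stripe)
    parity = stripes * parity_per_stripe
    reserved = usable - stripes * stripe_width  # disks that don't fill a stripe
    return {"data": data, "parity": parity, "spares": spares,
            "reserved": reserved}

# e.g. a hypothetical double-parity layout on 46 disks, 12-wide (10+2), 2 spares:
b = capacity_breakdown(46, stripe_width=12, parity_per_stripe=2, spares=2)
assert b == {"data": 30, "parity": 6, "spares": 2, "reserved": 8}
```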



So how much work was it?

With any product, there is always some talk from the enthusiasts about how they could do it faster, cheaper, or simpler. Inevitably, there's a little bit of truth to both sides. Enthusiasts have been doing homebrew NAS for as long as free software has been around, but it takes far more work to put together a complete, polished solution that stands up under the stress of an enterprise environment. One of the amusing things I like to do is to look back at the total amount of source code we wrote. Lines of source code by itself is obviously not a measure of complexity - it's possible to write complex software with very few lines of source, or simple software that's over-engineered - but it's an interesting measure nonetheless. Below is the current output of a little script I wrote to count lines of code[1] in our fish-gate. This does not include the approximately 40,000 lines of change made to the ON (core Solaris) gate, most of which we'll be putting back gradually over the next few months.

C (libak)                185386  # The core of the appliance kit
C (lib)                   12550  # Other libraries
C (fcc)                   11167  # A compiler adapted from dtrace
C (cmd)                   12856  # Miscellaneous utilities
C (uts)                    4320  # clustron driver
-----------------------  ------
Total C                  226279

JavaScript (web)          69329  # Web UI
JavaScript (shell)        24227  # CLI shell
JavaScript (common)        9354  # Shared javascript
JavaScript (crazyolait)    2714  # Web transport layer (adapted from jsolait)
JavaScript (tst)          40991  # Automated test code
-----------------------  ------
Total JavaScript         146615

Shell (lib)                4179  # Support scripts (primarily SMF methods)
Shell (cmd)                5295  # Utilities
Shell (tools)              6112  # Build tools
Shell (tst)                6428  # Automated test code
-----------------------  ------
Total Shell               22014

Python (tst)              34106  # Automated test code
XML (metadata)            16975  # Internal metadata
CSS                        6124  # Stylesheets

[1] This is a raw line count. It includes blank lines and comments, so interpret it as you see fit.



Fishworks Hardware Topology

It's hard to believe that this day has finally come. After more than two and a half years, our first Fishworks-based product has been released. You can keep up to date with the latest info at the Fishworks blog.

For my first technical post, I thought I'd give an introduction to the chassis subsystem at the heart of our hardware integration strategy. This subsystem is responsible for gathering, cataloging, and presenting a unified view of the hardware topology. It underwent two major rewrites (one by myself and one by Keith) but the fundamental design has remained the same. While it may not be the most glamorous feature (no one's going to purchase a box because they can get model information on their DIMMs), I found it an interesting cross-section of disparate technologies and awash in subtle complexity. You can find a video of myself talking about and demonstrating this feature here.

libtopo discovery

At the heart of the chassis subsystem is the FMA topology as exported by libtopo. This library is already capable of enumerating hardware in a physically meaningful manner, and FMRIs (fault managed resource identifiers) form the basis of FMA fault diagnosis. This alone provides us the following basic capabilities:

- Discover external storage enclosures
- Identify bays and disks
- Identify CPUs
- Identify power supplies and fans
- Manage LEDs
- Identify PCI functions beneath a particular slot

Much of this requires platform-specific XML files, or leverages IPMI behind the scenes, but this minimal integration work is common to Solaris.
Any platform supported by Solaris is supported by the Fishworks software stack.

Additional metadata

Unfortunately, this falls short of a complete picture:

- No way to identify absent CPUs, DIMMs, or empty PCI slots
- DIMM enumeration not supported on all platforms
- Human-readable labels often wrong or missing
- No way to identify complete PCI cards
- No integration with visual images of the chassis

To address these limitations (most of which lie outside the purview of libtopo), we leverage additional metadata for each supported chassis. This metadata identifies all physical slots (even those that may not be occupied), cleans up various labels, and includes visual information about the chassis and its components. And we can identify physical cards based on devinfo properties extracted from firmware and/or the pattern of PCI functions and their attributes (a process worthy of its own blog entry). Combined with libtopo, we have images that we can assemble into a complete view based on the current physical layout, highlight components within the image, and respond to user mouse clicks.

Supplemental information

However, we are still missing many of the component details. Our goal is to be able to provide complete information for every FRU on the system. With just libtopo, we can get this for disks but not much else. We need to look to alternate sources of information.

kstat

For CPUs, there is a rather rich set of information available via traditional kstat interfaces. While we use libtopo to identify CPUs (it lets us correlate physical CPUs), the bulk of the information comes from kstats. This is used to get model, speed, and the number of cores.

libdevinfo

The device tree snapshot provides additional information for PCI devices that can only be retrieved by private driver interfaces. Despite the existence of a VPD (Vital Product Data) standard, effectively no vendors implement it. Instead, it is read by some firmware-specific mechanism private to the driver.
By exporting these as properties in the devinfo snapshot, we can transparently pull in dynamic FRU information for PCI cards. This is used to get model, part, and revision information for HBAs and 10G NICs.

IPMI

IPMI (Intelligent Platform Management Interface) is used to communicate with the service processor on most enterprise-class systems. It is used within libtopo for power supply and fan enumeration as well as LED management. But IPMI also supports FRU data, which includes a lot of juicy tidbits that only the SP knows. We reference this FRU information directly to get model and part information for power supplies and DIMMs.

SMBIOS

Even with IPMI, there are bits of information that exist only in SMBIOS, a standard that is supposed to provide information about the physical resources on the system. Sadly, it does not provide enough information to correlate OS-visible abstractions with their underlying physical counterparts. With metadata, however, we can use SMBIOS to make this correlation. This is used to enumerate DIMMs on platforms not supported by libtopo, and to supplement DIMM information with data available only via SMBIOS.

Metadata

Last but not least, there is chassis-specific metadata. Some components simply don't have FRUID information, either because they are too simple (fans) or there exists no mechanism to get the information (most PCI cards). In this situation, we use metadata to provide vendor, model, and part information, as that is generally static for a particular component within the system. We cannot get information specific to the component (such as a serial number), but at least the user will be able to know what it is and how to order another one.

Putting it all together

With all of this information tied together under one subsystem, we can finally present the user complete information about their hardware, including images showing the physical layout of the system.
In addition, this also forms the basis for reporting problems and analytics (using labels from metadata), manipulating chassis state (toggling LEDs, setting chassis identifiers), and making programmatic distinctions about the hardware (such as whether external HBAs are present). Over the next few weeks I hope to expound on some of these details in further blog posts.
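As a sketch of the aggregation the chassis subsystem performs, the Python fragment below (illustrative only; the real subsystem works over libtopo snapshots, kstats, devinfo, IPMI, and SMBIOS) merges partial FRU records from several providers, with earlier, more authoritative sources taking precedence:

```python
def merge_fru(*sources):
    """Combine partial FRU records into one view; earlier sources win
    on conflicting keys, so list providers in order of authority."""
    merged = {}
    for src in reversed(sources):
        merged.update(src)
    return merged
```

For example, a libtopo-style record might supply the physical label, a kstat-style record the model and core count, and static metadata a fallback vendor.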



SES Sensors and Indicators

Last week, Rob Johnston and I coordinated two putbacks to Solaris to further the cause of Solaris platform integration, this time focusing on sensors and indicators. Rob has a great blog post with an overview of the new sensor abstraction layer in libtopo. Rob did most of the hard work - my contribution consisted only of extending the SES enumerator to support the new facility infrastructure. You can find a detailed description of the changes in the original FMA portfolio here, but it's much easier to understand via demonstration. This is the fmtopo output for a fan node in a J4400 JBOD:

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
    label             string    Cooling Fan 0
    FRU               fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f
    target-path       string    /dev/es/ses3

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=ident
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=ident
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    type              uint32    0x1 (LOCATE)
    mode              uint32    0x0 (OFF)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=fail
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=fail
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    type              uint32    0x0 (SERVICE)
    mode              uint32    0x0 (OFF)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=speed
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=speed
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x4 (FAN)
    units             uint32    0x12 (RPM)
    reading           double    3490.000000
    state             uint32    0x0 (0x00)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=fault
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=fault
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    discrete
    type              uint32    0x103 (GENERIC_STATE)
    state             uint32    0x1 (DEASSERTED)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

Here you can see the available indicators (locate and service), the fan speed (3490 RPM), and whether the fan is faulted.
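The FMRI strings above have a regular shape - an authority section of key=value pairs, an hc path, and an optional ?indicator= or ?sensor= facility suffix - which a few lines of Python can pick apart (a sketch for illustration, not how libtopo actually parses FMRIs):

```python
def parse_hc_fmri(fmri):
    """Split an hc:// FMRI into authority key/value pairs, the hc path,
    and an optional (type, name) facility from a ?indicator= or
    ?sensor= suffix."""
    if not fmri.startswith("hc://"):
        raise ValueError("not an hc FMRI: " + fmri)
    rest = fmri[len("hc://"):]
    facility = None
    if "?" in rest:
        rest, query = rest.split("?", 1)
        ftype, _, fname = query.partition("=")
        facility = (ftype, fname)
    authority, _, path = rest.partition("/")
    pairs = {}
    for field in authority.lstrip(":").split(":"):
        key, _, value = field.partition("=")
        pairs[key] = value
    return pairs, path, facility
```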
Right now this is just interesting data for savvy administrators to play with, as it's not used by any software. But that will change shortly, as we work on the next phases:

- Monitoring of sensors to detect failure in external components which have no visibility in Solaris outside libtopo, such as power supplies and fans. This will allow us to generate an FMA fault when a power supply or fan fails, regardless of whether it's in the system chassis or an external enclosure.
- Generalization of the disk-monitor fmd plugin to support arbitrary disks. This will control the failure indicator in response to FMA-diagnosed faults.
- Correlation of ZFS faults with the associated physical disk. Currently, ZFS faults are against a "vdev" - a ZFS-specific construct. The user is forced to translate from this vdev to a device name, and then use the normal (i.e. painful) methods to figure out which physical disk was affected. With a little work it's possible to include the physical disk in the FMA fault to avoid this step, and also allow the fault LED to be controlled in response to ZFS-detected faults.
- Expansion of the SCSI framework to support native diagnosis of faults, instead of a stream of syslog messages. This involves generating telemetry in a way that can be consumed by FMA, as well as a diagnosis engine to correlate these ereports with an associated fault.

Even after we finish all of these tasks and reach the nirvana of a unified storage management framework, there will still be lots of open questions about how to leverage the sensor framework in interesting ways, such as a prtdiag-like tool for assembling sensor information, or threshold alerts for non-critical warning states. But with these latest putbacks, it feels like our goals from two years ago are actually within reach, and that I will finally be able to turn on that elusive LED.
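The vdev-to-disk correlation amounts to a lookup: given a table mapping vdev guids to physical locations from the topology snapshot, a ZFS fault can carry the bay and drive its fault LED directly. A toy Python version (hypothetical data and names, not the fmd implementation):

```python
# Hypothetical vdev-guid -> physical-location table; on a real system
# this would be derived from the libtopo snapshot.
VDEV_LOCATION = {0x1122: ("ses-enclosure=1", "bay=7")}

def react_to_vdev_fault(guid, leds):
    """Translate a ZFS vdev fault into a physical location and record
    its fault LED as lit; return None when the vdev can't be placed."""
    loc = VDEV_LOCATION.get(guid)
    if loc is not None:
        leds[loc] = "on"
    return loc
```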



External storage enclosures in Solaris

Over the past few years, I've been working on various parts of Solaris platform integration, with an emphasis on disk monitoring. While the majority of my time has been focused on fishworks, I have managed to implement a few more pieces of the original design.

About two months ago, I integrated the libscsi and libses libraries into Solaris Nevada. These libraries, originally written by Keith Wesolowski, form an abstraction layer upon which higher-level software can be built. The modular nature of libses makes it easy to extend with vendor-specific support libraries in order to provide additional information and functionality not present in the SES standard, something difficult to do with the kernel-based ses(7d) driver. And since it is written in userland, it is easy to port to other operating systems. This library is used as part of the fwflash firmware upgrade tool, and will be used in future Sun storage management products.

While libses itself is an interesting platform, its true raison d'être is to serve as the basis for enumeration of external enclosures as part of libtopo. Enumeration of components in a physically meaningful manner is a key component of the FMA strategy. These components form FMRIs (fault managed resource identifiers) that are the target of diagnoses. These FMRIs provide a way of not just identifying that "disk c1t0d0 is broken", but that this device is actually in bay 17 of the storage enclosure whose chassis serial number is "2029QTF0809QCK012". In order to do that effectively, we need a way to discover the physical topology of the enclosures connected to the system (chassis and bays) and correlate it with the in-band I/O view of the devices (SAS addresses). This is where SES (SCSI enclosure services) comes into play. SES processes show up as targets in the SAS fabric, and by using the additional element status descriptors, we can correlate physical bays with the attached devices under Solaris.
In addition, we can also enumerate components not directly visible to Solaris, such as fans and power supplies. The SES enumerator was integrated in build 93 of Nevada, and all of these components now show up in the libtopo hardware topology (commonly referred to as the "hc scheme"). To do this, we walk over all the SES targets visible to the system, grouping targets into logical chassis (something that is not as straightforward as it should be). We use this list of targets and a snapshot of the Solaris device tree to fill in which devices are present on the system. You can see the result by running fmtopo on a build 93 or later Solaris machine:

# /usr/lib/fm/fmd/fmtopo
...
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:serial=2029QTF0000000002:part=Storage-J4400:revision=3R13/ses-enclosure=0
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:part=123-4567-01/ses-enclosure=0/psu=0
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:part=123-4567-01/ses-enclosure=0/psu=1
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=0
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=1
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=2
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=3
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=2029QTF0811RM0386:part=375-3584-01/ses-enclosure=0/controller=0
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=2029QTF0811RM0074:part=375-3584-01/ses-enclosure=0/controller=1
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/bay=0
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/bay=1
...

To really get all the details, you can use the '-V' option to fmtopo to dump all available properties:

# fmtopo -V '*/ses-enclosure=0/bay=0/disk=0'
TIME                 UUID
Jul 14 03:54:23 3e95d95f-ce49-4a1b-a8be-b8d94a805ec8

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
    ASRU              fmri      dev:///:devid=id1,sd@TATA_____SEAGATE_ST37500NSSUN750G_0720A0PC3X_____5QD0PC3X____________//scsi_vhci/disk@gATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3X
    label             string    SCSI Device 0
    FRU               fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0809QCK012
    server-id         string
  group: io                             version: 1   stability: Private/Private
    devfs-path        string    /scsi_vhci/disk@gATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3X
    devid             string    id1,sd@TATA_____SEAGATE_ST37500NSSUN750G_0720A0PC3X_____5QD0PC3X____________
    phys-path         string[]  [ /pci@0,0/pci10de,377@a/pci1000,3150@0/disk@1c,0 /pci@0,0/pci10de,375@f/pci1000,3150@0/disk@1c,0 ]
  group: storage                        version: 1   stability: Private/Private
    logical-disk      string    c0tATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3Xd0
    manufacturer      string    SEAGATE
    model             string    ST37500NSSUN750G 0720A0PC3X
    serial-number     string    5QD0PC3X
    firmware-revision string    3.AZK
    capacity-in-bytes string    750156374016

So what does this mean, other than providing a way for you to finally figure out where disk 'c3t0d6' is really located?
Currently, it allows the disks to be monitored by the disk-transport fmd module to generate faults based on predictive failure, over temperature, and self-test failure. The really interesting part is where we go from here. In the near future, thanks to work by Rob Johnston on the sensor framework, we'll have the ability to manage LEDs for disks that are part of external enclosures, diagnose failures of power supplies and fans, and read sensor data (such as fan speeds and temperature) as part of a unified framework.

I often like to joke about the amount of time that I have spent just getting a single LED to light. At first glance, it seems like a pretty simple task. But to do it in a generic fashion that can be generalized across a wide variety of platforms, correlated with physically meaningful labels, and incorporate a diverse set of diagnoses (ZFS, SCSI, HBA, etc) requires an awful lot of work. Once it's all said and done, however, future platforms will require little to no integration work, and you'll be able to see a bad drive generate checksum errors in ZFS, resulting in an FMA diagnosis indicating the faulty drive, activating a hot spare, and lighting the fault LED on the drive bay (wherever it may be). Only then will we have accomplished our goal of an end-to-end storage strategy for Solaris - and hopefully someone besides me will know what it has taken to get that little LED to light.
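The bay-to-device correlation described in this post boils down to joining two views on the SAS address. A toy Python version (invented data; the real code reads SES additional element status descriptors and the Solaris device tree snapshot):

```python
def correlate_bays(bay_addrs, os_disks):
    """Join SES bays to OS-visible disks on SAS address, yielding a
    map from bay number to logical disk name. Bays whose address has
    no matching OS device (e.g. empty bays) are simply omitted."""
    by_addr = {d["sas-address"]: d["name"] for d in os_disks}
    return {bay: by_addr[addr] for bay, addr in bay_addrs.items()
            if addr in by_addr}
```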



Solaris platform integration - disk monitoring

Two weeks ago I putback PSARC 2007/202, the second step in generalizing the x4500 disk monitor. As explained in my previous blog post, one of the tasks of the original sfx4500-disk module was reading SMART data from disks and generating associated FMA faults. This platform-specific functionality needed to be generalized to effectively support future Sun platforms.

This putback did not add any new user-visible features to Solaris, but it did refactor the code in the following ways:

- A new private library, libdiskstatus, was added. This generic library uses uSCSI to read data from SCSI (or SATA via emulation) devices. It is not a generic SMART monitoring library, focusing only on the three generally available disk faults: over temperature, predictive failure, and self-test failure. There is a single function, disk_status_get(), that returns an nvlist describing the current parameters reported by the drive and whether any faults are present.
- This library is used by the SATA libtopo module to export a generic TOPO_METH_DISK_STATUS method. This method keeps all the implementation details within libtopo and exports a generic interface for consumers.
- A new fmd module, disk-transport, periodically iterates over libtopo nodes and invokes the TOPO_METH_DISK_STATUS method on any supported nodes. The module generates FMA ereports for any detected errors.
- These ereports are translated to faults by a simple eversholt DE. These are the same faults that were originally generated by the sfx4500-disk module, so the code that consumes them remains unchanged.

These changes form the foundation that will allow future Sun platforms to detect and react to disk failures, eliminating 5200 lines of platform-specific code in the process. The next major steps are currently in progress:

- The FMA team, as part of the sensor framework, is expanding libtopo to include the ability to represent indicators (LEDs) in a generic fashion. This will replace the x4500-specific properties and associated machinery with generic code.
- The SCSI FMA team is finalizing the libtopo enumeration work that will allow arbitrary SCSI devices (not just SATA) to be enumerated under libtopo and therefore be monitored by the disk-transport module. The first phase will simply replicate the existing sfx4500-disk functionality, but will enable us to model future non-SATA platforms as well as external storage devices.
- Finally, I am finishing up my long-overdue ZFS FMA work, a necessary step towards connecting ZFS and disk diagnosis.

Stay tuned for more info.
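In spirit, the nvlist returned by disk_status_get() carries the drive's current readings plus a flag per generic fault. A Python mock-up of that shape (hypothetical field names and threshold; the real library speaks uSCSI to the drive):

```python
TEMP_LIMIT_C = 60  # illustrative threshold, not a value from any real drive

def disk_status(temp_c, predictive_failure, selftest_failed):
    """Mock-up of the kind of structure disk_status_get() is described
    as returning: current readings plus which of the three generic
    disk faults (over temperature, predictive failure, self-test
    failure) are present."""
    faults = {
        "over-temperature": temp_c > TEMP_LIMIT_C,
        "predictive-failure": predictive_failure,
        "self-test-failure": selftest_failed,
    }
    return {
        "temperature-c": temp_c,
        "faults": faults,
        "faulted": any(faults.values()),
    }
```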



Solaris platform integration - libipmi

As I continue down the path of improving various aspects of ZFS and Solaris platform integration, I found myself in the thumper (x4500) fmd platform module. This module represents the latest attempt at Solaris platform integration, and an indication of where we are headed in the future.

When I say "platform integration", this is more involved than the platform support most people typically think of. The platform teams make sure that the system boots and that all the hardware is supported properly by Solaris (drivers, etc). Thanks to the FMA effort, platform teams must also deliver an FMA portfolio which covers FMA support for all the hardware and a unified serviceability plan. Unfortunately, there is still more work to be done beyond this, of which the most important is interacting with hardware in response to OS-visible events. This includes the ability to light LEDs in response to faults and device hotplug, as well as monitoring the service processor and keeping external FRU information up to date.

The sfx4500-disk module is the latest attempt at providing this functionality. It does the job, but is afflicted by the same problems that often plague platform integration attempts. It's overcomplicated, monolithic, and much of what it does should be generic Solaris functionality. Among the things this module does:

- Reads SMART data from disks and creates ereports
- Diagnoses ereports into corresponding disk faults
- Implements an IPMI interface directly on top of /dev/bmc
- Responds to disk faults by turning on the appropriate 'fault' disk LED
- Listens for hotplug and DR events, updating the 'ok2rm' and 'present' LEDs
- Updates SP-controlled FRU information
- Monitors the service processor for resets and resyncs necessary information

Needless to say, every single item on the above list is applicable to a wide variety of Sun platforms, not just the x4500, and it certainly doesn't need to be in a single monolithic module. This is not meant to be a slight against the authors of the module.
As with most platform integration activities, this effort wasn't communicated by the hardware team until far too late, resulting in an unrealistic schedule with millions of dollars of revenue behind it. It doesn't help that all these features need to be supported on Solaris 10, making the schedule pressure all the more acute, since the code must soak in Nevada and then be backported in time for the product release. In these environments even the most fervent pleas for architectural purity tend to fall on deaf ears, and the engineers doing the work quickly find themselves between a rock and a hard place.

As I was wandering through this code and thinking about how this would interact with ZFS and future Sun products, it became clear that it needed a massive overhaul. More specifically, it needed to be burned to the ground and rebuilt as a set of distinct, general-purpose components. Since refactoring 12,000 lines of code with such a variety of different functions is non-trivial and difficult to test, I began by factoring out different pieces individually, redesigning the interfaces and re-integrating them into Solaris on a piece-by-piece basis.

Of all the functionality provided by the module, the easiest thing to separate was the IPMI logic. The Intelligent Platform Management Interface is a specification for communicating with service processors to discover and control available hardware. Sadly, it's anything but "intelligent". If you had asked me a year ago what I'd be doing at the beginning of this year, I'm pretty sure that reading the IPMI specification would have been at the bottom of my list (right below driving stakes through my eyeballs). Thankfully, the IPMI functionality needed was very small, and the best choice was a minimally functional private library, designed solely for the purpose of communicating with the service processor on supported Sun platforms.
Existing libraries such as OpenIPMI were too complicated, and in their efforts to present a generic abstracted interface, didn't provide what we really needed. The design goals are different, and the ON-private IPMI library and OpenIPMI will continue to develop and serve different purposes in the future.

Last week I finally integrated libipmi. In the process, I eliminated 2,000 lines of platform-specific code and created a common interface that can be leveraged by other FMA efforts and future projects. It is provided for both x86 and SPARC, even though there are currently no supported SPARC machines with an IPMI-capable service processor (this is being worked on). This library is private and evolving quite rapidly, so don't use it in any non-ON software unless you're prepared to keep up with a changing API.

As part of this work, I also created a common fmd module, sp-monitor, that monitors the service processor, if present, and generates a new ESC_PLATFORM_RESET sysevent to notify consumers when the service processor is reset. The existing sfx4500-disk module then consumes this sysevent instead of monitoring the service processor directly.

This is the first of many steps towards eliminating this module in its current form, as well as laying groundwork for future platform integration work. I'll post updates to this blog with information about generic disk monitoring, libtopo indicators, and generic hotplug management as I add this functionality. The eventual goal is to reduce the platform-specific portion of this module to a single .xml file delivered via libtopo that all these generic consumers will use to provide the same functionality that's present on the x4500 today. Only at this point can we start looking towards future applications, some of which I will describe in upcoming posts.
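The sp-monitor pattern - poll the service processor, notice a reset, post an event for consumers - can be sketched generically in Python (hypothetical boot-id interface, my own illustration; the real module uses libipmi and sysevents):

```python
def poll_sp(prev_boot_id, read_boot_id, post_event):
    """One polling step: if the SP's boot identifier changed since the
    last poll, the SP was reset, so post a platform-reset event (the
    role ESC_PLATFORM_RESET plays in the text above). Returns the
    current boot id to carry into the next poll."""
    current = read_boot_id()
    if prev_boot_id is not None and current != prev_boot_id:
        post_event("platform-reset")
    return current
```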



DTrace sysevent provider

I've been heads down for a long time on a new project, but occasionally I do put something back to ON worth blogging about. Recently I've been working on some problems which leverage sysevents (libsysevent(3LIB)) as a common transport mechanism. While trying to understand exactly what sysevents were being generated from where, I found the lack of observability astounding. After poking around with DTrace, I found that tracking down the exact semantics was not exactly straightforward. First of all, we have two orthogonal sysevent mechanisms: the original syseventd legacy mechanism, and the more recent general purpose event channel (GPEC) mechanism, used by FMA. On top of this, the sysevent_impl_t structure isn't exactly straightforward, because all the data is packed together in a single block of memory. Knowing that this would be important for my upcoming work, I decided that adding a stable DTrace sysevent provider would be useful.

The provider has a single probe, sysevent:::post, which fires whenever a sysevent post attempt is made. It doesn't necessarily indicate that the sysevent was successfully queued or received. The probe has the following semantics:

# dtrace -lvP sysevent
   ID   PROVIDER            MODULE                          FUNCTION NAME
44528   sysevent           genunix                    queue_sysevent post

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Evolving
                Data Semantics:   Evolving
                Dependency Class: ISA

        Argument Types
                args[0]: syseventchaninfo_t *
                args[1]: syseventinfo_t *

The 'syseventchaninfo_t' translator has a single member, 'ec_name', which is the name of the event channel. If this is being posted via the legacy sysevent mechanism, then this member will be NULL. The 'syseventinfo_t' translator has three members, 'se_publisher', 'se_class', and 'se_subclass'. These mirror the arguments to sysevent_post().
The following script will dump all sysevents posted to syseventd(1M):

    #!/usr/sbin/dtrace -s

    #pragma D option quiet

    BEGIN
    {
            printf("%-30s %-20s %s\n", "PUBLISHER", "CLASS", "SUBCLASS");
    }

    sysevent:::post
    /args[0]->ec_name == NULL/
    {
            printf("%-30s %-20s %s\n", args[1]->se_publisher,
                args[1]->se_class, args[1]->se_subclass);
    }

And the output during a 'cfgadm -c unconfigure':

    PUBLISHER                      CLASS                SUBCLASS
    SUNW:usr:devfsadmd:100237      EC_dev_remove        disk
    SUNW:usr:devfsadmd:100237      EC_dev_branch        ESC_dev_branch_remove
    SUNW:kern:ddi                  EC_devfs             ESC_devfs_devi_remove

This has already proven quite useful in my ongoing work, and hopefully some other developers out there will also find it useful.



New ZFS Features

I've been meaning to get around to blogging about these features that I putback a while ago, but have been caught up in a few too many things. In any case, the following new ZFS features were putback to build 48 of Nevada, and should be available in the next Solaris Express.

Create Time Properties

An old RFE has been to provide a way to specify properties at create time. For users, this simplifies administration by reducing the number of commands which need to be run. It also allows some race conditions to be eliminated. For example, if you want to create a new dataset with a mountpoint of 'none', you first have to create it and the underlying inherited mountpoint, only to remove it later by invoking 'zfs set mountpoint=none'.

From an implementation perspective, this allows us to unify our implementation of the 'volsize' and 'volblocksize' properties, and pave the way for future create-time only properties. Instead of having a separate ioctl() to create a volume and passing in the two size parameters, we simply pass them down as create-time options. The end result is pretty straightforward:

    # zfs create -o compression=on tank/home
    # zfs create -o mountpoint=/export -o atime=off tank/export

'canmount' property

The 'canmount' property allows you to create a ZFS dataset that serves solely as a mechanism for inheriting properties. When we first created the hierarchical dataset model, we had the notion of 'containers' - filesystems with no associated data. Only these datasets could contain other datasets, and you had to make the decision at create time. This turned out to be a bad idea for a number of reasons. It complicated the CLI, forced the user to make a create-time decision that could not be changed, and led to confusion when files were accidentally created on the underlying filesystem.
So we made every filesystem able to have child filesystems, and all seemed well.

However, there is power in having a dataset that exists in the hierarchy but has no associated filesystem data (or effectively none, by preventing it from being mounted). One can do this today by setting the 'mountpoint' property to 'none'. However, this property is inherited by child datasets, and the administrator cannot leverage the power of inherited mountpoints. In particular, some users have expressed desire to have two sets of directories, belonging to different ZFS parents (or even to UFS filesystems), share the same inherited directory. With the new 'canmount' property, this becomes trivial:

    # zfs create -o mountpoint=/export -o canmount=off tank/accounting
    # zfs create -o mountpoint=/export -o canmount=off tank/engineering
    # zfs create tank/accounting/bob
    # zfs create tank/engineering/anne

Now, both anne and bob have directories at '/export/', except that they are inheriting ZFS properties from different datasets in the hierarchy. The administrator may decide to turn compression on for one group of people or another, set a quota to limit the amount of space consumed by the group, or simply have a way to view the total amount of space consumed by each group without resorting to scripted du(1).

User Defined Properties

The last major RFE in this wad added the ability to set arbitrary properties on ZFS datasets. This provides a way for administrators to annotate their own filesystems, as well as for ISVs to layer intelligent software without having to modify the ZFS code to introduce a new property. A user-defined property name is one which contains a colon (:). This provides a unique namespace which is guaranteed not to overlap with native ZFS properties.
The intent is to use the colon to separate a module and property name, where 'module' should be a reverse DNS name. For example, a theoretical Sun backup product might do:

    # zfs set com.sun.sunbackup:frequency=1hr tank/home

The property value is an arbitrary string, and no additional validation is done on it. These values are always inherited. A local administrator might do:

    # zfs set localhost:backedup=9/19/06 tank/home
    # zfs list -o name,localhost:backedup
    NAME       LOCALHOST:BACKEDUP
    tank       -
    tank/home  9/19/06
    tank/ws    9/10/06

The hope is that this will serve as a basis for some innovative products and home grown solutions which interact with ZFS datasets in a well-defined manner.
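To make the namespace rule concrete, here is a small Python sketch (my own illustration, not the actual ZFS implementation; the function name and return shape are invented) of how a colon cleanly separates user-defined property names from the native namespace:

```python
def parse_user_property(name):
    """Classify a ZFS property name (illustrative sketch, not ZFS source).

    User-defined property names must contain a colon; everything before
    the last colon is treated as the module prefix (ideally a reverse
    DNS name), and everything after it as the property name.
    """
    if ":" not in name:
        # No colon: the name lives in the native property namespace.
        return None
    module, _, prop = name.rpartition(":")
    return {"module": module, "property": prop}

# A native property has no module prefix...
assert parse_user_property("compression") is None
# ...while 'com.sun.sunbackup:frequency' splits into module and property.
assert parse_user_property("com.sun.sunbackup:frequency") == {
    "module": "com.sun.sunbackup", "property": "frequency"}
```

Because native property names never contain a colon, the two namespaces can never collide, no matter what properties future ZFS versions add.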



ZFS on FreeBSD

More exciting news on the ZFS OpenSolaris front. In addition to the existing ZFS on FUSE/Linux work, we now have a second active port of ZFS, this time for FreeBSD. Pawel Dawidek has been hard at work, and has made astounding progress after just 10 days (!). This is a testament both to his ability and to the portability of ZFS. As with any port, the hard part comes down to integrating the VFS layer, but Pawel has already made good progress there. The current prototype can already mount filesystems, create files, and list directory contents. Of course, our code isn't completely without portability headaches, but thanks to Pawel (and Ricardo on FUSE/Linux), we can take patches and implement the changes upstream to ease future maintenance. You can find the FreeBSD repository here. If you're a FreeBSD developer or user, please give Pawel whatever support you can, whether it's code contributions, testing, or just plain old compliments. We'll be helping out where we can on the OpenSolaris side.

In related news, Ricardo Correia has made significant progress on the FUSE/Linux port. All the management functionality of zfs(1M) and zpool(1M) is there, and he's working on mounting ZFS filesystems. All in all, it's an exciting time, and we're all crossing our fingers that ZFS will follow in the footsteps of its older brother, DTrace.



ztest on Linux

As Jeff mentioned previously, Ricardo Correia has been working on porting ZFS to FUSE/Linux as part of Google SoC. Last week, Ricardo got libzpool and ztest running on Linux, which is a major first step for the project.

The interesting part is the set of changes that he had to make in order to get it working. libzpool was designed from the start to be run from both userland and the kernel, so we had already done most of the work of separating out the OS-dependent interfaces. The most prolific changes were to satisfy GCC warnings. We do compile ON with gcc, but not using the default options. I've since updated the ZFS porting page with info about gcc use in ON, which should make future ports easier. The second most common change involved header files that are available in both userland and the kernel on Solaris, but which should nevertheless be placed in zfs_context.h, concentrating platform-specific knowledge in that one file. Finally, there were some simple changes we could make (such as using pthread_create() instead of thr_create()) to make ports of the tools easier. It would also be helpful to have ports of libnvpair and libavl, much like some have done for libumem, so that developers don't have to continually port the same libraries over and over.

The next step (getting zfs(1M) and zpool(1M) working) is going to require significantly more changes to our source code. Unlike libzpool, these tools (libzfs in particular) were not designed to be portable. They include a number of Solaris-specific interfaces (such as zones and NFS shares) that will be totally different on other platforms. I look forward to seeing Ricardo's progress to learn how this will work out.
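The zfs_context.h idea - concentrating every platform difference in one file so the rest of the code stays portable - can be sketched in miniature. This is a hypothetical Python analogue of the pattern (module layout and names are my own invention, not ZFS code):

```python
# A zfs_context-style shim (illustrative): the rest of the code base
# imports names from this one module, and only this module knows about
# the underlying platform API.
import platform
import threading

IS_LINUX = platform.system() == "Linux"

def create_thread(func, *args):
    """Portable thread creation. Callers never touch the platform API
    directly, the same way ported ZFS code reaches thr_create() or
    pthread_create() only through its context layer."""
    t = threading.Thread(target=func, args=args)
    t.start()
    return t

results = []
create_thread(results.append, 42).join()
assert results == [42]
```

The payoff is the one described in the post: when a port hits a platform difference, the fix lands in a single well-known file instead of being scattered across the tree.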



ZFS Hot Spares

It's been a long time since the last time I wrote a blog entry. I've been working heads-down on a new project and haven't had the time to keep up my regular blogging. Hopefully I'll be able to keep something going from now on.

Last week the ZFS team put the following back to ON:

    PSARC 2006/223 ZFS Hot Spares
    PSARC 2006/303 ZFS Clone Promotion
    6276916 support for "clone swap"
    6288488 du reports misleading size on RAID-Z
    6393490 libzfs should be a real library
    6397148 fbufs debug code should be removed from buf_hash_insert()
    6405966 Hot Spare support in ZFS
    6409302 passing a non-root vdev via zpool_create() panics system
    6415739 assertion failed: !(zio->io_flags & 0x00040)
    6416759 ::dbufs does not find bonus buffers anymore
    6417978 double parity RAID-Z a.k.a. RAID6
    6424554 full block re-writes need not read data in
    6425111 detaching an offline device can result in import confusion

There are a couple of cool features mixed in here - most importantly hot spares, clone swap, and double-parity RAID-Z. I'll focus this entry on hot spares, since I wrote the code for that feature. If you want to see the original ARC case and some of the discussion behind the feature, you should check out the original zfs-discuss thread.

The following features make up hot spare support:

Associating hot spares with pools

Hot spares can be specified when creating a pool or adding devices by using the 'spare' vdev type. For example, you could create a mirrored pool with a single hot spare by doing:

    # zpool create test mirror c0t0d0 c0t1d0 spare c0t2d0
    # zpool status test
      pool: test
     state: ONLINE
     scrub: none requested
    config:

            NAME        STATE     READ WRITE CKSUM
            test        ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                c0t0d0  ONLINE       0     0     0
                c0t1d0  ONLINE       0     0     0
            spares
              c0t2d0    AVAIL

    errors: No known data errors

Notice that there is one spare, and it is currently available for use.
Spares can be shared between multiple pools, allowing for a single set of global spares on systems with multiple pools.

Replacing a device with a hot spare

There is now an FMA agent, zfs-retire, which subscribes to vdev failure faults and automatically initiates replacements if there are any hot spares available. But if you want to play around with this yourself (without forcibly faulting drives), you can just use 'zpool replace'. For example:

    # zpool offline test c0t0d0
    Bringing device c0t0d0 offline
    # zpool replace test c0t0d0 c0t2d0
    # zpool status test
      pool: test
     state: DEGRADED
    status: One or more devices has been taken offline by the administrator.
            Sufficient replicas exist for the pool to continue functioning in
            a degraded state.
    action: Online the device using 'zpool online' or replace the device with
            'zpool replace'.
     scrub: resilver completed with 0 errors on Tue Jun  6 08:48:41 2006
    config:

            NAME          STATE     READ WRITE CKSUM
            test          DEGRADED     0     0     0
              mirror      DEGRADED     0     0     0
                spare     DEGRADED     0     0     0
                  c0t0d0  OFFLINE      0     0     0
                  c0t2d0  ONLINE       0     0     0
                c0t1d0    ONLINE       0     0     0
            spares
              c0t2d0      INUSE     currently in use

    errors: No known data errors

The offline is optional, but it helps visualize what the pool would look like should an actual device fail. Note that even though the resilver has completed, the 'spare' vdev stays in place (unlike a 'replacing' vdev). This is because the replacement is only temporary. Once the original device is replaced, the spare will be returned to the pool.

Relieving a hot spare

A hot spare can be returned to its previous state by replacing the original faulted drive.
For example:

    # zpool replace test c0t0d0 c0t3d0
    # zpool status test
      pool: test
     state: DEGRADED
     scrub: resilver completed with 0 errors on Tue Jun  6 08:51:49 2006
    config:

            NAME              STATE     READ WRITE CKSUM
            test              DEGRADED     0     0     0
              mirror          DEGRADED     0     0     0
                spare         DEGRADED     0     0     0
                  replacing   DEGRADED     0     0     0
                    c0t0d0    OFFLINE      0     0     0
                    c0t3d0    ONLINE       0     0     0
                  c0t2d0      ONLINE       0     0     0
                c0t1d0        ONLINE       0     0     0
            spares
              c0t2d0          INUSE     currently in use

    errors: No known data errors

    # zpool status test
      pool: test
     state: ONLINE
     scrub: resilver completed with 0 errors on Tue Jun  6 08:51:49 2006
    config:

            NAME        STATE     READ WRITE CKSUM
            test        ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                c0t3d0  ONLINE       0     0     0
                c0t1d0  ONLINE       0     0     0
            spares
              c0t2d0    AVAIL

    errors: No known data errors

The drive is actively being replaced for a short period of time. Once the replacement completes, the old device is removed, and the hot spare is returned to the list of available spares. If you want a hot spare replacement to become permanent, you can 'zpool detach' the original device, at which point the spare will be removed from the hot spare list of any active pools. You can also 'zpool detach' the spare itself to cancel the hot spare operation.

Removing a spare from a pool

To remove a hot spare from a pool, simply use the 'zpool remove' command. For example:

    # zpool remove test c0t2d0
    # zpool status
      pool: test
     state: ONLINE
     scrub: resilver completed with 0 errors on Tue Jun  6 08:51:49 2006
    config:

            NAME        STATE     READ WRITE CKSUM
            test        ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                c0t3d0  ONLINE       0     0     0
                c0t1d0  ONLINE       0     0     0

    errors: No known data errors

Unfortunately, we don't yet support removing anything other than hot spares (it's on our list, we swear). But you can see how hot spares naturally fit into the existing ZFS scheme. Keep in mind that to use hot spares, you will need to upgrade your pools (via 'zpool upgrade') to version 3 or later.

Next Steps

Despite the obvious usefulness of this feature, there is one more step that needs to be done for it to be truly useful.
This involves phase two of the ZFS/FMA integration. Currently, a drive is only considered faulted if it 'goes away' completely (i.e. ldi_open() fails). This covers only a subset of known drive failure modes: it's possible for a drive to continually return errors, and yet still be openable. The next phase of ZFS and FMA integration will introduce a more intelligent diagnosis engine that watches I/O and checksum errors, as well as the SMART predictive failure bit, in order to proactively offline devices when they are experiencing an abnormal number of errors or appear likely to fail. With this functionality, ZFS will be able to respond better to failing drives, thereby making hot spare replacement much more valuable.
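The spare lifecycle walked through above - available, temporarily in use, then either released back to the pool or made permanent - can be summarized as a small state machine. This is purely an illustration of the semantics described in this post, with invented names, not ZFS code:

```python
class Spare:
    """Illustrative model of the hot-spare lifecycle described above."""

    def __init__(self):
        self.state = "AVAIL"            # listed under 'spares', ready for use

    def activate(self):
        # A device fails (or is replaced by hand with 'zpool replace'):
        # the spare goes INUSE and a 'spare' vdev appears in the pool.
        self.state = "INUSE"

    def original_replaced(self):
        # The faulted device is replaced with a new drive: the temporary
        # substitution ends and the spare returns to the available list.
        self.state = "AVAIL"

    def detach_original(self):
        # 'zpool detach' of the original device makes the replacement
        # permanent; the spare leaves the hot spare list entirely.
        self.state = "REMOVED"

s = Spare()
s.activate()
assert s.state == "INUSE"
s.original_replaced()
assert s.state == "AVAIL"
```

The key design point this captures is that a 'spare' vdev, unlike a 'replacing' vdev, does not disappear when the resilver finishes: the substitution stays temporary until the administrator makes it permanent.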



Nobel day 5

Things have picked up over here in Stockholm. The rest of the troops have arrived, filling up the Grand Hotel to the best of our ability. My father's been running around to press conferences, interviews, and various preparations. The rest of us only have one or two events each day to keep track of. Thankfully, my father has a personal attendant to make sure he doesn't miss a thing - not that my mom would ever allow such a thing to happen.

On the 6th, there was a small reception at the Nobel museum. They used the opportunity to show an "instructional" video to make sure we all knew how to behave during the ceremony and the reception. My father even signed one of the cafeteria chairs, apparently a tradition at the museum. It was a real treat to arrive in a limo with dozens of cheering children peering through the windows and waiting to see who got out. My father has even met his adoring fans outside the hotel. The whole experience is rather surreal.

The next day we had a reception for all the science (physics, chemistry, and economics) laureates and their families at the Royal Academy of Science. Besides having the chance to get our whole family together in one place, it was interesting to talk with other families in the same boat. There were a few amusing moments when we were standing with Bob Grubbs and his children, who are all above 6'3" tall (including his daughter Katy); my father, brother, and I are all 6'3" or taller as well. The seven of us are able to make almost anyone feel incredibly short.

Today was the day of lectures. Besides being hung over from some ill-advised late-night wine via room service, these lectures were definitely intended for colleagues, not family members. I'd like to say that I understood my father's lecture, but when you can't pronounce half the words coming out of his mouth, it makes it rather difficult to keep up.
At least there were some molecular diagrams that I could pretend to understand, though even those were quite a bit more complex than the ones I learned in high school chemistry.

Of course, Stockholm is a beautiful city. Lots of time is spent walking around the streets and poking our heads into the little shops. Tomorrow we'll try to hit up a few more museums in the little spare time we have.



Nobel day 2

I'm off on vacation for a while to attend the Nobel Festivities. While this is my second trip to Stockholm, it will be rather different as a Nobel guest staying in a 5-star hotel. Last time I was here, I was a poor, recently graduated college student at the end of his backpacking trip. While pizza and kebabs were nice, I think I'll get a better taste of Swedish food this time around.

I came rather early with my parents, thanks to a convenient business trip to the east coast. So while today is my second day in Stockholm, nothing really starts until tomorrow (when there is a tour of the Nobel museum). Eventually, we'll end up with 28 family and friends here when all is said and done. But arriving with my parents did have its advantages. We got to hang out in a private room in the SAS business lounge during our layover in Newark. When we arrived in Stockholm, we were whisked away from the gate via limousine to a private VIP lounge with a representative of the Royal Academy of Sciences and my father's personal handler. We handed over our passports and baggage tags, and half an hour later we were climbing back into the limo, pre-packed with all our bags.

I haven't done much as of yet, but I did walk around Skansen with my mother, aunt, and uncle. This is probably the last time I'll have a stretch of free time with my parents, as starting tomorrow my father is busy with receptions, meetings, rehearsals, and interviews. I'll leave you with a picture from the top of Skansen, taken at 2:30 PM - we don't get much light here this time of year.

I'm sure I'll have more interesting things to report in the coming days. Be sure to check out videos from last year to get an idea of what I'm in for.



Pool discovery and 'zpool import'

In the later months of ZFS development, the mechanism used to open and import pools was drastically changed. The reasons behind this change make an interesting case study in complexity management, and in how a few careful observations can make all the difference.

The original spa_open()

Skipping past some of the very early prototypes, we'll start with the scheme that Jeff implemented back in 2003. The scheme had a lofty goal: be able to open a pool from an arbitrary subset of devices within the pool. Of course, given limited label space, it was not possible to guarantee this entirely, but we created a scheme which would work in all but the most pessimistic scenarios. Basically, each device was part of a strongly connected graph made up of each toplevel vdev. The graph began with a circle of all the toplevel vdevs. Each device had its parent vdev config, the vdev configs of its nearest neighbors, and up to 5 other vdev configs for good measure. When we went to open the pool, we read in this graph and constructed the complete vdev tree, which we then used to open the pool. You can see the original 'big theory' comment from the top of vdev_graph.c here.

First signs of problems

The simplest problem to notice was that these vdev configs were stored in a special one-off ASCII representation that required a special parser in the kernel. In a world where we were rapidly transitioning to nvlists, this wasn't really a good idea.

The first signs of real problems came when I tried to implement the import functionality. For import, we needed the ability to know whether a pool could be imported without actually opening it. This had existed in a rather disgusting form previously by linking to libzpool (which is the userland port of the SPA and DMU code), and hijacking the versions of these functions to construct a vdev tree in userland. In the new userland model, it wasn't acceptable to use libzpool in this manner.
So I had to construct spa_open_by_dev(), which parsed the configuration into a vdev tree and tried to open the vdevs, but did not actually load the pool metadata. This required a fair amount of hackery, but it was nowhere near as bad as when I had to make the resulting system tolerant of all faults. For both 'zpool import' and 'zpool status', it wasn't enough just to know that a pool couldn't be opened. We needed to know why, and exactly which devices were at fault. While the original scheme worked well when a single device was missing in a toplevel vdev, it failed miserably when an entire toplevel vdev was missing. In particular, it relied on the fact that it had at least a complete ring of toplevel vdevs to work with. For example, if you were missing a single device in an unmirrored pool, there was no way to know what device was missing, because the entire graph parsing algorithm would break down. So I went in and hacked on the code to understand multiple "versions" of a vdev config. If we had two neighbors referring to a missing toplevel vdev, we could surmise its config even though all its devices were absent.

At this point, things were already getting brittle. The code was enormously complex, and hard to change in any well-defined way. On top of that, things got even worse when our test guys started getting really creative. In particular, if you disconnected a device and then exported a pool, or created a new pool over parts of an exported pool, things would get ugly fast. The labels on the disks would be technically valid, but semantically invalid. I had to make even more changes to the algorithms to accommodate all these edge cases. Eventually, it got to the point where every change I made was prefixed by /* XXX kludge */. Jeff and I decided something needed to be done.

On top of all this, the system still had a single point of failure.
Since we didn't want to scan every attached device on boot, we kept an /etc/system file around that described the first device in the pool as a 'hint' for where to get started. If this file was corrupt, or that particular toplevel vdev was not available, the pool configuration could not be determined.

vdev graph 2.0

At this point we had a complex, brittle, and functionally insufficient system for discovering pools. As Jeff, Bill, and I talked about this problem for a while, we made two key observations. First, the kernel doesn't have to parse the graph. We already had the case where we were dependent on a cached file for opening our pool, so why not keep the whole configuration there? The kernel can do (relatively) atomic updates to this file on configuration changes, and then open the resulting vdev tree without having to construct it based on on-disk data. Second, during import we already need to check every device in the system. We don't have to worry about 'discovering' other toplevel vdevs, because we know that we will, by definition, look at each device during the discovery phase.

With these two observations under our belt, we knew what we had to do. For both open and import, the SPA would know only how to take a fully constructed config and parse it into a working vdev tree. Whether that config came from the new /etc/zfs/zpool.cache file, or whether it was constructed during a 'zpool import' scan, it didn't matter. The code would be exactly the same. And best of all, no complicated discovery phase - the config was taken at face value[1]. And the config was simply an nvlist - no more special parsing of custom 'vdev spec' strings.

So how does import work?

The config cache was all well and good, but how would import work? And how would it fare in the face of total device failure? You can find all of the logic for 'zpool import' in libzfs_import.c. Each device keeps the complete config for its toplevel vdev. No neighbors, no graph, no nothing.
During pool discovery, we keep track of all toplevel vdevs for a given pool that we find during a scan. Using some simple heuristics, we construct the 'best' version of the pool's config, and even go through and update the path and devid information based on the unique GUID for each device. Once that's done, we have the same nvlist we would have had if we had read it from zpool.cache. The kernel happily goes off and tries to open the vdevs (if this is just a scan) or opens the pool for real (if we're doing the import).

So what about missing toplevel vdevs? In the online (cached) state, we'll have the complete config and can tell you which device is missing. For the import case, we'll be a little worse off, because we'll never see any vdevs indicating that there is another toplevel vdev in the pool. The most important thing is that we're able to detect this case and error out properly. To do this, we have a 'vdev guid sum' stored in the uberblock that is the sum of the vdev guids of every device in the config. If this doesn't match, we know that we have missed a toplevel vdev somewhere. Unfortunately, we can't tell you which device it is. In the future, we hope to improve this by adding the concept of 'neighbor lists' - arbitrary lists of devices without entire vdev configs. Unlike the previous incarnation of the vdev graph, these will be purely suggestions, not actually required for correctness. There will, of course, be cases where we can never give you enough information about all your neighbors, such as plugging in a single disk from a thousand-disk unreplicated pool.

Conclusions

So what did we learn from this? The main thing is that phrasing the problem slightly differently can cause you to overdesign a system beyond the point of maintainability. As Bryan is fond of saying, one of the most difficult parts of solving hard problems is not just working within a highly constrained environment, but identifying which constraints aren't needed at all.
By realizing that opening a pool is a very different operation from discovering a pool, we were able to redesign our interfaces into a much more robust and maintainable state, with more features to boot. It's no surprise that there have been more than a few putbacks like "SPA 2.0", "DMU 2.0", or "ZIL 3.0". Sometimes you just need to take a hammer to the foundation to incorporate all that you've learned from years of using the previous version.

[1] Actually, Jeff made this a little more robust by adding the config as part of the MOS (meta objset), which is stored transactionally with the rest of the pool data. So even if we add two devices but the /etc cache file doesn't get updated correctly, we'll still be able to open the MOS config and realize that there are two new devices to be had.
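The 'vdev guid sum' check is simple enough to sketch. Assuming, as a simplification of my own rather than a description of the on-disk format, that the uberblock records the plain arithmetic sum of all device GUIDs, detecting a missing toplevel vdev looks like this:

```python
def guid_sum(devices):
    """Sum the GUIDs of every device found during a scan (an illustrative
    simplification of the uberblock's vdev guid sum)."""
    return sum(dev["guid"] for dev in devices)

# The config assembled from whatever labels the scan could find:
found = [{"guid": 0x1111}, {"guid": 0x2222}]

# The guid sum recorded when the pool was healthy included a third
# device that never turned up during discovery.
uberblock_guid_sum = 0x1111 + 0x2222 + 0x3333

# Mismatch: a toplevel vdev is missing, so the import must fail cleanly,
# even though we can't say which device it was.
assert guid_sum(found) != uberblock_guid_sum
```

This is exactly the trade-off described above: the sum is enough to detect that *something* is missing, but not enough to name it, which is what the proposed 'neighbor lists' would improve.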




In this post I'll describe the interactions between ZFS and FMA (Fault Management Architecture). I'll cover the support that's present today, as well as what we're working on and where we're headed.

ZFS Today (phase zero)

The FMA support in ZFS today is what we like to call "phase zero". It's basically the minimal amount of integration needed in order to leverage (arguably) the most useful feature in FMA: knowledge articles. One of the key FMA concepts is to present faults in a readable, consistent manner across all subsystems. Error messages are human readable, contain precise descriptions of what's wrong and how to fix it, and point the user to a website that can be updated more frequently with the latest information. ZFS makes use of this strategy in the 'zpool status' command:

    $ zpool status
      pool: tank
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
            attempt was made to correct the error.  Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool online' or replace the device with 'zpool replace'.
       see: http://www.sun.com/msg/ZFS-8000-9P
     scrub: none requested
    config:

            NAME        STATE     READ WRITE CKSUM
            tank        ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                c1d0s0  ONLINE       0     0     3
                c0d0s0  ONLINE       0     0     0

In this example, one of our disks has experienced multiple checksum errors (because I dd(1)ed over most of the disk), but it was automatically corrected thanks to the mirrored configuration. The error message describes exactly what happened (we tried to self-heal the data from the other side of the mirror) and the impact (none - applications are unaffected). It also directs the user to the appropriate repair procedure, which is either to clear the errors (if they are not indicative of hardware fault) or replace the device.

It's worth noting here the ambiguity of the fault. We don't actually know if the errors are due to bad hardware, transient phenomena, or administrator error.
More on this later.

ZFS tomorrow (phase one)

If we look at the implementation of 'zpool status', we see that there is no actual interaction with the FMA subsystem apart from the link to the knowledge article. There are no generated events or faults, and no diagnosis engine or agent to subscribe to the events. The implementation is entirely static, and contained within libzfs_status.c. This is obviously not an ideal solution. It doesn't give the administrator any notification of errors as they occur, nor does it allow them to leverage other pieces of the FMA framework (such as upcoming SNMP trap support). This is going to be addressed in the near term by the first "real" phase of FMA integration. The goals of this phase, in addition to a number of other fault capabilities under the hood, are the following:

Event generation - I/O errors and vdev transitions will result in true FMA ereports. These will be generated by the SPA and fed through the FMA framework for further analysis.

Simple diagnosis engine - An extremely dumb diagnosis engine will be provided to consume these ereports. It will not perform any sort of predictive analysis, but it will be able to keep track of whether these errors have been seen before, and pass them off to the appropriate agent.

Syslog agent - The results from the DE will be fed to an agent that simply forwards the faults to syslog for the administrator's benefit. This will give the same messages as seen in 'zpool status' (with slightly less information), synchronous with the fault event. Future work to generate SNMP traps will allow the administrator to be emailed or paged, or to implement a poor man's hot spare system.

ZFS in the future (phase next)

Where we go from here is rather an open road. The careful observer will notice that ZFS never makes any determination that a device is faulted due to errors. If a device fails to reopen after an I/O error, we will mark it as faulted, but this only catches cases where a device has gotten completely out of whack.
Even if your device is experiencing a hundred uncorrectable I/O errors per second, ZFS will continue along its merry way, notifying the administrator but otherwise doing nothing. This is not because we don't want to take the device offline; it's just that getting the behavior right is hard.

What we'd like to see is some kind of predictive analysis of the error rates, in an attempt to determine whether a device is truly damaged or whether it was just a random event. The diagnosis engines provided by FMA are designed for exactly this, though the hard part is determining the algorithms for making this distinction. ZFS is both a consumer of FMA faults (in order to take proactive action) as well as a producer of ereports (detecting checksum errors). To be done right, we need to harden all of our I/O drivers to generate proper FMA ereports, implement a generic I/O retire mechanism, and link it in with all the additional data from ZFS. We also want to gather SMART data from the drives to notice correctable errors fixed by the firmware, as well as do experiments to determine the failure rates and pathologies of common storage drives.

As you can tell, this is not easy. I/O fault diagnosis is much more complicated than CPU and memory diagnosis. There are simply more components, as well as more chances for administrative error. But between FMA and ZFS, we have laid the groundwork for a truly fault-tolerant system capable of predictive self-healing and automated recovery. Once we get past the initial phase above, we'll start to think about this in more detail, and make our ideas public as well.



UFS/SVM vs. ZFS: Code Complexity

A lot of comparisons have been done, and will continue to be done, between ZFS and other filesystems. People tend to focus on performance, features, and CLI tools as they are easier to compare. I thought I'd take a moment to look at differences in the code complexity between UFS and ZFS. It is well known within the kernel group that UFS is about as brittle as code can get. 20 years of ongoing development, with feature after feature being bolted on, tends to result in a rather complicated system. Even the smallest changes can have wide-ranging effects, resulting in a huge amount of testing and inevitable panics and escalations. And while SVM is considerably newer, it is a huge beast with its own set of problems. Since ZFS is both a volume manager and a filesystem, we can use this script written by Jeff to count the lines of source code in each component. Not a true measure of complexity, but a reasonable approximation to be sure. Running it on the latest version of the gate yields:

    -------------------------------------------------
      UFS: kernel= 46806   user= 40147   total= 86953
      SVM: kernel= 75917   user=161984   total=237901
    TOTAL: kernel=122723   user=202131   total=324854
    -------------------------------------------------
      ZFS: kernel= 50239   user= 21073   total= 71312
    -------------------------------------------------

The numbers are rather astounding. Having written most of the ZFS CLI, I found the most horrifying number to be the 162,000 lines of userland code to support SVM. This is more than twice the size of all the ZFS code (kernel and user) put together! And in the end, ZFS is about 1/5th the size of UFS and SVM. I wonder what those ZFS numbers will look like in 20 years...



Principles of the ZFS CLI

Well, I'm back. I've been holding off blogging for a while due to ZFS. Now that it's been released, I've got tons of stuff lined up to talk about in the coming weeks.

I first started working on ZFS about nine months ago, and my primary task from the beginning was to redesign the CLI for managing storage pools and filesystems. The existing CLI at the time had evolved rather organically over the previous 3 years, each successive feature providing some immediate benefit but lacking in long-term or overarching design goals. To be fair, it was entirely capable of the job it was intended to do - it just needed to be rethought in the larger scheme of things. I have some plans for detailed blog posts about some of the features, but I thought I'd make the first post a little more general and describe how I approached the CLI design, and some of the major principles behind it.

Simple but powerful

One of the hardest parts of designing an effective CLI is to make it simple enough for new users to understand, but powerful enough so that veterans can tweak everything they need to. With that in mind, we adopted a common design philosophy: "Simple enough for 90% of the users to understand, powerful enough for the other 10% to use."

A good example of this philosophy is the 'zfs list' command. I plan to delve into some of the history behind its development at a later point, but you can quickly see the difference between the two audiences. Most users will just use 'zfs list':

    $ zfs list
    NAME       USED  AVAIL  REFER  MOUNTPOINT
    tank      55.5K  73.9G   9.5K  /tank
    tank/bar     8K  73.9G     8K  /tank/bar
    tank/foo     8K  73.9G     8K  /tank/foo

But a closer look at the usage reveals a lot more power under the hood:

    list [-rH] [-o property[,property]...] [-t type[,type]...] [filesystem|volume|snapshot] ...

In particular, you can ask questions like "what is the amount of space used by all snapshots under tank/home?"
We made sure that sufficient options existed so that power users could script whatever custom tools they wanted.

Solution driven error messages

Having good error messages is a requirement for any reasonably complicated system. The Solaris Fault Management Architecture has proved that users understand and appreciate error messages that tell you exactly what is wrong in plain English, along with how it can be fixed. A great example of this is the 'zpool status' output. Once again, I'll go into some more detail about the FMA integration in a future post, but you can quickly see how basic FMA integration really allows the user to get meaningful diagnostics on their pool:

    $ zpool status
      pool: tank
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
            attempt was made to correct the error.  Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool online' or replace the device with 'zpool replace'.
       see: http://www.sun.com/msg/ZFS-8000-9P
     scrub: none requested
    config:

            NAME        STATE     READ WRITE CKSUM
            tank        ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                c1d0s0  ONLINE       0     0     3
                c0d0s0  ONLINE       0     0     0

Consistent command syntax

When it comes to command line syntax, everyone seems to have a different idea of what makes the most sense. When we started redesigning the CLI, we took a look at a bunch of other tools in Solaris, focusing on some of the more recent ones which had undergone a more rigorous design. In the end, our primary source of inspiration was the SMF (Service Management Facility) commands. To that end, every zfs(1M) and zpool(1M) command has the following syntax:

    <command> <verb> <options> <noun> ...

There are no "required options". We tried to avoid positional parameters at all costs, but there are certain subcommands (zfs get, zfs clone, zpool replace, etc.) that fundamentally require multiple operands.
In these cases, we try to direct the user with informative error messages indicating that they may have forgotten a parameter:

    # zpool create c1d0 c0d0
    cannot create 'c1d0': pool name is reserved
    pool name may have been omitted

If you mistype something and find that the error message is confusing, please let us know - we take error messages very seriously. We've already had some feedback for certain commands (such as 'zfs clone') that we're working on.

Modular interface design

On a source level, the initial code had some serious issues around interface boundaries. The problem is that the user/kernel interface is managed through ioctl(2) calls to /dev/zfs. While this is a perfectly fine solution, we wound up with multiple consumers all issuing these ioctl() calls directly, making it very difficult to evolve this interface cleanly. Since we knew that we were going to have multiple userland consumers (zpool and zfs), it made much more sense to construct a library (libzfs) which was responsible for managing this direct interface, and have it present a unified object-based access method for consumers. This allowed us to centralize logic in one place, and the commands themselves became little more than glorified argument parsers around this library.



Where have I been?

It's been almost a month since my last blog post, so I thought I'd post an update. I spent the month of July in Massachusetts, alternately on vacation, working remotely, and attending my brother's wedding. The rest of the LAE (Linux Application Environment) team joined me (and Nils) for a week out there, and we made some huge progress on the project. For the curious, we're still working out how best to leverage OpenSolaris to help the project and the community, at which point we can go into more detail about what the final product will look like. Until then, suffice to say "we're working on it". All this time on LAE did prevent me from spending time with my other girlfriend, ZFS. Since getting back, I've caught up with most of the ZFS work in my queue, and the team has made huge progress on ZFS in my absence. As much as I'd like to talk about details (or a schedule), I can't :-( But trust me, you'll know when ZFS integrates into Nevada; there are many bloggers who will not be so quiet when that putback notice comes by. Not to mention that the source code will hit OpenSolaris shortly thereafter.

Tomorrow I'll be up at LinuxWorld, hanging out at the booth with Ben and hosting the OpenSolaris BOF along with Adam and Bryan (Dan will be there as well, though he didn't make the "official" billing). Whether you know nothing about OpenSolaris or are one of our dedicated community members, come check it out.



Operating system tunables

There's an interesting discussion over at opensolaris-code, spawned from an initial request to add some tunables to Solaris /proc. This exposes a few very important philosophical differences between Solaris and other operating systems out there. I encourage you to read the thread in its entirety, but here's an executive summary:

When possible, the system should be auto-tuning - If you are creating a tunable to control fine-grained behavior of your program or operating system, you should first ask yourself: "Why does this tunable exist? Why can't I just pick the best value?" More often than not, you'll find the answer is "Because I'm lazy" or "The problem is too hard." Only in rare circumstances is there ever a definite need for a tunable, and it should almost always control coarse on/off behavior.

If a tunable is necessary, it should be as specific as possible - The days of dumping every tunable under the sun into /etc/system are over. Very rarely do tunables need to be system-wide. Most tunables should be per process, per connection, or per filesystem. We are continually converting our old system-wide tunables into per-object controls.

Tunables should be controlled by a well-defined interface - /etc/system and /proc are not your personal landfills. /etc/system is by nature undocumented, and designing it as your primary interface is fundamentally wrong. /proc is well documented, but it's also well defined to be a process filesystem; besides the enormous breakage you'd introduce by adding /proc/tunables, it's philosophically wrong. The /system directory is a slightly better choice, but it's intended primarily for observability of subsystems that translate well to a hierarchical layout. In general, we don't view filesystems as a primary administrative interface, but as a programmatic API upon which more sophisticated tools can be built.

One of the best examples of these principles can be seen in the updated System V IPC tunables.
Dave Powell rewrote this arcane set of /etc/system tunables during the course of Solaris 10. Many of the tunables were made auto-tuning, and those that couldn't be were converted into resource controls administered on a per-process basis using standard Solaris administrative tools. Hopefully Dave will blog at some point about this process, the decisions he made, and why.

There are, of course, always going to be exceptions to the above rules. We still have far too many documented /etc/system tunables in Solaris today, and there will always be some that are absolutely necessary. But our philosophy is focused around these principles, as illustrated by the following story from the discussion thread:

Indeed, one of the more amusing stories was a Platinum Beta customer showing us some slideware from a certain company comparing their OS against Solaris. The slides were discussing available tunables, and the basic gist was something like: "We used to have way fewer tunables than Solaris, but now we've caught up and have many more than they do. Our OS rules!"

Needless to say, we thought the company was missing the point.

Tags: OpenSolaris



A parting MDB challenge

Like most of Sun's US employees, I'll be taking the next week off for vacation. On top of that, I'll be back in my hometown in MA for the next few weeks, alternately working remotely and attending my brother's wedding. I'll leave you with an MDB challenge, this time much more involved than past "puzzles". I don't have any prizes lying around, but this one would certainly be worth one if I had anything to give.

So what's the task? To implement munges as a dcmd. Here's the complete description:

Implement a new dcmd, ::stacklist, that will walk all threads (or all threads within a specific process when given a proc_t address) and summarize the different stacks by frequency. By default, it should display output identical to 'munges':

    > ::stacklist
    73  ##################################  tp: fffffe800000bc80
        swtch+0xdf()
        cv_wait+0x6a()
        taskq_thread+0x1ef()
        thread_start+8()

    38  ##################################  tp: ffffffff82b21880
        swtch+0xdf()
        cv_wait_sig_swap_core+0x177()
        cv_wait_sig_swap+0xb()
        cv_waituntil_sig+0xd7()
        lwp_park+0x1b1()
        syslwp_park+0x4e()
        sys_syscall32+0x1ff()
    ...

The first number is the frequency of the given stack, and the 'tp' pointer should be a representative thread of the group. The stacks should be organized by frequency, with the most frequent ones first. When given the '-v' option, the dcmd should print out all threads containing the given stack trace. For extra credit, the ability to walk all threads with a matching stack (addr::walk samestack) would be nice.

This is not an easy dcmd to write, at least when done correctly. The first key is to use as little memory as possible: this dcmd must be capable of being run within kmdb(1M), where we have limited memory available. The second key is to leverage existing MDB functionality without duplicating code. You should not be copying code from ::findstack or ::stack into your dcmd. Ideally, you should be able to invoke ::findstack without worrying about its inner workings.
Alternatively, restructuring the code to share a common routine would also be acceptable.

This command would be hugely beneficial when examining system hangs or other "soft failures," where there is no obvious culprit (such as a panicking thread). Having this functionality in KMDB (where we cannot invoke 'munges') would make debugging a whole class of problems much easier. This is also a great RFE to get started with OpenSolaris. It is self-contained, low-risk, but non-trivial, and it gets you familiar with MDB at the same time. Personally, I have always found the observability tools a great place to start working on Solaris, because the risk is low while still requiring (hence learning) internal knowledge of the kernel.

If you do manage to write this dcmd, please email me (Eric dot Schrock at sun dot com) and I will gladly be your sponsor to get it integrated into OpenSolaris. I might even be able to dig up a prize somewhere...



Virtualization and OpenSolaris

There's actually a decent piece over at eWeek discussing the future of Xen and LAE (the project formerly known as Janus) on OpenSolaris. Now that our marketing folks are getting the right message out there about what we're trying to accomplish, I thought I'd follow up with a little technical background on virtualization and why we're investing in these different technologies. Keep in mind that these are my personal beliefs based on interactions with customers and other Solaris engineers. Any resemblance to a corporate strategy is purely coincidental ;-)

Before diving in, I should point out that this will be a rather broad coverage of virtualization strategies. For a more detailed comparison of Zones and Jails in particular, check out James Dickens' Zones comparison chart.

Benefits of Virtualization

First off, virtualization is here to stay. Our customers need virtualization - it dramatically reduces the cost of deploying and maintaining multiple machines and applications. The success of companies such as VMware is proof enough that such a market exists, though we have been hearing it from our customers for a long time. What we find, however, is that customers are often confused about exactly what they're trying to accomplish, and companies try to pitch a single solution to virtualization problems without recognizing that more appropriate solutions may exist. The most common need for virtualization (as judged by our customer base) is application consolidation. Many of the larger apps have become so complex that they become a system in themselves - and often they don't play nicely with other applications on the box. So "one app per machine" has become the common paradigm. The second most common need is security, either for your application administrators or your developers. Other reasons certainly exist (rapid test environment deployment, distributed system simulation, etc.), but these are the two primary ones.

So what does virtualization buy you?
It's all about reducing costs, but there are really two types of cost associated with running a system:

Hardware costs - This includes the cost of the machine, but also the costs associated with running that machine (power, A/C).

Software management costs - This includes the cost of deploying new machines, upgrading/patching software, and observing software behavior.

As we'll see, different virtualization strategies provide different qualities of the above savings.

Hardware virtualization

One of the most well-established forms of virtualization, the most common examples today are Sun Domains and IBM Logical Partitions. In each case, the hardware is responsible for dividing existing resources in such a way as to present multiple machines to the user. This has the advantage of requiring no software layer, no performance impact, and hardware fault isolation. The downside is that it requires specialized hardware that is extremely expensive, and it provides zero benefit for reducing software management costs.

Software machine virtualization

This approach is probably the one most commonly associated with the term "virtualization". In this scheme, a software layer is created which allows multiple OS instances to run on the same hardware. The most commercialized versions are VMware and Virtual PC, but other projects exist (such as qemu and PearPC). Typically, they require a "host" operating system as well as multiple "guests" (although VMware ESX Server runs a custom kernel as the host). While Xen uses a paravirtualization technique that requires changes to the guest OS, it is still fundamentally a machine virtualization technique. And Usermode Linux takes a radically different approach, but accomplishes the same basic task. In the end, this approach has similar strengths and weaknesses as hardware-assisted virtualization.
You don't have to buy expensive special-purpose hardware, but you give up the hardware fault isolation and often sacrifice performance (Xen's approach lessens this impact, but it's still visible). But most importantly, you still don't save any costs associated with software management - administering software on 10 virtual machines is just as expensive as administering 10 separate machines. And you have no visibility into what's happening within the virtual machine - you may be able to tell that Xen is consuming 50% of your CPU, but you can't tell why unless you log into the virtual system itself.

Software application virtualization

On the grand scale of virtualization, this ranks as the "least virtualized". With this approach, the operating system uses various tricks and techniques to present an alternate view of the machine. This can range from simple chroot(1), to BSD Jails, to Solaris Zones. Each of these provides a more complete OS view with varying degrees of isolation. While Zones is the most complete and the most secure, they all use the same fundamental idea of a single operating system presenting an "alternate reality" that appears to be a complete system at the application level. The upcoming Linux Application Environment on OpenSolaris will take this approach by leveraging Zones and emulating Linux at the system call layer. The most significant downside to this approach is the fact that there is a single kernel. You cannot run different operating systems (though LAE will add an interesting twist), and the "guest" environments have limited access to hardware facilities. On the other hand, this approach results in huge savings on the software management front. Because applications are still processes within the host environment, you have total visibility into what is happening within each guest, using standard operating system tools, and you can manage them as you would any other processes, using standard resource management tools.
You can deploy, patch, and upgrade software from a single point without having to physically log into each machine. While not all applications will run in such a reduced environment, those that do will benefit from vastly simplified software management. This approach also has the added bonus that it tends to make better use of shared resources. In Zones, for example, the most common configuration includes a shared /usr directory, so that no additional disk space is needed (and only one copy of each library needs to be resident in memory).

OpenSolaris virtualization in the future

So what does this all mean for OpenSolaris? Why are we continuing to pursue Zones, LAE, and Xen? The short answer is "because our customers want us to." And hopefully, from what's been said above, it's obvious that there is no one virtualization strategy that is correct for everyone. If you want to consolidate servers running a variety of different operating systems (including older versions of Solaris), then Xen is probably the right approach. If you want to consolidate machines running Solaris applications, then Zones is probably your best bet. If you require the ability to survive hardware faults between virtual machines, then Domains is the only choice. If you want to take advantage of Solaris FMA and performance, but still want to run the latest and greatest from RedHat with support, then Xen is your option. If you have 90% of your applications on Solaris, and you're just missing that one last app, then LAE is for you. Similarly, if you have a Linux app that you want to debug with DTrace, you can leverage LAE without having to port to Solaris first.

With respect to Linux virtualization in particular, we are always going to pursue ISV certification first. No one at Sun wants you to run Oracle under LAE or Xen. Given the choice, we will always aggressively pursue ISVs to do a native port to Solaris.
But we understand that there is an entire ecosystem of applications (typically in-house apps) that just won't run on Solaris x86. We want users to have a choice between virtualization options, and we want all those options to be a fundamental part of the operating system.

I hope that helps clear up the grand strategy. There will always be people who disagree with this vision, but we honestly believe we're making the best choices for our customers.

You may note that I failed to mention cross-architecture virtualization. This is most common at the system level (like PearPC), but application-level solutions do exist (including Apple's upcoming Rosetta). This type of virtualization simply doesn't factor into our plans yet, and still falls under the umbrella of one of the broad virtualization types. I also apologize for any virtualization projects out there that I missed. There are undoubtedly many more, but the ones mentioned above serve to illustrate my point.

Tags: OpenSolaris, Zones



Fun source code facts

A while ago, for my own amusement, I went through the Solaris source base and searched for the source files with the most lines. For some unknown reason this popped into my head yesterday, so I decided to try it again. Here are the top 10 longest files in OpenSolaris:

    Length  Source File
    29944   usr/src/uts/common/io/scsi/targets/sd.c
    25920   [closed]
    25429   usr/src/uts/common/inet/tcp/tcp.c
    22789   [closed]
    16954   [closed]
    16339   [closed]
    15667   usr/src/uts/common/fs/nfs4_vnops.c
    14550   usr/src/uts/sfmmu/vm/hat_sfmmu.c
    13931   usr/src/uts/common/dtrace/dtrace.c
    13027   usr/src/uts/sun4u/starfire/io/idn_proto.c

You can see that some of the largest files are still closed source. Note that the length of a file doesn't necessarily indicate anything about the quality of the code; it's more just idle curiosity. Knowing the quality of online journalism these days, I'm sure this will get turned into "Solaris source reveals completely unmaintainable code"...

After looking at this, I decided a much more interesting question was "which source files are the most commented?" To answer this question, I ran every source file through a script I found that counts the number of commented lines in each file. I filtered out those files that were less than 500 lines long, and ran the results through another script to calculate the percentage of lines that were commented. Lines which have a comment along with source are considered commented lines, so some of the ratios were quite high. I filtered out those files which were mostly tables (like uwidth.c), as these comments didn't really count. I also ignored header files, because they tend to be far more commented than the implementation itself.
In the end I had the following list:

    Percentage  File
    62.9%       usr/src/cmd/cmd-inet/usr.lib/mipagent/snmp_stub.c
    58.7%       usr/src/cmd/sgs/libld/amd64/amd64unwind.c
    58.4%       usr/src/lib/libtecla/common/expand.c
    56.7%       usr/src/cmd/lvm/metassist/common/volume_nvpair.c
    56.6%       usr/src/lib/libtecla/common/cplfile.c
    55.6%       usr/src/lib/libc/port/gen/mon.c
    55.4%       usr/src/lib/libadm/common/devreserv.c
    55.1%       usr/src/lib/libtecla/common/getline.c
    54.5%       [closed]
    54.3%       usr/src/uts/common/io/ib/ibtl/ibtl_mem.c

Now, when I write code I tend to hover in the 20-30% comments range (my best of those in the gate is gfs.c, which with Dave's help is 44% comments). Some of the above are rather over-commented (especially snmp_stub.c, which likes to repeat comments above and within functions). I found this little experiment interesting, but please don't base any conclusions on these results. They are for entertainment purposes only.

Technorati Tag: OpenSolaris



Adding a kernel module to OpenSolaris

On opening day, I chose to post an entry on adding a system call to OpenSolaris. Considering the feedback, I thought I'd continue with brief "How to add to OpenSolaris" documents for a while. There's a lot to choose from here, so I'll just pick them off as quick as I can. Today's topic is adding a new kernel module to OpenSolaris. For the sake of discussion, we will be adding a new module that does nothing apart from print a message on load and unload. It will be architecture-neutral, and be distributed as part of a separate package (to give you a taste of our packaging system). We'll continue my narcissistic tradition and name this the "schrock" module.

1. Adding source

To begin, you must put your source somewhere in the tree. It must be put somewhere under usr/src/uts/common, but exactly where depends on the type of module. Just about the only real rule is that filesystems go in the "fs" directory, but other than that there are no real rules. The bulk of the modules live in the "io" directory, since the majority of modules are drivers of some kind. For now, we'll put 'schrock.c' in the "io" directory:

    #include <sys/modctl.h>
    #include <sys/cmn_err.h>

    static struct modldrv modldrv = {
            &mod_miscops,
            "schrock module %I%",
            NULL
    };

    static struct modlinkage modlinkage = {
            MODREV_1, (void *)&modldrv, NULL
    };

    int
    _init(void)
    {
            cmn_err(CE_WARN, "OpenSolaris has arrived");
            return (mod_install(&modlinkage));
    }

    int
    _fini(void)
    {
            cmn_err(CE_WARN, "OpenSolaris has left the building");
            return (mod_remove(&modlinkage));
    }

    int
    _info(struct modinfo *modinfop)
    {
            return (mod_info(&modlinkage, modinfop));
    }

The code is pretty simple, and is basically the minimum needed to add a module to the system. You'll notice we use 'mod_miscops' in our modldrv. If we were adding a device driver or filesystem, we would use a different set of linkage structures.

2. Creating Makefiles

We must add two Makefiles to get this building:

    usr/src/uts/intel/schrock/Makefile
    usr/src/uts/sparc/schrock/Makefile

With contents similar to the following:

    UTSBASE = ../..

    MODULE          = schrock
    OBJECTS         = $(SCHROCK_OBJS:%=$(OBJS_DIR)/%)
    LINTS           = $(SCHROCK_OBJS:%.o=$(LINTS_DIR)/%.ln)
    ROOTMODULE      = $(ROOT_MISC_DIR)/$(MODULE)

    include $(UTSBASE)/intel/Makefile.intel

    ALL_TARGET      = $(BINARY)
    LINT_TARGET     = $(MODULE).lint
    INSTALL_TARGET  = $(BINARY) $(ROOTMODULE)

    CFLAGS += $(CCVERBOSE)

    .KEEP_STATE:

    def:            $(DEF_DEPS)
    all:            $(ALL_DEPS)
    clean:          $(CLEAN_DEPS)
    clobber:        $(CLOBBER_DEPS)
    lint:           $(LINT_DEPS)
    modlintlib:     $(MODLINTLIB_DEPS)
    clean.lint:     $(CLEAN_LINT_DEPS)
    install:        $(INSTALL_DEPS)

    include $(UTSBASE)/intel/Makefile.targ

3. Modifying existing Makefiles

There are two remaining Makefile chores before we can continue. First, we have to add the set of files to usr/src/uts/common/Makefile.files:

    KMDB_OBJS += kdrv.o

    SCHROCK_OBJS += schrock.o

    BGE_OBJS += bge_main.o bge_chip.o bge_kstats.o bge_log.o bge_ndd.o \
            bge_atomic.o bge_mii.o bge_send.o bge_recv.o

If you had created a subdirectory for your module instead of placing it in "io", you would also have to add a set of rules to usr/src/uts/common/Makefile.rules. If you need to do this, make sure you get both the object targets and the lint targets, or you'll get build failures if you try to run lint. You'll also need to modify the usr/src/uts/intel/Makefile.intel file, as well as the corresponding SPARC version:

    MISC_KMODS      += usba usba10
    MISC_KMODS      += zmod
    MISC_KMODS      += schrock

    #
    #       Software Cryptographic Providers (/kernel/crypto):
    #

4. Creating a package

As mentioned previously, we want this module to live in its own package. We start by creating usr/src/pkgdefs/SUNWschrock and adding it to the list of COMMON_SUBDIRS in usr/src/pkgdefs/Makefile:

            SUNWsasnm \
            SUNWsbp2 \
            SUNWschrock \
            SUNWscpr \
            SUNWscpu \

Next, we have to add a skeleton package system. Since we're only adding a miscellaneous module and not a full-blown driver, we only need a simple skeleton. First, there's the Makefile:

    include ../Makefile.com

    .KEEP_STATE:

    all: $(FILES)

    install: all pkg

    include ../Makefile.targ

A 'pkginfo.tmpl' file:

    PKG=SUNWschrock
    NAME="Sample kernel module"
    ARCH="ISA"
    VERSION="ONVERS,REV=0.0.0"
    SUNW_PRODNAME="SunOS"
    SUNW_PRODVERS="RELEASE/VERSION"
    SUNW_PKGVERS="1.0"
    SUNW_PKGTYPE="root"
    MAXINST="1000"
    CATEGORY="system"
    VENDOR="Sun Microsystems, Inc."
    DESC="Sample kernel module"
    CLASSES="none"
    HOTLINE="Please contact your local service provider"
    EMAIL=""
    BASEDIR=/
    SUNW_PKG_ALLZONES="true"
    SUNW_PKG_HOLLOW="true"

And 'prototype_com', 'prototype_i386', and 'prototype_sparc' (elided) files:

    # prototype_i386
    !include prototype_com
    d none kernel/misc/amd64 755 root sys
    f none kernel/misc/amd64/schrock 755 root sys

    # prototype_com
    i pkginfo
    d none kernel 755 root sys
    d none kernel/misc 755 root sys
    f none kernel/misc/schrock 755 root sys

5. Putting it all together

If we pkgadd our package, or BFU to the resulting archives, we can see our module in action:

    halcyon# modload /kernel/misc/schrock
    Jun 19 12:43:35 halcyon schrock: WARNING: OpenSolaris has arrived
    halcyon# modunload -i 197
    Jun 19 12:43:50 halcyon schrock: WARNING: OpenSolaris has left the building

This process is common to all kernel modules (though packaging is simpler for those combined in SUNWckr, for example). Things get a little more complicated and a little more specific when you begin to talk about drivers or filesystems in particular. I'll try to create some simple howtos for those as well.

Technorati Tag: OpenSolaris



GDB to MDB Migration, Part Two

So talking to Ben last night convinced me I needed to finish up the GDB to MDB reference that I started last month. So here's part two.

| GDB | MDB | Description |
|---|---|---|
| **Program Stack** | | |
| backtrace n | ::stack / $C | Display stack backtrace for the current thread. |
| - | thread::findstack -v | Display a stack for a given thread. In the kernel, thread is the address of the kthread_t. In userland, it's the thread identifier. |
| info ... | - | Display information about the current frame. MDB doesn't support the debugging data necessary to maintain the frame abstraction. |
| **Execution Control** | | |
| continue / c | :c | Continue target. |
| stepi / si | ::step / ] | Step to the next machine instruction. MDB does not support stepping by source lines. |
| nexti / ni | ::step over / [ | Step over the next machine instruction, skipping any function calls. |
| finish | ::step out | Continue until returning from the current frame. |
| jump \*address | address>reg | Jump to the given location. In MDB, reg depends on your platform. For SPARC it's 'pc', for i386 it's 'eip', and for amd64 it's 'rip'. |
| **Display** | | |
| print expr | addr::print expr | Print the given expression. In GDB you can specify variable names as well as addresses. For MDB, you give a particular address and then specify the type to display (which can include dereferencing of members, etc.). |
| print /f | addr/f | Print data in a precise format. See ::formats for a list of MDB formats. |
| disassem addr | addr::dis | Disassemble text at the given address, or the current PC if no address is specified. |

This is just a primer. Both programs support a wide variety of additional options. Running 'mdb -k', you can quickly see just how many commands are out there:

    > ::dcmds ! wc -l
         385
    > ::walkers ! wc -l
         436

One helpful trick is ::dcmds ! grep thing, which searches the description of each command. Good luck, and join the discussion over at the OpenSolaris MDB community if you have any questions or tips of your own.

Technorati tags: MDB, OpenSolaris, Solaris



How to add a system call to OpenSolaris

When I first started in the Solaris group, I was faced with two equally difficult tasks: learning the development model, and understanding the source code. For both these tasks, the recommended method is usually picking a small bug and working through the process. For the curious, the first bug I putback to ON was 4912227 (ptree call returns zero on failure), a simple bug with near zero risk. It was the first step down a very long road.

As another first step, someone suggested adding a very simple system call to the kernel. This turned out to be a whole lot harder than one would expect, and has so many subtle aspects that experienced Solaris engineers (myself included) still miss some of the necessary changes. With that in mind, I thought a reasonable first OpenSolaris blog would be describing exactly how to add a new system call to the kernel.

For the purposes of this post, we will assume that it's a simple system call that lives in the generic kernel code, and we'll put the code into an existing file to avoid having to deal with Makefiles. The goal is to print an arbitrary message to the console whenever the system call is issued.

1. Picking a syscall number

Before writing any real code, we first have to pick a number that will represent our system call. The main source of documentation here is syscall.h, which describes all the available system call numbers, as well as which ones are reserved. The maximum number of syscalls is currently 256 (NSYSCALL), which doesn't leave much space for new ones. This could theoretically be extended - I believe the hard limit is in the size of sysset_t, whose 16 integers must be able to represent a complete bitmask of all system calls. This puts our actual limit at 16*32, or 512, system calls. But for the purposes of our tutorial, we'll pick system call number 56, which is currently unused. For my own amusement, we'll name our (my?) system call 'schrock'.
So first we add the following line to syscall.h:

    #define SYS_uadmin      55
    #define SYS_schrock     56
    #define SYS_utssys      57

2. Writing the syscall handler

Next, we have to actually add the function that will get called when we invoke the system call. What we should really do is add a new file schrock.c to usr/src/uts/common/syscall, but I'm trying to avoid Makefiles. Instead, we'll just stick it in getpid.c:

    #include <sys/cmn_err.h>

    int
    schrock(void *arg)
    {
        char    buf[1024];
        size_t  len;

        if (copyinstr(arg, buf, sizeof (buf), &len) != 0)
            return (set_errno(EFAULT));

        cmn_err(CE_WARN, "%s", buf);

        return (0);
    }

Note that declaring a buffer of 1024 bytes on the stack is a very bad thing to do in the kernel. We have limited stack space, and a stack overflow will result in a panic. We also don't check that the length of the string was less than our scratch space. But this will suffice for illustrative purposes. The cmn_err() function is the simplest way to display messages from the kernel.

3. Adding an entry to the syscall table

We need to place an entry in the system call table. This table lives in sysent.c, and makes heavy use of macros to simplify the source. Our system call takes a single argument and returns an integer, so we'll need to use the SYSENT_CI macro. We need to add a prototype for our syscall, and add an entry to the sysent and sysent32 tables:

    int     rename();
    void    rexit();
    int     schrock();
    int     semsys();
    int     setgid();

    /* ... */

        /* 54 */ SYSENT_CI("ioctl",        ioctl,      3),
        /* 55 */ SYSENT_CI("uadmin",       uadmin,     3),
        /* 56 */ SYSENT_CI("schrock",      schrock,    1),
        /* 57 */ IF_LP64(
                    SYSENT_2CI("utssys",   utssys64,   4),
                    SYSENT_2CI("utssys",   utssys32,   4)),

    /* ... */

        /* 54 */ SYSENT_CI("ioctl",        ioctl,      3),
        /* 55 */ SYSENT_CI("uadmin",       uadmin,     3),
        /* 56 */ SYSENT_CI("schrock",      schrock,    1),
        /* 57 */ SYSENT_2CI("utssys",      utssys32,   4),

4. /etc/name_to_sysnum

At this point, we could write a program to invoke our system call, but the point here is to illustrate everything that needs to be done to integrate a system call, so we can't ignore the little things. One of these little things is /etc/name_to_sysnum, which provides a mapping between system call names and numbers, and is used by dtrace(1M), truss(1), and friends. Of course, there is one version for x86 and one for SPARC, so you will have to add the following lines to both the intel and SPARC versions:

    ioctl           54
    uadmin          55
    schrock         56
    utssys          57
    fdsync          58

5. truss(1)

Truss does fancy decoding of system call arguments. In order to do this, we need to maintain a table in truss that describes the type of each argument for every syscall. This table is found in systable.c. Since our syscall takes a single string, we add the following entry:

    {"ioctl",   3, DEC, NOV, DEC, IOC, IOA},            /* 54 */
    {"uadmin",  3, DEC, NOV, DEC, DEC, DEC},            /* 55 */
    {"schrock", 1, DEC, NOV, STG},                      /* 56 */
    {"utssys",  4, DEC, NOV, HEX, DEC, UTS, HEX},       /* 57 */
    {"fdsync",  2, DEC, NOV, DEC, FFG},                 /* 58 */

Don't worry too much about the different constants. But be sure to read up on the truss source code if you're adding a complicated system call.

6. proc_names.c

This is the file that gets missed the most often when adding a new syscall. Libproc uses the table in proc_names.c to translate between system call numbers and names. Why it doesn't make use of /etc/name_to_sysnum is anybody's guess, but for now you have to update the systable array in this file:

        "ioctl",        /* 54 */
        "uadmin",       /* 55 */
        "schrock",      /* 56 */
        "utssys",       /* 57 */
        "fdsync",       /* 58 */

7. Putting it all together

Finally, everything is in place. We can test our system call with a simple program:

    #include <sys/syscall.h>

    int
    main(int argc, char **argv)
    {
        syscall(SYS_schrock, "OpenSolaris Rules!");
        return (0);
    }

If we run this on our system, we'll see the following output on the console:

    June 14 13:42:21 halcyon genunix: WARNING: OpenSolaris Rules!

Because we did all the extra work, we can actually observe the behavior using truss(1), mdb(1), or dtrace(1M). As you can see, adding a system call is not as easy as it should be. One of the ideas that has been floating around for a while is the Grand Unified Syscall(tm) project, which would centralize all this information as well as provide type information for the DTrace syscall provider. But until that happens, we'll have to deal with this process.

Technorati Tag: OpenSolaris
Technorati Tag: Solaris



FISL Final Day

The last day of FISL has come and gone, thankfully. I'm completely drained, both physically and mentally. As you can probably tell from the comments on yesterday's blog entry, we had quite a night out last night in Porto Alegre. I didn't stay out quite as late as some of the Brazil guys, but Ken and I made it back in time to catch about 4 hours of sleep before heading off to the conference. Thankfully I remembered to set my alarm, otherwise I probably would have ended up in bed until the early afternoon. The full details of the night are better told in person...

This last day was significantly quieter than previous days. With the conference winding down, I assume that many people took off early. Most of our presentations today were to an audience of 2 or 3 people, and we even had to cancel some of the early ones as no one was there. I managed to give presentations for Performance, Zones, and DTrace, despite my complete lack of sleep. The DTrace presentation was particularly rough because it's primarily demo-driven, with no set plan. This turns out to be rather difficult after a night of no sleep and a few too many caipirinhas.

The highlight of the day was when a woman (stunningly beautiful, of course) came up to me while I was sitting in one of the chairs and asked to take a picture with me. We didn't talk at all, and I didn't know who she was, but she seemed psyched to be getting her picture taken with someone from Sun. I just keep telling myself that it was my stunning good looks that resulted in the picture, not my badge saying "Sun Microsystems". I can dream, can't I?

Tomorrow begins the 24 hours of travelling to get me back home. I can't wait to get back to my own apartment and a normal lifestyle.



FISL Day 3

The exhaustion continues to increase. Today I did 3 presentations: DTrace, Zones, and FMA (which turned into OpenSolaris). Every one took up the full hour allotted. And tomorrow I'm going to add a Solaris performance presentation, to bring the grand total to 4 hours of presentations. Given how bad the acoustics are on the exposition floor, my goal is to lose my voice by the end of the night. So far, I've settled into a schedule: wake up around 7:00, check email, work on slides, eat breakfast, then get to the conference around 8:45. After a full day of talking and giving presentations, I get back to the hotel around 7:45 and do about an hour of work/email before going out to dinner. We get back from dinner around 11:30, at which point I get to blogging and finishing up some work. Eventually I get to sleep around 1:00, at which point I have to do the whole thing again the next day. Thank god tomorrow is the end; I don't know how much more I can take.

Today's highlight was when Dimas (from Sun Brazil) began an impromptu Looking Glass demo towards the end of the day. He ended up overflowing our booth with at least 40 people for a solid hour before the commotion started to die down. Those of us sitting in the corner were worried we'd have to leave to make room. Our Solaris presentations hit 25 or so people, but never so many for so long. The combination of cool eye candy and a native Portuguese speaker really helped out (though most people probably couldn't hear him anyway).

Other highlights included hanging out with the folks at CodeBreakers, who really seem to dig Solaris (Thiago had S10 installed on his laptop within half a day). We took some pictures with them (which Dave should post soon), and are going out for barbeque and drinks tonight with them and 100+ other open source Brazil folks. I also helped a few other people get Solaris 10 installed on their laptops (mostly just the "disable USB legacy support" problem). It's unbelievably cool to see the results of handing out Solaris 10 DVDs before even leaving the conference. The top Solaris presentations were understandably DTrace and Zones, though the booth was pretty well packed all day.

Let's hope the last day is as good as the rest. Here's to Software Livre!



FISL Day 2

Another day at FISL, another day full of presentations. Today we did mini-presentations every hour on the hour, most of which were very well attended. When we overlapped with the major keynote sessions, turnout tended to be low, but other than that it was very successful. We covered OpenSolaris, DTrace, FMA, SMF, and Security, as well as a Java presentation (by Charlie, not Dave or myself). As usual, lots of great questions from the highly technical audience.

The highlight today was a great conversation with a group of folks very interested in starting an OpenSolaris users group in Brazil. Extremely nice group of guys, very interested in technology and in helping OpenSolaris build a greater presence in Brazil (both through user groups and Solaris attendance at conferences). I have to say that after experiencing this conference and seeing the enthusiasm that everyone has for exciting technology and open source, I have to agree that Brazil is a great place to focus our OpenSolaris presence. Hopefully we'll see user groups pop up here as well as the rest of the world. We'll be doing everything we can to help from within Sun.

The other, more amusing, highlight of the day was during my DTrace demonstration. I needed an interesting Java application to demonstrate the jstack() DTrace action, so I started up the only Java application (apart from some internal Sun tools) that I use on a regular basis: Yahoo! Sports Fantasy Baseball StatTracker (the classic version, not the new flash one). I tried to explain that maybe I was trying to debug why the app was lying to me about Tejada going 0-2 so far in the Sox/Orioles game; really he should have hit two homers and I should be dominating this week's scores1. I was rather amused, but I think the cultural divide was a little too wide. Not only baseball, but fantasy baseball: I don't blame the audience at all.

Technorati tags: Solaris, OpenSolaris

1 This is clearly a lie. Despite any dreams of fantasy baseball domination, I would never root for my players in a game over the Red Sox. In the end, Ryan's 40.5 ERA was worth the bottom-of-the-ninth comeback capped by Ortiz's 3-run shot.



FISL Day 1

So the first day of FISL has come to a close. I have to say it went better than expected, based on the quality of questions posed by the audience and visitors to the Sun booth. If today is any indication, my voice is going to be completely gone by the end of the conference. I started off the day with a technical overview of Solaris 10/OpenSolaris. You can find the slides for this presentation here. Before taking too much credit myself, the content of these slides is largely based off of Dan's USENIX presentation (thanks Dan!). This is a whirlwind tour of Solaris features - three slides per topic is nowhere near enough. Each of the major topics has been presented many times as a standalone 2-hour presentation, so you can imagine the corners I have to cut to cover them all.

My presentation was followed by a great OpenSolaris overview from Tom Goguen. His summary of the CDDL was one of the best I've ever seen - it was the first time I've seen an OpenSolaris presentation without a dozen questions about GPL, CDDL, and everybody's favorite pet license. Dave followed up with a detailed description of how Solaris is developed today and where we see OpenSolaris development heading in the future. All in all, we managed to cram 10+ hours of presentations into a measly 3 1/2 hours. For those of you who still have lingering questions, please stop by the Sun booth and chat with us about anything and everything. We'll be here all week.

After retiring to the booth, we had several great discussions with some of the attendees. The highlight of the day was when Dave was talking to an attendee about SMF (and the cool GUI he's working on) and I was feeling particularly bored. Since my laptop was hooked up to the monitor in the "community theater", I decided to play around with some DTrace scripts to come up with a cool demo. Within three minutes I had 4 or 5 people watching what I was doing, so I decided to start talking about all the wonders of DTrace. The 4 or 5 people quickly turned into 10 or 12, and pretty soon I found myself in the middle of a 3 hour mammoth DTrace demo, from which my voice is still recovering. This brings us to the major thing I learned today:

"If you DTrace it, they will come"

Technorati tags: Solaris, OpenSolaris



In the News(.com)

So it looks like my blog made it over to the front page of news.com in this article about slipping Solaris 10 features. Don't get your hopes up - I'm not going to refute Genn's claims; we certainly are not scheduled for a specific update at the moment. But pay attention to the details: ZFS and Janus will be available in an earlier Solaris Express release. I also find it encouraging that engineers like myself have a voice that actually gets picked up by the regular press (without being blown out of proportion or slashdotted).

I would like to point out that I putback the last major chunk of command redesign to the ZFS gate yesterday ;-) There are certainly some features left to implement, but the fact that I re-whacked all of the userland components (within six weeks, no less) should not be interpreted as any statement of schedule plans. Hopefully I can get into some of the details of what we're doing, but I don't want to be seen as promoting vaporware (even though we have many happy beta customers) or exposing unfinished interfaces which are subject to change.

I also happen to be involved with the ongoing Janus work, but that's another story altogether. I swear there's no connection between myself and slipping products (at least not one where I'm the cause).

Update: So much for not getting blown out of proportion. Leave it to the second-tier news sites to turn "not scheduled for an update" into "delayed indefinitely over deficiencies". Honestly, rewriting 5% of the code should hardly be interpreted as "delayed indefinitely" - so much for legitimate journalism. Please keep in mind that all features will hit Software Express before a S10 Update, and OpenSolaris even sooner.



GDB to MDB Migration, Part One

In past comments, it has been pointed out that a transition guide between GDB and MDB would be useful to some developers out there. A full comparison would also cover dbx(1), but I'll defer this to a later point. Given the number of available commands, I'll be dividing up this post into at least two pieces.

Before diving into too much detail, it should be noted that MDB and GDB have slightly different design goals. MDB (and KMDB) replaced the aging adb(1) and crash(1M), and was designed primarily for post-mortem analysis and live kernel analysis. To this end, MDB presents the same interface when debugging a crash dump as when examining a live kernel. Solaris corefiles have been enhanced so that all the information for the process (including library text and type information) is present in the corefile. MDB can examine and run live processes, but lacks some of the features (source level debugging, STABS/DWARF support, conditional breakpoints, scripting language) that are standard for developer-centric tools like GDB (or dbx). GDB was designed for interactive process debugging. While you can use GDB on corefiles (and even LKCD crash dumps or Linux kernels - locally and remotely), you often need the original object files to take advantage of GDB's features.

Before going too far into MDB, be sure to check out Jonathan's MDB Cheatsheet as a useful quick reference guide, with some examples of stringing together commands into pipelines. Seeing as how I'm not the most accomplished GDB user in the world, I'll be basing this comparison off the equivalent GDB reference card.

| GDB | MDB | Description |
|---|---|---|
| **Starting Up** | | |
| gdb program | mdb path / mdb -p pid | Start debugging a command or running process. GDB will treat numeric arguments as pids, while mdb explicitly requires the '-p' option. |
| gdb program core | mdb [ program ] core | Debug a corefile associated with 'program'. For MDB, the program is optional and is generally unnecessary given the corefile enhancements made during Solaris 10. |
| **Exiting** | | |
| quit | ::quit | Both programs also exit on Ctrl-D. |
| **Getting Help** | | |
| help / help command | ::help / ::help dcmd / ::dcmds / ::walkers | In mdb, you can list all the available walkers or dcmds, as well as get help on a specific dcmd. Another useful trick is ::dmods -l module, which lists walkers and dcmds provided by a specific module. |
| **Running Programs** | | |
| run arglist | ::run arglist | Runs the program with the given arguments. If the target is currently running, or is a corefile, MDB will restart the program if possible. |
| kill | ::kill | Forcibly kill and release target. |
| show env | ::getenv | Display current environment. |
| set env var string | ::setenv var=string | Set an environment variable. |
| get env var | ::getenv var | Get a specific environment variable. |
| **Shell Commands** | | |
| shell cmd | !cmd | Execute the given shell command. |
| **Breakpoints and Watchpoints** | | |
| break func / break \*addr | addr::bp | Set a breakpoint at the given address or function. |
| break file:line | - | Break at the given line of the file. MDB does not support source level debugging. |
| break ... if expr | - | Set a conditional breakpoint. MDB doesn't support conditional breakpoints, though you can get a close approximation via the -c option (though it's complicated enough to warrant its own post). |
| watch expr | addr::wp -rwx [-L size] | Set a watchpoint on the given region of memory. |
| info break / info watch | ::events | Display active watchpoints and breakpoints. MDB will show you signal events as well. |
| delete [n] | ::delete n | Delete the given breakpoint or watchpoints. |

I think that's enough for now; hopefully the table is at least readable. More to come in a future post.



Designing for Failure

In the last few weeks, I've been completely re-designing the ZFS commands from the ground up1. When I stood back and looked at the current state of the utilities, several glaring deficiencies jumped out at me2. I thought I'd use this blog entry to focus on one that is near and dear to me.

Having spent a great deal of time with the debugging and observability tools, I've invariably focused on answering the question "How do I diagnose and fix a problem when something goes wrong?". When it comes to command line utilities, the core of this problem is in well-designed error messages. To wit, running the following (former) ZFS command demonstrates the number one mistake when reporting error messages:

    # zfs create -c pool/foo pool/bar
    zfs: Can't create pool/bar: Invalid argument
    #

The words "Invalid argument" should never appear as an error message. This means that at some point in the software stack, you were able to determine there was a specific problem with an argument. But in the course of passing that error up the stack, any semantic information about the exact nature of the problem has been reduced to simply EINVAL. In the above case, all we know is that one of the two arguments was invalid for some unknown reason, and we have no way of knowing how to fix it. When choosing to display an error message, you should always take the following into account:

- An error message must clearly identify the source of the problem in a way that the user can understand.
- An error message must suggest what the user can do to fix the problem.

If you print an error message that the administrator can't understand or that doesn't suggest what to do, then you have failed and your design is fundamentally broken. All too often, error semantics are given a back seat during the design process. When approaching the ZFS user interface, I made sure that error semantics were a fundamental part of the design document.
Every command has complete usage documentation, examples, and every possible error message that can be emitted. By making this part of the design process, I was forced to examine every possible error scenario from the perspective of an administrator.

A grand vision of proper failure analysis can be seen in the Fault Management Architecture in Solaris 10, part of Predictive Self Healing. A complete explanation of FMA and its ramifications is beyond the scope of a single blog entry, but the basic premise is to move from a series of unrelated error messages to a unified framework of fault diagnosis. Historically, when hardware errors would occur, an arbitrary error message may or may not have been sent to the system log. The error may have been transient (such as an isolated memory CE), or the result of some other fault. Administrators were forced to make costly decisions based on a vague understanding of our hardware failure semantics. When error messages did succeed in describing the problem sufficiently, they invariably failed in suggesting how to fix the problem. With FMA, the sequence of errors is instead fed to a diagnosis engine that is intimately familiar with the characteristics of the hardware, and is able to produce a fault message that both adequately describes the real problem and how to fix it (when it cannot be automatically repaired by FMA).

Such a wide-ranging problem doesn't necessarily compare to a simple set of command line utilities. A smaller-scale example can be seen with the Solaris Management Facility. When SMF first integrated, it was incredibly difficult to diagnose problems when they occurred3. The result, after a few weeks of struggle, was one of the best tools to come out of SMF: svcs -x. If you haven't tried this command on your Solaris 10 box, you should give it a shot. It does automated gathering of error information and combines it into output that is specific, intelligible, and repair-focused.
During development of the final ZFS command line interface, I've taken a great deal of inspiration from both svcs -x and FMA. I hope that this is reflected in the final product.

So what does this mean for you? First of all, if there's any Solaris error message that is unclear or uninformative, that is a bug. There are some rare cases when we have no other choice (because we're relying on an arbitrary subsystem that can only communicate via errno values), but 90% of the time it's because the system hasn't been sufficiently designed with failure in mind.

I'll also leave you with a few cardinal4 rules of proper error design beyond the two principles above:

- Never distill multiple faults into a single error code. Any error that gets passed between functions or subsystems must be traceable back to a single specific failure.
- Stay away from strerror(3c) at all costs. Unless you are truly interfacing with an arbitrary UNIX system, the errno values are rarely sufficient.
- Design your error reporting at the same time you design the interface. Put all possible error messages in a single document and make sure they are both consistent and effective.
- When possible, perform automated diagnosis to reduce the amount of unimportant data, or give the user more specific data to work with.
- Distance yourself from the implementation and make sure that any error message makes sense to the average user.

1 No, I cannot tell you when ZFS will integrate, or when it will be available. Sorry.
2 This is not intended as a jab at the ZFS team. They have been working full steam on the (significantly more complicated) implementation. The commands have grown organically over time, and are beginning to show their age.
3 Again, this is not meant to disparage the SMF team. There were many more factors here, and all the problems have since been fixed.
4 "Cardinal" might be a stretch here. A better phrase is probably "random list of rules I came up with on the spot".



Bug of the week

There are many bugs out there that are interesting, either because of an implementation detail or the debugging necessary to root cause the problem. As you may have noticed, I like to publicly expound upon the most interesting ones I've fixed (as long as it's not a security vulnerability). This week turned up a rather interesting specimen:6198523 dirfindvp() can erroneously return ENOENTThis bug was first spotted by Casper back in November last year while trying to do some builds on ZFS. The basic pathology is that at some point during the build, we'd get error messages like:sh: cannot determine current directorySome ideas were kicked around by the ZFS team, and after the problem seemed to go away, the team believed that some recent mass of changes had also fixed the problem. Five months later, Jonathan hit the same bug on another build machine running ZFS. As I wrote the getcwd() code, I was determined to root cause the problem this time around.Back in build 56 of S10, I moved getcwd(3c) into the kernel, along with changes to store pathnames with vnodes (which is used by the DTrace I/O provider as well as pfiles(1)). Basically, we first try to do a forward lookup on the stored pathname; if that works, then we simply return the resolved path1. If this fails (vnode paths are never guaranteed to be correct), then we have to fall back into the slow path. This slow path involves looking up the parent, finding the current vnode in parent, prepending path, and repeat. Once we reach the root of the filesystem, we have a complete path.To debug this problem, I used this D script to track the behavior of dirtopath(), the function that performs the dirty work of the slow path. 
Running this for a while produced a tasty bit of information:

```
dirtopath	/export/ws/build/usr/src/cmd/sgs/ld
lookup(/export/ws/build/usr/src/cmd, .make.dependency.8309dfdc.234596.166) failed (2)
dirfindvp(/export/ws/build/usr/src/cmd,/export/ws/build/usr/src/cmd/sgs) failed (2)
dirtopath() returned 2
```

Looking at this, it was clear that dirfindvp() (which finds a given vnode in its parent) was inappropriately failing. In particular, after a failed lookup of a temporary make file, we bail out of the loop and report failure, despite the fact that "sgs" is still sitting there in the directory. A long look at the code revealed the problem. Without revealing too much of the code (OpenSolaris, where are you?), it's essentially structured like so:

```c
while (!err && !eof) {
	/* ... */
	while ((intptr_t)dp < (intptr_t)dbuf + dbuflen) {
		/* ... */

		/*
		 * We only want to bail out if there was an error other
		 * than ENOENT.  Otherwise, it could be that someone
		 * just removed an entry since the readdir() call, and
		 * the entry we want is further on in the directory.
		 */
		if (err != ENOENT) {
			break;
		}
	}
}
```

The code is trying to avoid exactly our situation: we fail to look up a file we just saw because the contents are rapidly changing. The bug is that the outer while loop checks !err && !eof. If we fail to look up an entry, and it's the last entry in the chunk we just read, then we prematurely bail out of the enclosing while loop, returning ENOENT when we shouldn't. Using this test program, it's easy to reproduce on both ZFS and UFS.

There are several noteworthy aspects of this bug:

- The bug had been in the gate for over a year, and there hadn't been a single reported build failure.
- It only happens when the cached vnode value is invalid, which is rare2.
- It is a race condition between readdir, lookup, and remove.
- On UFS, inodes are marked as deleted but can still be looked up until the delete queue is processed at a later point. ZFS deletes entries immediately, so the race was much more apparent on ZFS.
- Because of the above, it was incredibly transient. It would have taken an order of magnitude more time to root cause if not for DTrace, which excels at solving these transient phenomena.

A three-line change fixed the bug, and the fix will make it back to S10 in time for Update 1. If it hadn't been for those among us willing to run our builds on top of ZFS, this problem would not have been found until ZFS integrated, or until a customer escalation cost the company a whole bunch of money.

1 There are many more subtleties here relating to Zones, and verifying that the path hasn't been changed to refer to another file. The curious among you will have to wait for OpenSolaris.

2 I haven't yet investigated why we ended up in the slow path in this case. First things first.



How not to code

This little gem came up in conversation last night, and it was suggested that it would make a rather amusing blog entry. A Solaris project had a command line utility with the following, unspeakably horrid, piece of code:

```c
/*
 * Use the dynamic linker to look up the function we should
 * call.
 */
(void) snprintf(func_name, sizeof (func_name), "do_%s", cmd);

func_ptr = (int (*)(int, char **))dlsym(RTLD_DEFAULT, func_name);
if (func_ptr == NULL) {
	fprintf(stderr, "Unrecognized command %s", cmd);
	usage();
}

return ((*func_ptr)(argc, argv));
```

So when you type "a.out foo", the command would snprintf into a buffer to make "do_foo", and rely on the dynamic linker to find the appropriate function to call. Before I get a stream of comments decrying the idiocy of Solaris programmers: the code will never reach the Solaris codebase, and the responsible party no longer works at Sun. The participants at the dinner table were equally disgusted that this piece of code came out of our organization. Suffice it to say that this is much better served by a table:

```c
for (i = 0; i < sizeof (func_table) / sizeof (func_table[0]); i++) {
	if (strcmp(func_table[i].name, cmd) == 0)
		return (func_table[i].func(argc, argv));
}

fprintf(stderr, "Unrecognized command %s", cmd);
usage();
```

I still can't imagine the original motivation for this code. It is more code, harder to understand, and likely slower (depending on the number of commands and how much you trust the dynamic linker's hash table). We continually preach software observability and transparency, but I never thought I'd see obfuscation of this magnitude within a 500-line command line utility. This prevents us from even searching for callers of do_foo() using cscope.

This serves as a good reminder that the most clever way of doing something is usually not the right answer. Unless you have a really good reason (such as performance), being overly clever will only make your code more difficult to maintain and more prone to error.

Update: Since some people seem a little confused, I thought I'd elaborate on two points. First off, there is no loadable library; the library is linked directly to the application, so there is no need to asynchronously update the commands. Second, the proposed function table does not have to live separately from the code. It would be quite simple to put the function table in the same file as the function definitions, which would improve maintainability and understandability by an order of magnitude.



Google coredumper

In the last few days you may have noticed that Google released a site filled with open source applications and interfaces. First off, kudos to the Google guys for putting this together. It's always great to see a company open sourcing its tools, as well as encouraging open standards to take advantage of its services.

That being said, I found the Google coredumper particularly amusing. From the Google page:

coredumper: Gives you the ability to dump cores from programs when it was previously not possible.

Being very close to the debugging tools on Solaris, I was a little taken aback by this statement. On Solaris, the gcore(1) command has always been a supported tool for generating standard Solaris core files readable by any debugger. Seeing as how I can't imagine a UNIX system without this tool, I went looking through some old source trees to find out when it was originally written. While the current Solaris version has been rewritten over the course of time, I did find this comment buried in the old SunOS 3.5 source:

```c
/*
 * gcore - get core images of running processes
 *
 * Author: Eric Cooper
 * Written: Fall 1981.
 *
 * Inspired by a version 6 program by Len Levin, 1978.
 * Several pieces of code lifted from Bill Joy's 4BSD ps.
 */
```

So this tool has been a standard part of UNIX since 1981, based on sources as old as 1978, which is why the claim that this was "previously not possible" on Linux seemed shocking to me. Just to be sure, I logged into one of our machines running Linux and tried poking around:

```
$ find /usr/bin -name "*core*"
$
```

No luck. Intrigued, I took a look at the Google project. From the included README:

The coredumper library can be compiled into applications to create core dumps of the running program, without having to terminate them.
It supports both single- and multi-threaded core dumps, even if the kernel does not have native support for multi-threaded core files.

So the design goal appears to be slightly different: being able to dump core from within the program itself. On Solaris, I would just fork/exec a copy of gcore(1), or use the (unfortunately private) libproc interface to do so. I find it hard to believe that there are kernels out there without support for multi-threaded core files, though. A quick Google search for 'gcore linux' turned up a few mailing list articles here, here, and here. I went and downloaded the latest GDB sources, and sure enough there is a "gcore" command. I went back to our lab machine and tested it out with gdb 5.1, and sure enough it worked. But reading back the file was not as successful:

```
# gdb -p `pgrep nscd`
...
(gdb) info threads
  7 Thread 5126 (LWP 1018)  0x420e7fc2 in accept () from /lib/i686/libc.so.6
  6 Thread 4101 (LWP 1017)  0x420e7fc2 in accept () from /lib/i686/libc.so.6
  5 Thread 3076 (LWP 1016)  0x420e7fc2 in accept () from /lib/i686/libc.so.6
  4 Thread 2051 (LWP 1015)  0x420e0037 in poll () from /lib/i686/libc.so.6
  3 Thread 1026 (LWP 1014)  0x420e7fc2 in accept () from /lib/i686/libc.so.6
  2 Thread 2049 (LWP 1013)  0x420e0037 in poll () from /lib/i686/libc.so.6
  1 Thread 1024 (LWP 1007)  0x420e7fc2 in accept () from /lib/i686/libc.so.6
(gdb) bt
#0  0x420e7fc2 in accept () from /lib/i686/libc.so.6
#1  0x40034603 in accept () from /lib/i686/libpthread.so.0
#2  0x0804acd5 in geteuid ()
#3  0x4002ffef in pthread_start_thread () from /lib/i686/libpthread.so.0
(gdb) gcore
Saved corefile core.1014
(gdb) quit
The program is running.  Quit anyway (and detach it)? (y or n) y
# gdb core.1014
...
"/tmp/core.1014": not in executable format: File format not recognized
(gdb) quit
# gdb /usr/sbin/nscd core.1014
...
Core was generated by `/usr/sbin/nscd'.
Program terminated with signal 17, Child status changed.
#0  0x420e0037 in poll () from /lib/i686/libc.so.6
(gdb) info threads
  7 process 67109878  0x420e7fc2 in accept () from /lib/i686/libc.so.6
  6 process 134284278  0x420e0037 in poll () from /lib/i686/libc.so.6
  5 process 67240950  0x420e7fc2 in accept () from /lib/i686/libc.so.6
  4 process 134415350  0x420e7fc2 in accept () from /lib/i686/libc.so.6
  3 process 201589750  0x420e7fc2 in accept () from /lib/i686/libc.so.6
  2 process 268764150  0x420e7fc2 in accept () from /lib/i686/libc.so.6
* 1 process 335938550  0x420e0037 in poll () from /lib/i686/libc.so.6
(gdb) bt
#0  0x420e0037 in poll () from /lib/i686/libc.so.6
#1  0x0804aca8 in geteuid ()
#2  0x4002ffef in pthread_start_thread () from /lib/i686/libpthread.so.0
(gdb) quit
#
```

This whole exercise was rather distressing, and brought me straight back to college, when I had to deal with gdb on a regular basis (Brown moved to Linux my senior year, and I was responsible, together with Rob, for porting the Brown Simulator and Weenix OS from Solaris). Everything seemed fine when first attaching to the process, and the gcore command appeared to work. But when reading back the core file, gdb can't understand the corefile on its own, the process/thread IDs have been completely garbled, and I've lost floating point state (not shown above). It makes me glad that we have MDB and configurable corefile content in Solaris 10.

This is likely an unfair comparison, since it's using GDB version 5.1 when the latest is 6.3, but at least it validates the existence of the Google library. I always pay attention to debugging tools around the industry, but it seems I need a little more hands-on experience to really gauge the current state of affairs. I'll have to get access to a system running a more recent version of GDB to see if it is any better before drawing any definitive conclusions.
Then again, Solaris has had a working gcore(1) and mdb(1)/adb(1) since the SunOS days back in the 80s, so I don't see why I should have to lower my expectations just because it's GNU/Linux.



Another interesting bug

I know it's been a long time since I posted a blog entry. But I've either been too busy, out of the country, working on (not yet public) projects, or fixing relatively uninteresting bugs. Last week, though, I finally nailed down a nasty bug that had been haunting me for several weeks, so I thought I'd share some of the experience. I apologize if this post gets a little too technical and/or incomprehensible. But I found it to be an interesting exercise, and hopefully sharing it will get me back in the blogging spirit.

First, a little background. In Solaris, we have a set of kernel functions known as 'copyops', used to transfer data between the kernel and userland. In order to support watchpoints and SunCluster, we maintain a backup vector of functions to use when one of these fails. For example, if you have a piece of data on a watched page, we keep that page entirely unmapped. If the kernel tries to read data from this page, the copyin() function will initially fail before falling back on watch_copyin(). This goes and temporarily maps in the page, does the copy (triggering a watchpoint if necessary), and then unmaps the page. In this way, the average kernel consumer has no idea that there was a watched area on the page.

Clustering uses this facility in their pxfs (proxy filesystem) implementation. In order to support ioctl() calls that access an unspecified amount of memory, they use the copyops vector to translate any reads or writes into over-the-wire requests for the necessary data. These requests are always done from kernel threads with no attached user space, so any attempt to access userland should fault before vectoring off to their copyops vector.

OK, on to the bug. During testing, the SunCluster folks found that they were getting essentially random memory corruption during some ioctl() calls over pxfs on SPARC machines. After trying in vain to understand the crash dumps, the Clustering folks were able to reproduce the problem on DEBUG bits.
In addition to producing traptrace output (a black-box style record of OS traps), the kernel failed an ASSERT() deep in the sfmmu HAT (Spitfire Memory Management Unit Hardware Address Translation) layer during a copyin() call. This smoking gun pointed straight at the copyops. We expect a kernel thread accessing userland to generate a T_DATA_EXCEPTION trap, but instead we were getting a T_DATA_MMU_MISS trap, which the HAT was not prepared to handle (nor should it have to).

I spent nearly a week enhancing my copyops test suite and following several wrong paths deep into the SPARC trap tables and the HAT layer. But no amount of testing could reproduce the problem. Finally, I noticed that we had reached the sfmmu assertion as a kernel thread, but with our secondary ASI set to INVALID_CONTEXT instead of KCONTEXT. On SPARC, all addresses are implicitly tagged with an ASI (address space identifier) that lets us refer to kernel addresses and user addresses without having to share the address space as we do on x86. All kernel threads are supposed to use KCONTEXT (0) as their secondary ASI; INVALID_CONTEXT (1) is reserved for userland threads in various invalid states. Needless to say, this was confusing.

I knew that somehow we were either setting the secondary ASI improperly or forgetting to set it when we should. I began adding some ASSERTs to a custom kernel and quickly ruled out the former. Finally I booted a kernel with some debug code added to resume(), and panicked almost instantly. It was clear that we were coming out of resume() as a kernel thread, but with INVALID_CONTEXT as our secondary ASI. Many hours of debugging later, I finally found my culprit in resume_from_zombie(), which is used when resuming from an exited thread. When a user thread is exiting, we re-parent it to p0 (the kernel 'sched' process) and set its secondary ASI to INVALID_CONTEXT.
If, in resume(), we switch from one of these threads to another kernel thread, we see that they both belong to the same process (p0) and don't bother to re-initialize the secondary ASI. We even have a function, hat_thread_exit(), designed to do exactly this, only it was a no-op on SPARC. I added a call to sfmmu_setctx_sec() to this function, and the problem disappeared. Technically, this has been a bug since the dawn of time, but it had no ill side effects until I changed the way the copyops were used and SunCluster began testing on S10.

Besides the sheer amount of debugging effort, this bug was interesting for several reasons:

- It was impossible to root cause on a non-DEBUG kernel. While we try to make the normal kernel as debuggable as possible, memory corruption (especially corruption in the HAT layer) is one of those problems that needs to be caught as close to the source as possible. Solaris has a huge amount of debug code, as well as facilities like traptrace that can only be enabled on a debug kernel due to performance overhead.
- The cause of the problem was separated from the symptom by an arbitrary period of time. Once we switched to a kernel thread with a bad ASI, we could harmlessly switch between any number of kernel threads before finding one that actually tried to access userland.
- It was completely unobservable in constrained test scenarios. We not only needed kernel threads that accessed userland, we needed a userland thread to exit and then switch immediately to one of those threads. Needless to say, this is not easy to reproduce, especially when you don't understand exactly what's going on.
- It would have been nearly unsolvable on most other OSes. Without a kernel debugger, post mortem crash dump analysis, and tools like DTrace and traptrace records, I doubt I could ever have solved this problem. This is one of those situations where a stack trace and a bunch of printf() calls would never have solved the problem.

While this wasn't the most difficult problem I've ever had to debug, it certainly ranks up there in recent memory.



::whatthread and MDB modules

A long time ago I described a debugging problem where it was necessary to determine which threads held a reader lock. In particular, I used the heuristic that if the address of the rwlock is in a particular thread's stack, then it is most likely held by that thread (and this can be verified by examining the thread's stack). This works 99% of the time, because you typically have the following:

```c
rw_enter(lock, RW_READER);
/* ... do something ... */
rw_exit(lock);
```

The compiler has to preserve the address of the lock across all the junk in the middle, so it almost always ends up getting pushed on the stack. As described in the previous post, this means a combination of ::kgrep and ::whatis, plus some hand-pruning, to find the threads in question. At the time, I mentioned how nice it would be to have a dedicated command to do this dirty work. Now that Solaris 10 has shipped, I finally sat down and gave it a try. In a testament to MDB's well-designed interfaces, I was able to write the entire command in under 5 minutes with just 50 lines of code. On top of that, it runs in a fraction of the time: rather than searching the entire address space, we only have to look at the stack of each thread. For example:

```
> c8d45bb6::kgrep | ::whatis
c8d45ae4 is c8d45aa0+44, allocated as a thread structure
cae92ed8 is in thread c8d45aa0's stack
cae92ee4 is in thread c8d45aa0's stack
cae92ef8 is in thread c8d45aa0's stack
cae92f24 is in thread c8d45aa0's stack
> c8d45bb6::whatthread
0xc8d45aa0
>
```

The simple output allows it to be piped to ::findstack to quickly locate questionable threads. There have been discussions about maintaining a very small set of held reader locks in the thread structure, but it's a difficult problem to solve definitively (without introducing massive performance regressions).

This demonstrates an oft-overlooked benefit of MDB.
Though very few MDB module developers exist outside of the Solaris group, developing MDB modules is extremely simple and powerful (there are more than 500 commands and walkers in MDB today). Over time, I think I've almost managed to suppress all the painful GDB memories from my college years...



DTrace and customer service

Today, I thought I'd share a real-world experience that might portray DTrace in a slightly different light than you're used to. The other week, I was helping a customer with the following question:

Why is automountd constantly taking up 1.2% of CPU time?

The first thought that came to mind was a broken automountd. But if that were the case, you'd be more likely to see it spinning and stealing 100% of the CPU. Just to be safe, I asked the customer to send truss -u a.out:: output for the automountd process. As expected, I saw automountd chugging away, happily servicing each request as it came in. Automountd was doing nothing wrong; some process was indirectly sending millions of requests a day to the automounter. Taking a brief look at the kernel code, I responded with the following D script:

```
#!/usr/sbin/dtrace -s

auto_lookup_request:entry
{
	@lookups[execname, stringof(args[0]->fi_path)] = count();
}
```

The customer gave it a shot, and found a misbehaving program that was continuously restarting and causing loads of automount activity. Without any further help from me, the customer could easily see exactly which application was the source of the problem, and quickly fixed the misconfiguration.

Afterwards, I reflected on how simple this exchange was, and how difficult it would have been in the pre-Solaris 10 days. Now, I don't expect customers to be able to come up with the above D script on their own (though industrious admins will soon be able to wade through OpenSolaris code). But I was able to resolve their problem in just two emails. I was reminded of the infamous gtik2_applet2 fiasco described in the DTrace USENIX paper: automountd was just a symptom of an underlying problem, part of an interaction that was prohibitively difficult to trace to its source. One could turn on automountd debug output, but you'd still only see the request itself, not where it came from.
To top it off, the offending processes were so short-lived that they never showed up in prstat(1) output, hiding from traditional system-wide tools.

After a little thought, I imagined a few Solaris 9 scenarios where I'd either set a kernel breakpoint via kadb, or set a user breakpoint in automountd and use mdb -k to see which threads were waiting for a response. But these (and all other solutions I came up with) were:

- Disruptive to the running system
- Not guaranteed to isolate the particular problem
- Difficult for the customer to understand and execute

It really makes me feel the pain our customer support staff must go through now to support Solaris 8 and Solaris 9. DTrace is such a fundamental change in the debugging and observability paradigm that it changes not only the way we kernel engineers work, but also the way people develop applications, administer machines, and support customers. Too bad we can't EOL Solaris 8 and Solaris 9 next week for the benefit of Sun support...



Under the bootchart hood

Last week I announced our bootchart results. Dan continued with a sample of zone boot, as well as some interesting bugs that have already been found thanks to this tool. While we're working on getting the software released, I thought I'd go into some of the DTrace implementation.

To begin with, we were faced with the annoying task of creating a parsable log file. After looking at the existing implementation (which parses top output) and a bunch of groaning, Dan suggested that we output XML data and leverage the existing Java APIs to make our lives easier. Faced with the marriage between something as "low-level" as DTrace and something as "abstract" as XML, my first reaction was one of revulsion and guilt. But we quickly realized this was by far the best solution. Our resulting parser was 230 lines of code, compared with 670 for the set of parsers that make up the open source version.

Once we settled on an output format, we had to determine exactly what we would be tracing, and exactly how to do it. First off, we had to trace process lifetime events (fork, exec, exit, etc.). With the top implementation, you cannot catch exact event times, nor can you catch short-lived processes which begin and end within a sample period. So we have the following D probes:

- proc:::create - Fires when a new process is created. We log the parent PID, as well as the new child PID.
- proc:::exec-success - Fires when a process calls exec(2) successfully. We log the new process name, so that we can convert between PIDs and process names at any future point.
- proc:::exit - Fires when a process exits. We log the current PID.
- exec_init:entry - This one is a little subtle. Due to the way in which init(1M) is started, we don't get a traditional proc:::create probe, so we have to use FBT and catch calls to exec_init(), which is responsible for spawning init.

This was the easy part. The harder part was gathering usage statistics on a regular basis.
The approach we used leveraged the following probes:

- sched:::on-cpu, sched:::off-cpu - Fire when a thread goes on or off CPU. We keep track of the time spent on CPU, and increment an aggregation using the sum() aggregating function.
- profile:::tick-200ms - Fires on a single CPU every 200 milliseconds. We use printa() to dump the contents of the CPU aggregation on every interval.

There were several wrinkles in this plan. First of all, printa() is processed entirely in userland. Given the following script:

```
#!/usr/sbin/dtrace -s

profile:::tick-10ms
{
	@count["count"] = count();
}

profile:::tick-200ms
{
	printa(@count);
	clear(@count);
}
```

One would expect to see five consecutive outputs of "20". Instead, you see one output of "100" and four more of "0". Because the default switchrate for DTrace is one second, and aggregations are processed by the dtrace(1M) process, we only see the aggregations once a second. This can be fixed by decreasing the switchrate tunable. It also means we can't make use of printa() during anonymous tracing, so we had to have two separate scripts (one for early boot, one for later).

The results are reasonable, but Ziga (the original author of bootchart) suggested a much more clever way of keeping track of samples. Instead of relying on printa(), we key the aggregation on "sample number" (time divided by a large constant), and then dump the entire aggregation at the end of boot. Provided the amount of data isn't too large, the entire thing can be run anonymously, and we don't have the overhead of DTrace waking up every 10 milliseconds (in the realtime class, no less) to spit out data. We'll likely try this approach in the future.

There's more to be said, but I'll leave this post to be continued later by myself or Dan. In the meantime, you can check out a sample logfile produced by the D script.



Boot chart results

I've been on vacation for a while, but last time I mentioned that Dan and I had been working on a Solaris port of the open source bootchart program. After about a week of late night hacking, we had a new version (we kept only the image renderer). You can see the results (running on my 2x2 GHz Opteron desktop) by clicking on the thumbnail below.

In the next few posts, I'll go over some of the implementation details. We are working on open sourcing the code, but in the meantime I can talk about the instrumentation methodology and some of the hurdles that had to be overcome. A few comparisons with the existing open source implementation:

- Our graphs show every process used during boot. Unlike the open implementation, which relies on top, we can leverage DTrace to catch every process.
- We don't have any I/O statistics. Part of this is due to our instrumentation methodology, and part of it is because the system-wide I/O wait statistic has been eliminated from S10 (it was never very useful or accurate). Since we can do basically anything with DTrace, we hope to include per-process I/O statistics at a future date, as well as duplicating the iostat graph with a DTrace equivalent.
- We include absolute boot time, and show the beginning of init(1M) and those processes that run before our instrumentation can start. So if you wish to compare the above chart with the open implementation, you will have to subtract approximately 9 seconds to get a comparable time.
- We chose an "up" system as being one where all SMF services were running. It's quite possible to log into a system well before this point, however. In the above graph, for example, one could log in locally on the console after about 20 seconds, and log in graphically after about 30 seconds.
- We cleaned up the graph a little, altering colors and removing horizontal guidelines. We think it's more readable (given the large number of processes), but your opinion may differ.

Stay tuned for more details.



Boot time performance

Dan circulated the following internally, but I thought it was cool enough to share with everyone out there. It seems there's a very slick project to chart boot time performance across a variety of GNU/Linux systems. This makes it much easier to spot performance bottlenecks and improve boot time performance.

It would be really interesting to see this on a Solaris system, for two reasons. First, we have the Service Management Facility, which significantly parallelizes boot. Having a chart like this would let us see just how well it accomplishes that task. Second, DTrace provides a far superior mechanism for gathering data. The current project uses a combination of top and iostat sampling to gather data. With DTrace, we can get much more accurate data, and do cool stuff like associate I/O with individual processes, track interrupt activity, or gather performance data well before init(1M) begins executing. Our performance team has had a variation on this for a while now, using DTrace to track "events of interest" in order to analyze boot time regressions, but its presentation is nowhere near that of this tool.

Some of us have downloaded the software and are beginning to poke around with some D scripts. Stay tuned...



Jonathan on OpenSolaris

As Alan points out, Jonathan had a few comments about OpenSolaris in an interview with ComputerWorld. One interesting bit is an "official" timeline:

We will have the license announced by the end of this calendar year and the code fully available [by the] first quarter of next year.

I can tell you the OpenSolaris team is working like gangbusters to make this a reality. As soon as there is an exact date, you can bet that we'll be making some noise. Jonathan also made a statement that seems to directly conflict with my blog post yesterday:

Is there anything preventing you from making all of Solaris open-source? Nothing at all. And let me repeat that. Nothing at all.

That's a little different from my statement that "there are pieces of Solaris that cannot be open sourced due to encumbered licenses." The point here is that there are no wide-ranging problems (patents, System V code, SCO, Novell) preventing us from opening all of Solaris. Nor is there any internal pressure to keep some parts of Solaris secret. The "pieces" that I mentioned are few and far between: a single file, a command, or maybe a driver. We're also securing rights to these pieces on a monthly basis, so the number keeps dropping (and may reach zero by the time OpenSolaris goes live).

The other point is that we won't be releasing a crippled system. Even if some pieces are not immediately available in source form, we will make sure that anyone can build the same version of Solaris that we do. We hope that these pieces can be rewritten and released under the OpenSolaris license as quickly as possible.



More on OpenSolaris

So Novell has made a vague threat of litigation over OpenSolaris, which was promptly spun by the media into a declaration of war by Novell. But it has generated quite a bit of discussion over at OSNews. The claim itself is largely FUD (the "article" is little more than gossip), and the discussion covers a wide range of (often unrelated) topics. But I thought I'd pick out a few of the points that keep coming up and address them here, as the article over at LWN seems to make some of the same mistakes and assumptions.

    Sun does not own sysV, and therefore cannot legally opensource it

No one can really say what Sun owns rights to. Unless you have had the privilege of reading the many contracts Sun has (which most Sun employees haven't, myself included), it's presumptuous to state definitively what we can or cannot do legally. We have spent a lot of money acquiring rights to the code in Solaris, and we have a large legal department that has been researching and extending those rights for a very long time. Novell thinks they own rights to some code that we may or may not be using, but I trust that our legal department (and the OpenSolaris team) has done due diligence in ensuring that we have the necessary rights to open source Solaris.

    Sun has been "considering" open sourcing solaris for about five years now. It's all just a PR stunt.

I can't emphasize enough that this is not a PR stunt. We have a dozen engineers working full-time getting OpenSolaris out the door. We have fifty external customers participating in the OpenSolaris pilot. We have had discussions with dozens of customers and ISVs, as well as presenting at numerous BOFs across the country. This will happen.
Yes, it has taken five years - there's a lot of ground to cover when open sourcing 20 years of OS development.

    Even if it is open source it still is proprietary, because no one can modify its code and can't make changes, all one can do is watch and suggest to Sun.

We have already publicly stated that our goal is to build a community. There is zero benefit to us throwing source code over the wall as a half-hearted gesture towards the open source community. While it may not happen overnight, there will be community contributions to OpenSolaris. We want the responsibility to rest outside of Sun's walls, at which point we become just another (rather large) contributor to an open source project.

    However, the company has not yet announced a license, whether the license will be OSI-compliant or exactly how much of Solaris 10 will be under this open source license.

We have not announced a license, but we have also stated numerous times that it will be OSI-compliant. We know that using a non-OSI license will kill OpenSolaris before it leaves the gate. As to how much of Solaris will be released, the answer is "everything we possibly can." There are pieces of Solaris that cannot be open sourced due to encumbered licenses. But time and again people suggest that we will open source "everything but the crown jewels" - as if we could open source everything but DTrace, or everything but x86 support. Every decision is made based on existing legal agreements, not some misguided attempt to create a crippled open source port.

OpenSolaris is still under development - some of the specifics (licensing, governance model, etc.) are still being worked out. All of us are involved one way or another in the future of OpenSolaris. Our words may not carry the "official" tag associated with a press release or news conference, but we're the ones working on OpenSolaris every single day. All of this will be settled when OpenSolaris goes live (as soon as we have a date we'll let you know).
Until then, we'll keep trying to get the message out there. I encourage you to ignore your preconceived notions of Sun, of what has and has not been said in the media, and instead focus on the real message - straight from the engineers driving OpenSolaris.



Inside the cyclic subsystem

A little while ago I mentioned the cyclic subsystem. This is an interesting little area of Solaris, written by Bryan back in Solaris 8. It is the heart of all timer activity in the Solaris kernel.

The majority of this blog post comes straight from the source. Those of you inside Sun or part of the OpenSolaris pilot should check out usr/src/uts/common/os/cyclic.c. A precursor to the infamous sys/dtrace.h, this source file has 1258 lines of source code, and 1638 lines of comments. I'm going to briefly touch on the high-level aspects of the system; but as you can imagine, it's quite complicated.

Traditional methods

Historically, operating systems have relied on a regular clock interrupt. This is different from the clock frequency of the chip - the clock interrupt typically fires every 10ms. All regular kernel activity was scheduled around this omnipresent clock. One of these activities would be to check whether there are any expired timeouts that need to be triggered.

This granular frequency is usually enough for average activities, but can kill realtime applications that require high-precision timing, because it forces timers to align on these artificial boundaries. For example, imagine we need a timer to fire every 13 milliseconds. Rather than having the timer fire at 13, 26, 39, and 52 ms, we would instead see it fire at 20, 30, 40, and 60 ms. This is clearly not what we wanted. The result is known as "jitter" - timing deviations from the ideal. Timing granularity, scheduling artifacts, system load, and interrupts all introduce arbitrary latency into the system. By using existing Solaris mechanisms (processor binding, the realtime scheduling class) we could eliminate much of the latency, but we were still stuck with the granularity of the system clock. The frequency could be tuned up, but this would also increase the time spent doing other clock activity (such as process accounting), and induce significant load on the system.

Cyclic subsystem basics

Enter the cyclic subsystem.
It provides for highly accurate interval timers. The key feature is that it is based on programmable timestamp counters, which have been available for many years. In particular, these counters can be programmed (quickly) to generate an interrupt at arbitrary and accurate intervals. Originally available only for SPARC, x86 support (based on programmable APICs) is now available in Solaris 10.

The majority of the kernel sees a very simple interface - you can add, remove, or bind cyclics. Internally, we keep around a heap of cyclics, organized by expiration time. This internal interface connects to a hardware-specific backend. We pick off the next cyclic to process, and then program the hardware to notify us after the next interval. This basic layout can be seen on any system with the ::cycinfo -v dcmd:

        # mdb -k
        Loading modules: [ unix krtld genunix specfs dtrace ufs ip sctp uhci usba nca crypto random lofs nfs ptm ipc ]
        > ::cycinfo -v
        CPU  CYC_CPU   STATE NELEMS     ROOT            FIRE HANDLER
          0  d2da0140 online      4 d2da00c0   50d7c1308c180 apic_redistribute_compute

                             1
                             |
          +------------------+------------------+
          0                                     3
          |                                     |
     +---------+--------+             +---------+---------+
     2
     |
  +----+----+

            ADDR NDX HEAP LEVL PEND            FIRE USECINT HANDLER
        d2da00c0   0    1 high    0   50d7c1308c180   10000 cbe_hres_tick
        d2da00e0   1    0  low    0   50d7c1308c180   10000 apic_redistribute_compute
        d2da0100   2    3 lock    0   50d7c1308c180   10000 clock
        d2da0120   3    2 high    0   50d7c35024400 1000000 deadman
        >

On this system there are no realtime timers active, so the intervals (USECINT) are pretty boring. You may notice one elegant feature of this implementation - the clock() function is now just a cyclic consumer. If you're wondering what 'deadman' is, and why it has such a high interval, it's a debugging feature that saves the system from hanging indefinitely (most of the time). Turn it on by adding 'set snooping = 1' in /etc/system.
If the clock cannot make forward progress in 50 seconds, a high level cyclic will fire and we'll panic.

To register your own cyclic, use the timer_create(3RT) function with the CLOCK_HIGHRES type (assuming you have the PROC_CLOCK_HIGHRES privilege). This will create a low level cyclic with the appropriate timeout. The average latency is extremely small when done properly (bound to a CPU with interrupts disabled) - on the order of a few microseconds on modern hardware. Much better than the 10 millisecond artifacts possible with clock-based callouts.

More details

At a high level, this seems pretty straightforward. Once you figure out how to program the hardware, just toss some function pointers into an AVL tree and be done with it, right? Here are some of the significant wrinkles in this plan:

Fast operation - Because we're dispatching real-time timers, we need to be able to trigger and re-schedule cyclics extremely quickly. In order to do this, we make use of per-CPU data structures, heap management that touches a minimal number of cache lines, and lock-free operation. The latter point is particularly difficult, considering the presence of low-level cyclics.

Low level cyclics - The cyclic subsystem operates at a high interrupt level. But not all cyclics should run at such a high level (and very few do). In order to support low level cyclics, the subsystem will post a software interrupt to deal with the cyclic at a lower interrupt level. This opens up a whole can of worms, because we have to guarantee a 1-to-1 mapping, as well as maintain timing constraints.

Cyclic removal - While rare, it is occasionally necessary to remove pending cyclics (the most common occurrence is when unloading modules with registered cyclics). This has to be done without disturbing the other running cyclics.

Resource resizing - The heap, as well as internal buffers used for pending low-level cyclics, must be able to handle any number of active cyclics.
This means that they have to be resizable, while maintaining lock-free operation in the common path.

Cyclic juggling - In order to offline CPUs, we must be able to re-schedule cyclics on other active CPUs, without missing a timeout in the process.

As you can see, the cyclic subsystem is a complicated but well-contained subsystem. It uses a well-organized layout to expose a simple interface to the rest of the kernel, and provides great benefit to both in-kernel consumers and timing-sensitive realtime applications.
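The add/fire/re-program loop described above can be sketched as a toy model (plain Python with hypothetical names, not Solaris code - the real cyclic.c is per-CPU, lock-free, and vastly more careful):

```python
import heapq

class Cyclics:
    """Toy model of the cyclic subsystem's core: a heap of cyclics
    ordered by absolute expiration time (in microseconds)."""
    def __init__(self):
        self.heap = []
        self.seq = 0            # tie-breaker so handlers never compare

    def add(self, first_fire, interval, handler):
        heapq.heappush(self.heap, (first_fire, self.seq, interval, handler))
        self.seq += 1

    def next_deadline(self):
        # what the backend would program into the hardware timer
        return self.heap[0][0] if self.heap else None

    def fire_until(self, now):
        """Fire every expired cyclic, re-arming each at its interval;
        return the handler names fired, in order."""
        fired = []
        while self.heap and self.heap[0][0] <= now:
            exp, seq, ival, handler = heapq.heappop(self.heap)
            fired.append(handler)
            heapq.heappush(self.heap, (exp + ival, seq, ival, handler))
        return fired

cyc = Cyclics()
cyc.add(10000, 10000, "clock")        # the 10ms tick, now just a consumer
cyc.add(1000000, 1000000, "deadman")  # the high-interval deadman cyclic
print(cyc.fire_until(30000))          # ['clock', 'clock', 'clock']
print(cyc.next_deadline())            # 40000
```

Because the heap is keyed on absolute expiration, a 13 ms cyclic would fire at exactly 13, 26, 39 ms - the jitter described earlier comes from a fixed clock tick, not from this design.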



Dual Core Opterons

So it's no secret that AMD and Intel are in a mad sprint to the finish for dual-core x86 chips. The official AMD roadmap, as well as public demos, have all shown AMD well on track. The latest tidbits of information indicate Linux is up and running on these dual-core systems. Very cool.

Given our close relationship with AMD and the sensitive nature of hardware plans, I'll refrain from saying what we may or may not have running in our labs. But Solaris has some great features that make it well-suited for these dual core chips. First of all, Solaris 10 has had support for both Chip Multi Threading (hyperthreading) and Chip Multi Processing (multi core) for about a year and a half now. Solaris has also been NUMA-aware for much longer (with the current lgroups coming in mid-2001, or Solaris 9). I'm sure AMD has made these cores appear as two processors for legacy purposes, but with a few cpuid tweaks, we'll see them as sibling cores and get all the benefits inherent in Solaris 10 CMP.

Despite this, the NUMA system in Solaris is undergoing drastic change due to the Opteron memory architecture. While Solaris is NUMA-aware, it uses a simplistic memory hierarchy based on the physical architecture of Sun's high end SPARC systems. We have the notion of a "locality group", which represents the logical relationship of CPUs and memory. Currently, there are only two notions of locality - "near" and "far". Solaris tries its best to keep logically connected memory and processes in the same locality group. On Opteron, things get a bit more complicated due to the integrated memory controller and HyperTransport layout. On 4-way machines the processors are laid out in a square, and on 8-way machines we have a ladder formation. Memory transfers must pass through neighboring memory controllers, so now memory could be "near", "far", or "farther".
We're revamping the current lgroup system to support arbitrary memory hierarchies, which should produce some nice performance gains on 4- and 8-way Opteron machines. Hopefully one of the NUMA folks will blog some more detailed information once this project integrates.

In conclusion: Opterons are cool, but dual-core Opterons are cooler. And Solaris will rip on both of them.
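The "near/far/farther" distinction falls out of simple hop counting over the HyperTransport links. Here is a toy sketch (the node numbering and link layout are illustrative, not AMD's actual topology tables):

```python
from collections import deque

def hop_distances(links, start):
    """Breadth-first search over point-to-point links: how many
    memory controllers a transfer from `start` must pass through."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in links[node]:
            if peer not in dist:
                dist[peer] = dist[node] + 1
                queue.append(peer)
    return dist

# 4-way Opteron: sockets laid out in a square, each linked to two neighbors
square = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(hop_distances(square, 0))   # {0: 0, 1: 1, 2: 1, 3: 2}
```

From socket 0, sockets 1 and 2 are one hop away ("far") while socket 3 is two hops away ("farther") - exactly the third locality level the revamped lgroups need to express.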



Debugging on AMD64 - Part Three

Given that the amd64 ABI is nearly set in stone, and (as pointed out in comments on my last entry) future OpenSolaris ports could run into similar problems on other architectures (like PowerPC), you may wonder how we can make life easier in Solaris. In this entry I'll elaborate on two possibilities. Note that these are little more than fantasies at the moment - no real engineering work has been done, nor is there any guarantee that they will appear in a future Solaris release.

DWARF Support for MDB

Even though DWARF is a complex beast, it's not impossible to write an interpreter. It's just a matter of doing the work. The more subtle problem is designing it correctly, and making the data accessible in the kernel. Since MDB and KMDB are primarily kernel or post-mortem userland tools, this has not been a high priority. CTF gives us most of what we need, and including all the DWARF information in the kernel (or corefiles) is prohibitively expensive. That being said, there are those among us that would like to see MDB take a more prominent userland role (where it would compete with dbx and gdb), at which point proper DWARF support would be a very nice thing to have.

If this is done properly, we'll end up with a debugging library that's format-independent. Whether the target has CTF, STABS, or DWARF data, MDB (and KMDB) will just "do the right thing". No one argues that this isn't a cool idea - it's just a matter of engineering resources and business justification.

Programmatic Disassembler

The alternative solution is to create a disassembler library that understands code at a semantic level. Once you have a disassembler that understands the logical breakdown of a program, you can determine (via simulation) the original argument values to functions. Of course, it's not always guaranteed to work, but you'll always know when you're guessing (even DWARF can't be correct 100% of the time). This requires no debugging information, only the machine text.
It will also help out the DTrace pid provider, which has to wrestle with jump tables and other weird compiler-isms. Of course, this is monumentally more difficult than a DWARF parser - especially on x86.

This idea (along with a prototype) has been around for many years. The converted have prophesied that libdis will bring peace to the world and an end to world hunger. As with many great ideas, there just hasn't been justification for devoting the necessary engineering resources. But if it can get the arguments to functions on amd64 correct in 98% of situations, it would be incredibly valuable.

OpenSolaris Debugging Futures

There are a host of other ideas that we have kicking around here in the Solaris group. They range from pretty mundane to completely insane. As OpenSolaris finishes getting in gear, I'm looking forward to getting these ideas out in the public and finding support for all the cool possibilities that just aren't high enough priority for us right now. The existence of a larger development community will also make good debugging tools a much better business proposition.
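The "simulation" idea behind a semantic disassembler is easy to illustrate: forward-propagate facts like "this register still holds a copy of arg0" through each instruction. A toy sketch over a pre-parsed instruction list (real x86 decoding is, of course, the hard part, and these tuples are purely hypothetical):

```python
def track_arg0(instructions):
    """Forward-simulate (op, src, dst) moves, tracking every location
    that still holds a copy of the first argument (%rdi on amd64)."""
    holders = {"%rdi"}
    for op, src, dst in instructions:
        if op == "mov":
            if src in holders:
                holders.add(dst)      # the copy propagates
            else:
                holders.discard(dst)  # destination clobbered
    return holders

insns = [
    ("mov", "%rdi", "%r12"),  # callee saves arg0 in %r12
    ("mov", "%rax", "%rdi"),  # %rdi itself is later clobbered
]
print(sorted(track_arg0(insns)))  # ['%r12'] - arg0 survives only in %r12
```

This is exactly the reasoning a kernel developer does by hand today; a libdis that did it automatically would make the walkthrough in my last entry a one-liner.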



Debugging on AMD64 - Part Two

Last post I talked about one of the annoying features of the amd64 ABI - the optional frame pointer. Today, I'll examine the much more painful problem of argument passing on amd64. For the sake of discussion, I'll avoid structure passing and floating point - nasty little kinks in the problem.

Argument Passing on i386

On i386, all arguments are passed on the stack. Before establishing a frame, the caller pushes each argument to the function in reverse order. This gives you this stack layout:

        ...
        arg1
        arg0
        return PC
        previous frame   <- %ebp
        current frame    <- %esp

If you want to access the third argument, you simply reference 16(%ebp) (8 for the frame + 8 to skip the first two args). This makes debugging a breeze. For any given frame pointer (easy to find thanks to the i386 ABI), we can always find the initial arguments to the function. Another trick we use is that nearly every function call is followed by an addl x, %esp instruction. Using this information, we can figure out how many arguments were passed to the function, without relying on CTF or STABS data. Putting this all together, it's easy to get a meaningful stack trace:

        > a76de800::findstack -v
        stack pointer for thread a76de800: a77c5dd4
        [ a77c5dd4 0xfe81994d() ]
          a77c5dec swtch+0x1cb()
          a77c5e10 cv_wait_sig+0x12c(a78a79b0, a6c57028)
          a77c5e70 cte_get_event+0x4d()
          a77c5ea4 ctfs_endpoint_ioctl+0xc2()
          a77c5ec4 ctfs_bu_ioctl+0x2f()
          a77c5ee4 fop_ioctl+0x1e(a79a7980, 63746502, 80d3f48, 102001, a69daf08, a77c5f74)
          a77c5f80 ioctl+0x19b()
          a77c5fac sys_call+0x16e()

Argument Passing on AMD64

Enter amd64. As previously mentioned, the amd64 ABI was designed primarily for performance, not debugging. The architects decided that pushing arguments on the stack was expensive, and that with 16 general purpose registers, we might as well use some of them to pass arguments.
Specifically, we have:

        arg0    %rdi
        arg1    %rsi
        arg2    %rdx
        arg3    %rcx
        arg4    %r8
        arg5    %r9
        argN    8*(N-4)(%rbp)

This is a disaster for debugging. Debugging tools that operate in-place (DTrace and truss) can get meaningful arguments, but cannot know how many there are. Tools which examine a stack trace (pstack, mdb) cannot get arguments for any frame. The arguments may or may not be pushed on the stack, or they could be lost completely. If we try to get a stack with arguments, we find:

        > ffffffff8af1c720::findstack -v
        stack pointer for thread ffffffff8af1c720: ffffffffb2a51af0
          ffffffffb2a51d00 vpanic()
          ffffffffb2a51d30 0xfffffffffe972ae3()
          ffffffffb2a51d60 exitlwps+0x1f1()
          ffffffffb2a51dd0 proc_exit+0x40()
          ffffffffb2a51de0 exit+9()
          ffffffffb2a51e40 psig+0x2bc()
          ffffffffb2a51ee0 post_syscall+0x7d5()
          ffffffffb2a51f00 syscall_exit+0x5d()
          ffffffffb2a51f10 sys_syscall32+0x1d8()

The solution

The solution, as envisioned by the amd64 ABI designers, is to rely on DWARF to get the necessary information. If you have ever read the DWARF spec, you know that it is a gigantic, ugly beast - an interpreted language that can be used to mine virtually any debugging data in an abstract manner. The problem here is that it requires significantly more work than on i386, and it requires debugging information to be present in the target object.

Implementing a DWARF interpreter is technically quite doable. We even had one brave soul go so far as to implement a limited DWARF disassembler capable of grabbing arguments for functions. But it turns out that the sheer amount of data we would have to add to the kernel to enable this was prohibitive. The bloat would have pushed us past the limit of the miniroot, not to mention the increased memory footprint and necessary changes to krtld and KMDB. That's not to say we won't support it in userland some day.

The lack of an argument count is less serious.
DTrace doesn't need to know how many arguments there are. For the moment, truss simply always shows the first 6 arguments. But truss could be enhanced to use CTF and/or DWARF data to determine the number of arguments to a given function. But it probably won't happen any time soon.

Workaround

Given that there will be no solution to this problem any time soon, you may ask how one can do any kind of debugging at all. The answer is "painfully". I'll walk through an example of finding the arguments to a function, using the following stack:

        > ffffffff8356c100::findstack -v
        stack pointer for thread ffffffff8356c100: ffffffffb2bbdb10
        [ ffffffffb2bbdb10 _resume_from_idle+0xe4() ]
          ffffffffb2bbdb40 swtch+0xc9()
          ffffffffb2bbdb90 cv_wait_sig+0x170()
          ffffffffb2bbdc50 cte_get_event+0xb0()
          ffffffffb2bbdc70 ctfs_endpoint_ioctl+0x7e()
          ffffffffb2bbdc80 ctfs_bu_ioctl+0x32()
          ffffffffb2bbdc90 fop_ioctl+0xb()
          ffffffffb2bbdd70 ioctl+0xac()
          ffffffffb2bbde00 dosyscall+0x12b()
          ffffffffb2bbdf00 trap+0x1308()
        >

Let's say that we want to know the first argument to fop_ioctl(), which is a vnode.
The first step is to look at the caller and see where the argument came from:

        > ioctl+0xac::dis -n 6
------> ioctl+0x8e:                     movq   0x10(%r12),%rdi
        ioctl+0x93:                     movq   0x1a0(%rax),%r8
        ioctl+0x9a:                     leaq   -0xcc(%rbp),%r9
        ioctl+0xa1:                     movq   %r15,%rdx
        ioctl+0xa4:                     movl   %r13d,%esi
------> ioctl+0xa7:                     call   +0xeed99 <fop_ioctl>
        ioctl+0xac:                     testl  %eax,%eax
        ioctl+0xae:                     movl   %eax,%ebx
        ioctl+0xb0:                     jne    +0x74    <ioctl+0x124>
        ioctl+0xb2:                     cmpl   $0x8004667e,%r13d
        ioctl+0xb9:                     je     +0x27    <ioctl+0xe0>
        ioctl+0xbb:                     movl   %r14d,%edi
        ioctl+0xbe:                     call   -0x1408e <releasef>

We can see that %rdi (the first argument) came from %r12. Looks like we lucked out - %r12 must be preserved by the function being called. So we look at fop_ioctl():

        > fop_ioctl::dis
        fop_ioctl:                      movq   0x40(%rdi),%rax
        fop_ioctl+4:                    pushq  %rbp
        fop_ioctl+5:                    movq   %rsp,%rbp
        fop_ioctl+8:                    call   *0x28(%rax)
        fop_ioctl+0xb:                  leave
        fop_ioctl+0xc:                  ret

No dice. We can see that %r12 (as well as %rdi) is still active at this point. Let's keep looking:

        > ctfs_bu_ioctl::dis ! grep r12
        > ctfs_endpoint_ioctl::dis ! grep r12
        > cte_get_event::dis ! grep r12
        cte_get_event+0x13:             pushq  %r12
        cte_get_event+0x32:             movq   0x20(%rdi),%r12
        ...

Finally, we found a function that preserves %r12.
Taking a closer look at cte_get_event():

        > cte_get_event::dis -n 8
        cte_get_event:                  pushq  %rbp
        cte_get_event+1:                movq   %rsp,%rbp
        cte_get_event+4:                pushq  %r15
        cte_get_event+6:                movl   %esi,%r15d
        cte_get_event+9:                pushq  %r14
        cte_get_event+0xb:              movq   %rcx,%r14
        cte_get_event+0xe:              pushq  %r13
        cte_get_event+0x10:             movl   %r9d,%r13d
        cte_get_event+0x13:             pushq  %r12

We can see that %r12 was pushed fourth after establishing the frame pointer. This would put it 32 bytes below %rbp for this frame. Remembering that what was really passed was 0x10(%r12), we can finally find our original argument:

        > ffffffffb2bbdc50-20/K
        0xffffffffb2bbdc30:             ffffffff8330ec88
        > ffffffff8330ec88+10/K
        0xffffffff8330ec98:             ffffffff83a5f600
        > ffffffff83a5f600::print vnode_t v_path
        v_path = 0xffffffff83978c40 "/system/contract/process/pbundle"

Whew. We can see that we have the proper vnode, since the path references a /system/contract file. And all it took was about 12 steps! You can see how this has become such a pain for us kernel developers. From the above example, you can see the approximate method is:

1. Determine where the argument came from in the caller. Hopefully, you will find something that came from the stack, or one of the callee-saved registers (%r12-%r15). If not, look at the function and see if the argument was pushed on the stack or moved somewhere more permanent. This doesn't happen often, so it may be that your argument is lost forever.

2. If the argument came from a callee-saved register, examine every function in the stack until you find one that saves the value.

3. By this point, you've hopefully found a place where the value is stored relative to %rbp.
Using the frame pointers displayed in the stack trace, fetch the value from the stack.

This is not always guaranteed to work, and is obviously a royal pain. In my next post, I'll go into some future ideas we have to make this (and other debugging) better.
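For reference, the amd64 integer-argument rules used throughout this walkthrough can be captured in a few lines (a sketch that ignores floating point and struct passing, per the caveat at the top, and assumes a conventional frame has been established):

```python
ARG_REGS = ["%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9"]

def arg_location(n):
    """Where integer argument n lives at function entry on amd64:
    the first six go in registers, the rest spill to the stack."""
    if n < 6:
        return ARG_REGS[n]
    # the seventh argument (n == 6) lands at 16(%rbp), just above
    # the return PC and the saved frame pointer
    return "%d(%%rbp)" % (8 * (n - 4))

print(arg_location(0), arg_location(6))   # %rdi 16(%rbp)
```

Everything register-resident is what makes the hunt above necessary: once the callee moves on, arg0's original home in %rdi is simply gone.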



Debugging on AMD64 - Part One

The amd64 port of Solaris has been available (internally) for about a month and a half, and the rest of the group is starting to realize what those of us on the project team have known for a while: debugging on amd64 is a royal pain. The difficulty comes not from processor changes, but from design choices made in the AMD64 ABI. The ABI was designed primarily with performance in mind - debuggability and observability were largely an afterthought. There are two features of the ABI that really hurt debuggability. In this post I'll cover the less annoying of the two - look for another followup soon.

Frame Pointers

In the i386 ABI, you almost always have to establish a frame pointer for the current function (leaf routines being the exception). This gives you the familiar opening function sequence:

        pushl   %ebp
        movl    %esp, %ebp

And your frame ends up looking like this:

        ...
        arg1
        arg0
        return PC
        previous frame   <- %ebp
        current frame    <- %esp

This is a restriction of the ABI, not the processor. You can cheat by using the -fomit-frame-pointer flag to gcc, but this is not ABI compliant (although some people still think it's a great idea).

The problem

With amd64, you would think that they would just keep this convention. At first glance it seems that way, until you find this little footnote in section 3.3.2:

    The conventional use of %rbp as a frame pointer for the stack frame may be avoided by using %rsp (the stack pointer) to index into the stack frame. This technique saves two instructions in the prologue and epilogue and makes one additional general-purpose register (%rbp) available.

On amd64, the frame pointer is explicitly optional. To make debugging somewhat easier, they provide a .eh_frame ELF section that gives enough information (in the form of a binary search table) to traverse a stack from any point. This is slightly better than DWARF, but still requires a lot of processing.
The problem with this is that it unnecessarily restricts the context from which you can gather a backtrace. On i386, your stack walking function is something like:

        frame = %ebp
        while (not at top of stack)
                process frame
                frame = *frame

Simple and straightforward. This omits a few nasty details like signal frames and #gp faults, but it's largely correct. On amd64, you now have to load the .eh_frame section, process it, and keep it someplace where you have easy access to it. While this doesn't sound so bad for gdb, it becomes a huge nightmare for something like DTrace. If you read a little bit of the technical details behind DTrace, you'll understand that probes execute in arbitrary context. You may be in the middle of handling an interrupt, in dispatcher or VM code, or processing a trap (although on SPARC, DTracing code that executes at TL > 0 is strictly verboten). This means that the set of possible actions is severely limited, not to mention performance-critical. In order to process a stack() directive on amd64, we would now have to do something like:

        frame = %ebp
        while (not at top of stack)
                process frame
                for (each module in the system)
                        next = binary search in .eh_frame
                        if (next)
                                frame = next
                if (frame not found)
                        frame = *frame

Of course, you could maintain a merged lookup table for all modules on the system, but this is considerably more difficult and a maintenance nightmare. The real show stopper comes with the ustack() action. It is impossible, from arbitrary context within the kernel, to process the objects in userland and find the necessary debugging information.
And unless we're using only the pid provider, there's no way to know a priori what processes we will need to examine via ustack(), so we can't even cache the information ahead of time.

The solution

What do we do in Solaris? We punt. Our linkers will happily process .eh_frame sections correctly, but our debugging tools (DTrace, mdb, pstack, etc.) will only understand executables that use a frame pointer. All of our code (kernel, libraries, binaries) is compiled with frame pointers, and hopefully our users will do so as well.

The amd64 ABI is still a work in progress, and the Solaris supplement is not yet finished. More language may be added to clarify the Solaris position on this "feature". It will probably be a non-issue as long as GCC defaults to having frame pointers on amd64 Solaris. I'm not completely sure how the latest GCC behaves - I believe that it defaults to using frame pointers, which is good. I just hope -fomit-frame-pointer never becomes common practice as we move to OpenSolaris and a larger development community.

Motivation

Why was this written into the amd64 ABI? It's a dubious optimization that severely hinders debuggability. Some research claims a substantial improvement, though their own data shows questionable gains. On i386, you at least had the advantage of increasing the number of usable registers by 20%. On amd64, adding a 17th general purpose register isn't going to open up a whole new world of compiler optimizations. You're just saving a pushl and a movl, a pair of operations that (for obvious reasons) is highly optimized on x86. And for leaf routines (which never establish a frame), this is a non-issue. Only in extreme circumstances does the cost (in processor time and I-cache footprint) translate to a tangible benefit - circumstances which usually resort to hand-coded assembly anyway.
Given the benefit and the relative cost of losing debuggability, this hardly seems worth it.

It may seem a moot point, since you've been able to use -fomit-frame-pointer on i386 for years. The difference here is that on i386, you were knowingly breaking ABI compatibility by using that option. Your application was no longer guaranteed to work properly, especially when it came to debugging. On amd64, this behavior has received official blessing, so that your application can be ABI compliant but completely opaque to DTrace and mdb. I'm not looking forward to "DTrace can't ustack() my gcc-compiled app" bugs (DTrace already has enough trouble dealing with curious gcc-isms as it is).

It's conceivable that we could add support for this functionality in our userland tools, but don't expect it any time soon. And it will never happen for DTrace. If you think saving a pushl or movl here or there is worth it, then you're obviously so performance-oriented that debuggability is the last thing on your mind. I can understand some of our HPC customers needing this; it's when people start compiling /usr/bin/* without frame pointers that it gets out of control. Just don't be surprised when you try to DTrace your highly tuned app and find out you can't get a proper stack trace...

Next post, I'll discuss register passing conventions, which is a much more visible (and annoying) problem.
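As a postscript, the frame-pointer walk that all of our tools depend on really is as simple as the pseudocode earlier in this post. A toy model, with a dictionary standing in for memory (the addresses and values are made up):

```python
def walk_stack(memory, fp):
    """Follow i386-style linked frames: each frame pointer slot holds
    the previous %ebp, with the return PC stored just above it."""
    pcs = []
    while fp != 0:                  # 0 marks the top of the stack
        pcs.append(memory[fp + 4])  # return PC sits at 4(%ebp)
        fp = memory[fp]             # saved previous frame pointer
    return pcs

# two frames: 0x100 -> 0x200 -> top of stack
memory = {0x100: 0x200, 0x104: 0xdeadbeef,
          0x200: 0x0,   0x204: 0xcafebabe}
print([hex(pc) for pc in walk_stack(memory, 0x100)])
```

Four lines of logic, no auxiliary tables, and safe to run from almost any context. That is what .eh_frame-based unwinding gives up.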



Back from the dead

Despite what it may seem, I have not fallen off the face of the earth. Those of us in the Solaris kernel group have been a little busy lately, as we're in the final stretch of Solaris 10. Hopefully, this post will be the start of a return to my old blogging form. A few things I've been up to recently:

Fixing bugs. Lots of bugs. SMF, procfs, amd64, you name it. The only user-visible change is one of the SMF features: abbreviated FMRIs. Those of you out there never had to endure the dark ages where svcadm disable sendmail was not a valid command - you had to make sure to remember the entire FMRI (network/smtp:sendmail). This has since propagated to all the SMF commands, including svccfg(1M) and svcprop(1). On the not-so-visible side, one of the more interesting bugs I tracked down was a nasty hang in the kernel when running the GDB test suite. If a process with watchpoints enabled received a SIGKILL at an inopportune moment, it would descend into the dark pits of the kernel, never to return. Once OpenSolaris goes live, I'd love to blog about some of the crazy dances procfs has to do, but it's incomprehensible without some source code to go along with it.

Tinkering with ZFS. I'm working part-time on a cool ZFS enhancement that (we hope) will do wonders for remote administration and zones virtualization. I'll post some more as the details get fleshed out.

Solaris 10 launch. I was at the launch, talking about DTrace and sitting in on the Experts Exchange. It was a great event - full of great announcements and lots of customer enthusiasm. Once you see S10 in action, and experience it for yourself, it's hard not to be enthusiastic.

It's hard to surf tech news these days without hitting a Solaris 10 article, but one particularly interesting one is this Forrester analysis. You can also catch the HP backlash, with a mother-approved response, as well as an I-don't-have-to-answer-to-my-boss response. More tech-heavy posts are in the works...



Microstate accounting in Solaris 10

With the tidal wave of features that is Solaris 10, it's all too easy to miss the less visible enhancements. One of these "lost features" is microstate accounting, turned on by default for all processes. Process accounting keeps track of time spent by each thread in various states - idle, in system mode, in user mode, etc. This is used for calculating the usage percentages you see in prstat(1) as well as time(1) and ptime(1). Historically there have been two ways of doing process accounting:

Clock-based accounting

Virtually every operating system in existence has a clock function - a periodic interrupt used to perform regular system activity1. On Solaris, the clock function does a number of simple tasks, mostly having to do with process and CPU accounting. Clock-based accounting examines each CPU, and increments per-process statistics depending on whether the current thread is in user mode, system mode, or idle. This uses statistical sampling (taking a snapshot every 10 milliseconds) to come up with a picture of system resource usage.

Microstate accounting

With microstate accounting, the counters are updated with each state transition as it occurs, in real time. This results in much more accurate, if more expensive, accounting information. Previously, you could turn this on through the /proc interfaces (proc(4)) using the PR_MSACCT control. This has since become a no-op; you can't disable microstate accounting in Solaris 10.

The clock-based sampling method has several drawbacks. One is that it is only a guess for rapidly transitioning threads. If you're going between system and user mode faster than once every 10 milliseconds, the clock will only account for a single state during this time period. Even worse is the fact that it can miss threads entirely if it catches a CPU in an idle state.
If you have a thread that wakes up every 10 milliseconds and does 9 milliseconds of work before going to sleep, you will never account for the fact that it is using 90% of the CPU2. These 'phantom' threads can hide from traditional process tools. The now-infamous gtik2_applet2 outlined in the DTrace USENIX paper is a good example of this. These inaccuracies also affect the fair share scheduler (FSS(7)), which must make resource allocation decisions based on the accuracy of these accounting statistics. With microstate accounting, all of these artifacts disappear. Since we do the calculations in situ, it's impossible to "miss" a transition.

The reason we haven't always relied on microstate accounting is that it used to be a fairly expensive process. Threads transition between states with high frequency, and with each transition we need to get a timestamp to determine how long we spent in the last state. The key observation that made this practical is that the expensive part is not reading the hardware timestamp. The expensive part is the conversion to nanoseconds - clock ticks are useless when doing any sort of real accounting. The solution was to use clock ticks for all the internal accounting, and only do the conversion to nanoseconds when actually requested. With a few other tricks, we were able to reduce the impact to virtually nil. You can see the difference in micro-benchmark performance of short system calls (such as getpid(2)), but it's unnoticeable in any normal system call (like read(2)), and nonexistent in any macro benchmark. All the great benefits of microstate accounting at a fraction of the price.

1 The clock function is actually a consumer of the cyclic subsystem - a nice callout mechanism written by Bryan back in Solaris 8. Certainly worth a post of its own in the future.

2 This is especially true on x86, because historically the hardware clock has been the source of all timer activity.
This means every thread ran in lockstep with the clock with respect to timers and wakeups. This has been addressed in Solaris 10 with the introduction of arbitrary precision timer interrupts, which makes the cyclic subsystem far more interesting on x86 machines (Solaris SPARC has had this functionality for a while).



Solaris on AMD64

Those students in the class paying attention may have noticed that Solaris on amd64 is quietly coming to life. While it's been thriving in development for a while now, it's finally flexing its muscles publicly. A few sightings at blogs.sun.com:

Pete Shanahan pointing out amd64 on his sleek 64-bit laptop, as well as describing some differences between compiler flags in the 64-bit world.
Alan Duboff getting JDS installed and playing around with 64-bit compilation.
Joe Bonasera, one of the main guns behind the amd64 port.
Adam Leventhal polishing off DTrace for amd64.

As for myself, I have been pitching in with amd64 development here and there; mostly getting the userland debugging tools functional. Recently, I ported ZFS to amd64 - which was really only a bunch of Makefile and compiler changes (the joys of porting a well-designed 64-bit capable subsystem). There are, of course, others doing most of the work - Bryan, Bart, and the many anonymous team members too busy doing the heavy lifting to maintain an external blog. For you Solaris Express junkies out there, amd64 support will probably not be available in the next release. Look for it in a later release or S10 FCS.

All told, amd64 support is pretty exciting. The hardware is blazingly fast (and cheap!), and we finally have an OS that can really take advantage of all that it has to give. With our resident hardware genius behind the wheel of our amd64 platforms, we're going to be coming out with some absolutely killer hardware. 2005 will be an interesting year...



Behind the music: /system/object

About a month ago, I added a new pseudo filesystem to Solaris: 'objfs', mounted at /system/object. This is an "under the hood" filesystem that no user will interact with directly. But it's a novel solution to a particularly thorny problem for DTrace, and may be of some interest to the curious techies out there.

When DTrace was first integrated into Solaris, it had a few hacks to get around the problem of accessing kernel module data from userland. In particular, it opened /dev/kmem in order to find the address ranges of module text and data segments, and introduced a private modctl call in order to extract CTF and symbol table information. The end result was something that mostly worked, but with a few drawbacks. Opening /dev/kmem requires all privileges or membership in group sys, so even if you give a user the dtrace_kernel privilege, they still were unable to access kernel CTF and symbol information. Direct access via modctl necessitated a complicated (and sometimes broken) dance to allow 32-bit dtrace apps to work on a 64-bit kernel.

The solution was to create a new pseudo filesystem which would export information about the currently loaded objects in the kernel as standard ELF files. Choosing a pseudo filesystem over a new system call or modctl has several advantages:

Filesystems are great for exporting hierarchical data. We needed to export a collection of named data - perfect for a directory layout.
By modeling objects as directories and choosing an ELF file format, we have left room for expansion without having to go back and modify existing implementations.
We can leverage our existing toolset for working with ELF files: elfdump(1), nm(1), libelf(3LIB), and libctf. The userland changes to libdtrace(3LIB) were minimal because we already have established interfaces for working with ELF files.
Filesystems are easily virtualized in a local zone.
DTrace is still not usable from within a local zone for a few small reasons, but we're one step closer.
There are no data model issues. We simply export a 64-bit ELF object, and the gelf_xxx() routines handle the conversion transparently.

The final result is:

$ elfdump -c /system/object/genunix/object

Section Header[1]:  sh_name: .shstrtab
    sh_addr:      0xa6eaea30      sh_flags:   [ SHF_STRINGS ]
    sh_size:      0x46            sh_type:    [ SHT_STRTAB ]
    sh_offset:    0x1c4           sh_entsize: 0
    sh_link:      0               sh_info:    0
    sh_addralign: 0x8

Section Header[2]:  sh_name: .SUNW_ctf
    sh_addr:      0xa61f7000      sh_flags:   0
    sh_size:      0x2e79d         sh_type:    [ SHT_PROGBITS ]
    sh_offset:    0x20a           sh_entsize: 0
    sh_link:      3               sh_info:    0
    sh_addralign: 0x8

Section Header[3]:  sh_name: .symtab
    sh_addr:      0xa61b5050      sh_flags:   0
    sh_size:      0x1f7d0         sh_type:    [ SHT_SYMTAB ]
    sh_offset:    0x2e9a7         sh_entsize: 0x10
    sh_link:      4               sh_info:    0
    sh_addralign: 0x8

Section Header[4]:  sh_name: .strtab
    sh_addr:      0xa61d96dc      sh_flags:   [ SHF_STRINGS ]
    sh_size:      0x1cd5e         sh_type:    [ SHT_STRTAB ]
    sh_offset:    0x4e177         sh_entsize: 0
    sh_link:      0               sh_info:    0
    sh_addralign: 0x8

Section Header[5]:  sh_name: .text
    sh_addr:      0xfe87e4a0      sh_flags:   [ SHF_ALLOC SHF_EXECINSTR ]
    sh_size:      0x198dc0        sh_type:    [ SHT_NOBITS ]
    sh_offset:    0x6aed5         sh_entsize: 0
    sh_link:      0               sh_info:    0
    sh_addralign: 0x8

Section Header[6]:  sh_name: .data
    sh_addr:      0xfec3eba0      sh_flags:   [ SHF_WRITE SHF_ALLOC ]
    sh_size:      0x3e1c0         sh_type:    [ SHT_NOBITS ]
    sh_offset:    0x6aed5         sh_entsize: 0
    sh_link:      0               sh_info:    0
    sh_addralign: 0x8

Section Header[7]:  sh_name: .bss
    sh_addr:      0xfed7a5f0      sh_flags:   [ SHF_WRITE SHF_ALLOC ]
    sh_size:      0x7664          sh_type:    [ SHT_NOBITS ]
    sh_offset:    0x6aed5         sh_entsize: 0
    sh_link:      0               sh_info:    0
    sh_addralign: 0x8

Section Header[8]:  sh_name: .info
    sh_addr:      0x1             sh_flags:   0
    sh_size:      0x4             sh_type:    [ SHT_PROGBITS ]
    sh_offset:    0x6aed5         sh_entsize: 0
    sh_link:      0               sh_info:    0
    sh_addralign: 0x8

Section Header[9]:  sh_name: .filename
    sh_addr:      0xfec3e8e0      sh_flags:   0
    sh_size:      0x10            sh_type:    [ SHT_PROGBITS ]
    sh_offset:    0x6aed9         sh_entsize: 0
    sh_link:      0               sh_info:    0
    sh_addralign: 0x8

The string table, symbol table, and CTF data are all complete. You'll notice that we also have text, data, and bss, but they're marked SHT_NOBITS (which means they're not present in the file). We use the section headers to extract information about the address range for each section, but we can't actually export the data for security reasons. Obviously, letting ordinary users see the data section of loaded modules would be a Bad Thing.

To end in typical "Behind the Music" fashion - after a nightmare descent into drug and alcohol abuse, objfs once again was able to take control of its life (thanks mostly to a loving relationship with libdtrace), and now lives a relaxing life on a Montana ranch.




So my last few posts have sparked quite a bit of discussion out there, appearing on slashdot as well as OSNews. It's been quite an interesting experience, though it's had a significant effect on my work productivity today :-) While I'm not responding to every post, I promise that I'm reading them (and thanks to those of you sending private mail, I promise to respond soon).I have to say that I've been reasonably impressed with the discussion so far. Slashdot, as usual, leaves something to be desired (even reading at +5), but the comments in my blog and in my email have been for the most part very reasonable. There is a certain amount of typical fanboy drivel (more so on the pro-Linux side, but only because Solaris doesn't have many fanboys). But there's also a reasonable contingent on Slashdot fighting down the baseless arguments of the zealots. In the past, the debate has been rather one-sided. Solaris is usually dismissed as an OS for big computers for people with lots of money. Sun has traditionally let our marketing department do all the talking, which works really well for CEOs and CTOs (our paying customers), but not as well for spreading detailed technical knowledge to the developer community. We're changing our business model - encouraging blogs, releasing Solaris Express, hosting discussions with actual kernel engineers, and eventually open sourcing Solaris - to encourage direct connections with the community at large.We've been listening to the (often one-sided) discussion for a long time now, and it shows in Solaris. Solaris 10 has killer performance, even on single- and dual-processor x86 machines. Hardware support has been greatly improved (S10 installed on my Toshiba laptop without a hitch). We're focusing on the desktop again, with X.Org integration, Gnome 2.6, Mozilla 1.7, and better open source packages all around. Sure, we're still playing catchup in a lot of ways, but we're learning. 
I only hope the Linux community can learn from Solaris's strengths, and dismiss many of the Solaris stereotypes that have been implanted (not always without merit) over the course of history. Healthy competition is good, and can only benefit the customer.As much as I would like to continue this debate forever, I think it's time I get back to doing what I really love - making Solaris into the best OS it can be. I'll probably be focusing on more technical posts for a while, but I'll revive the discussion again at a future point. Until then, feel free to continue posting comments or sending me mail. I do read them, even if I don't respond publicly.



Rebutting a rebuttal

Normally, I'm hesitant to engage people on the internet in an argument. Typically, you get about one or two useful responses before it descends into a complete flame war. So I thought I'd take my one remaining useful response and comment on this blog, which is a rebuttal to my previous post and accuses me of being a "single Sun misinformed developer".

To state that the Linux kernel developers don't believe in those good, sound engineering values, is pretty disingenuous ... These [sic] is pretty much a baseless argument just wanting to happen, and as he doesn't point out anything specific, I'll ignore it for now.

Sigh. It's one thing to believe in sound engineering values, and quite another to develop them as an integral part of your OS. I'm not saying that Linux doesn't care about these things at all, just that they're not a high priority. The original goal of my post was not "our technology is better than yours," only that we have different priorities. But if you want a technology comparison, here are some Solaris examples:

Reliability - Reliability is more than just "we're more stable than Windows." We need to be reliable in the face of hardware failure and service failure. If I get an uncorrectable error on a user process page, predictive self healing can restart the service without rebooting the machine and without risking memory corruption. The Fault Management Architecture can offline CPUs in response to hardware errors and retire pages based on the frequency of correctable errors. ZFS provides complete end-to-end checksums, capable of detecting phantom writes and firmware bugs, and can automatically repair bad data without affecting the application. The service management facility can ensure that transient application failures do not result in a loss of availability.

Serviceability - When things go wrong (and trust me, they will go wrong), we need to be able to solve the problem in as little time as possible with the lowest cost to the customer and Sun.
If the kernel crashes, we get a concise file that customers can send to support without having to reproduce the problem on an instrumented kernel or instruct support how to recreate my production environment. With the fault management architecture, an administrator can walk up to any Solaris machine, type a single command, and see a history of all faulty components in the system, when and how they were repaired, and the severity of the problems. All hardware failures are linked to an online knowledge base with recommended repair procedures and best practices. With ZFS, disks exhibiting questionable data integrity can automatically be removed from storage pools without interruption of normal service to prevent outright failure. Dynamic reconfiguration allows entire CPU boards to be removed from the system without rebooting.

Observability - DTrace allows real-world administrators (not kernel developers) to see exactly what is happening on their system, tracing arbitrary data from user applications and the kernel, aggregating it and correlating disjoint events. With kmdb, developers can examine the static state of the kernel, step through kernel functions, and modify kernel memory. Commands like trapstat provide hardware trap statistics, and CPU event counters can be used to gather hardware-assisted profiling data via libcpc.

Resource management - With Solaris resource management, users can control memory and CPU shares, IPC tunables, and a variety of other constraints on a per-process basis. Processes can be grouped into tasks to allow easy management of a class of applications. Zones allow a system to be partitioned and administered from a central location, dividing the same physical resources amongst OS-like instances. With process rights management, users can be given individual privileges to manage privileged resources without having to have full root access.

That's just a few features of Solaris off the top of my head.
There are Linux projects out there approaching some semblance of these features, but I'd contend that none of them is as polished and comprehensive as those found in Solaris, and most live as patches that have made their way into few mainstream distributions (RedHat), despite years of development. This is simply because these ideas are not given the highest priority, which is a judgement call by the community and perfectly reasonable.

Crash dumps. The main reason this option has been rejected is the lack of a real, working implementation. But this is being fixed. Look at the kexec based crashdump patch that is now in the latest -mm kernel tree. That is the way to go with regards to crash dumps, and is showing real promise. Eventually that feature will make it into the main kernel tree.

I blamed crash dumps on Linus. You blame their failure on poor implementation. Whatever your explanation, the fact remains that this project was started back in 1997-99. Forget kernel development - this is our absolute #1 priority when it comes to serviceability. It has taken the Linux community seven years to get something that is "showing real promise" and still not in the main kernel tree. Not to mention the post-mortem tools are extremely basic (30 different commands, compared with the 700+ available in mdb).

Kernel debuggers. Ah, a fun one. I'll leave this one alone only to state that I have never needed to use one, in all the years of my kernel development. But I know other developers who swear by them. So, to each their own. For hardware bringup, they are essential. But for the rest of the kernel community, they would be extra baggage.

Yes, kernel debuggers are needed in a relatively small number of situations. But in these situations, they're absolutely essential. Also, just because you haven't used one yet doesn't mean it isn't necessary. All bugs can be solved simply by looking at the source code long enough. But that doesn't mean it's practical.
The claim that it's "extra baggage" is bizarre. Are you worried about additional source code? Binary bloat? How can a kernel debugger be extra baggage if I don't use it?

Tracing frameworks. Hm, then what do you call the kprobes code that is in the mainline kernel tree right now? :) This suffered the same issue that the crash dump code suffered, it wasn't in a good enough state to merge, so it took a long time to get there. But it's there now, so no one can argue it again.

Yes, the kprobes code that was just accepted into the mainline branch only a week and a half ago (that must be why I'm so misinformed). KProbes seems like a good first step, but it needs to be tied into a framework that administrators can actually use. LTT is a good beginning, but it's been under development for five years and still isn't integrated into the main tree. It's quite obvious that the Linux kernel maintainers don't perceive tracing as anything more than a semi-useful debugging tool. There's more to a tracing framework than just KProbes - any of our avid DTrace customers (administrators) are living proof of this falsehood.

We (kernel developers) do not have to accept any feature that we deem is not properly implemented, just because some customer or manager tells us we have to have it. In order to get your stuff into the kernel, you must first tell the community why it is necessary, and so many people often forget this. Tell us why we really need to add this new feature to the kernel, and ensure us that you will stick around to maintain it over time.

First of all, every feature in Solaris 10 was conceived by Solaris kernel developers based on a decade of interactions with real customers solving real problems. We're not just a bunch of monkeys out to do management's bidding. Second of all, you don't implement a feature "just because some customer" wants it? What better reason could you possibly have?
Perhaps you're thinking that because some customer really wants something, we just integrate whatever junk we can come up with in a month's time. If this were true, don't you think you'd see KProbes in Solaris instead of DTrace?

First off, this [binary compatibility] is an argument that no user cares about.

We have customers paying tens of millions of dollars precisely because we claim backwards compatibility. This is an example of where Linux is just competing in a different space than Solaris, hence the different priorities. If your customer is a 25 year old Linux advocate managing 10 servers for the University, then you're probably right. But if your customer is a 200,000 employee company with tens of thousands of servers and hundreds of millions of dollars riding on their applications, then you're dead wrong.

The arguments he makes for binary incompatibility are all ones I've heard before. Yes, not having binary compatibility makes it easier to change a kernel interface. But it's not exactly rocket science to maintain an evolving interface with backwards compatibility. Yes, you can rewrite interfaces. But you can do so in a well defined and well documented manner. How hard is it to have a stable DDI for all of 2.4, without worrying that 2.4.21 is incompatible with 2.4.22-ac? As far as I'm concerned, changing compiler options that break binary compatibility is a bug. Fix your interfaces so they don't depend on structures that change definition at the drop of a hat. Binary compatibility can be a pain, but it's very important to a lot of our customers.

And when Sun realizes the error of their ways, we'll still be here making Linux into one of the most stable, feature rich, and widely used operating systems in the world.

For some reason, all Linux advocates have an "us or them" philosophy. In the end, we have exactly what I said at the beginning of my first post. Solaris and Linux have different goals and different philosophies. Solaris is better at many things.
Linux is better at many things. For our customers, for our business, Linux simply isn't moving in the right direction for us. There are only so many "we're getting better" comments about Linux that I can take: the proof is in the pudding. The Linux community at large just isn't motivated to accomplish the same goals we have in Solaris, which is perfectly fine. I like Linux; it has many admirable qualities (great hardware support, for example). But it just doesn't align with what we're trying to accomplish in Solaris.



GPL thoughts and clarifications

So it seems my previous entry has finally started to stir up some controversy. I'll address some of the technical issues raised here shortly. But first I thought I'd clarify my view of the GPL, with the help of an analogy. Let's say that I manufacture wooden two by fours, and that I want to make them freely available under an "open source" license. There are several options out there:

1. You have the right to use and modify my 2x4s to your heart's content. This is the basis for open source software. It protects the rights of the consumer, but imparts few rights to the developer.

2. You have the right to use my 2x4s however you please, but if you modify one, then you have to make that modification freely available to the public in the same fashion as the original. This gives the developer a few more guarantees about what can and cannot be done with his or her contributions. It protects the developer's rights without infringing on the rights of the consumer.

3. You have the right to use my 2x4 as-is, but if you decide to build a house with it, then your house must be as freely available as my 2x4. This is the provision of the GPL that I don't agree with, and neither do customers that we've talked to. It protects my rights as a developer, but severely limits the rights of the consumer in what can and cannot be done with my public donation.

This analogy has some obvious flaws. Open source software is neither excludable nor rival, unlike the house I just built. There is also a tenuous line between derived works and fair use. In my example, I wouldn't have the right to the furniture put into your house. But I feel like it's a reasonable simplification of my earlier point. As an open source advocate, I would argue that #1 is the "most free". This is why, in many ways, the BSD license is the "most open" of all the main licenses. As a developer, I would argue that #2 is the best solution. My contribution is protected - no one can make changes without giving it back to me (and the community at large).
But my code is essentially a service, and I feel everyone should have a right to that service, even if they go off and make money from it. The problems arise when we get to #3, which is the essential controversy of the GPL. To me, this is a personal choice, which is why GPL advocacy often turns into pseudo-religious fanaticism. In many ways, arguing with a GPL zealot is like an atheist arguing with a religious fundamentalist. In the end, they agree on nothing. The atheist leaves understanding the fundamentalist's beliefs and respects his or her right to have them. The fundamentalist leaves believing that the atheist will burn in hell for not accepting the one true religion. This would be fine, except that GPL advocates often blur the line between #2 and #3, and make it seem like the protections of #2 can only be had if you fully embrace the GPL in all its glory. I support the rights provided by #2. You can scream and shout about the benefits of #3 and how it's an inalienable right of all people, but in the end I just don't agree. Don't equate the GPL with open source - if you do want to argue the GPL, make it very clear which points you are arguing for.

One final comment about GPL advocacy. Time and again I see people talk about easing migration, avoiding vendor lock-in, and the necessity of consumer choice. But in the same breath they turn around and scream that you must accept the GPL, and any other license would be pure evil (at best, a slow and painful death). Why is it that we have the right to choose everything except our choice of license? I like Linux. I like the GPL. The GPL is not evil. There are a lot of great projects that benefit from the GPL. But it isn't everything to all people, and in my opinion it's not what's best for OpenSolaris.

[ UPDATE ]

As has been enumerated in the comments on this post, the original intent of the analogy is to show the definition of derived works. As mentioned in the comments:

Say I post an example of a function foo() to my website.
Oracle goes and uses that function in their software. They make no changes to it whatsoever, and are willing to distribute that function in source code form with their product. If it were GPL, they would now have to release all of Oracle under the GPL, even though my code has not been altered. The consumer's rights are preserved - they still have the same rights to my code as before it was put into Oracle. I just don't see why they have a right to code that's not mine.

Though I didn't explain it well enough, the analogy was never intended to encompass right to use, ownership, distribution, or any of the other qualities of the GPL. It is a specific issue with one part of the GPL, and the analogy is intentionally simplistic in order to demonstrate this fact.



Debugging hard problems

Disclaimer: This is a technical post and not for the faint of heart. You have been warned.

I thought I'd share an interesting debugging session I had a while ago. Personally, I like debugging war stories. They are stories about hard problems solved by talented engineers. Any Solaris kernel engineer will tell you that we love hard problems - a necessary desire for anyone doing operating systems work. Every software discipline has its share of hard problems, but none approach the sheer volume, complexity, and variety of problems we encounter while working on an OS like Solaris. This was one of those "head scratching" bugs that really had me baffled for a while. While by no means the most difficult bug I have seen (a demographic populated by memory corruption, subtle transient failures, and esoteric hardware interactions), it was certainly not intuitive. What made this bug really difficult was that a simple programming mistake exposed a longstanding and very subtle race condition with timing on multiprocessor x86 machines.

Of course, I didn't know this at the time. All I knew was that a few of our amd64 build machines were randomly hanging. The system would be up and running fine for a while, and then *poof* no more keystrokes or ping. With KMDB, we just sent a break to the console and took a crash dump. As you'll see, this problem would have been virtually impossible to solve without a crash dump, due to its unpredictable nature. I loaded mdb on one of these crash dumps to see what was going on:

> ::cpuinfo
 ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH   THREAD   PROC
  0 fec1f1a0  1f   16    0   0  yes    no t-300013 9979c340 make.bin
  1 80fdc080  1f   46    0 164  yes    no t-320107 813adde0 sched
  2 fec22ca8  1b   19    0  99   no    no t-310905 80aa3de0 sched
  3 80fce440  1f   20    0  60  yes    no t-316826 812f79a0 fsflush

We have threads stuck spinning on all CPUs (as evidenced by the long time since last switch).
Each thread is stuck waiting for a dispatch lock, except for one:

> 914d4de0::findstack -v
914d4a84 0x2182(914d4de0, 7)
914d4ac0 turnstile_block+0x1b9(a4cc3170, 0, 92f32b40, fec02738, 0, 0)
914d4b0c mutex_vector_enter+0x328()
914d4b2c ipcl_classify_v4+0x30c(a622c440, 6, 14, 0)
914d4b68 ip_tcp_input+0x756(a622c440, 922c80f0, 9178f754, 0, 918cc328, a622c440)
914d4bb8 ip_rput+0x623(918ff350, a622c440)
914d4bf0 putnext+0x2a0(918ff350, a622c440)
914d4d3c gld_recv_tagged+0x108()
914d4d50 gld_recv+0x10(9189d000, a622c440)
914d4d64 bge_passup_chain+0x40(9189d000, a622c440)
914d4d80 bge_receive+0x60(9179c000, 917cf800)
914d4da8 bge_gld_intr+0x10d(9189d000)
914d4db8 gld_intr+0x24(9189d000)
9e90dd64 cas64+0x1a(b143a8a0, 3)
9e90de78 trap+0x101b(9e90de8c, 8ff904bc, 1)
9e90de8c kcopy+0x4a(8ff904bc, 9e90df14, 3)
9e90df00 copyin_nowatch+0x27(8ff904bc, 9e90df14, 3)
9e90df18 instr_is_prefetch+0x15(8ff904bc)
9e90df98 trap+0x6b2(9e90dfac, 8ff904bc, 1)
9e90dfac 0x8ff904bc()

Strangely, we're stuck in new_mstate():

> turnstile_block+0x1b9::dis
turnstile_block+0x1a2:  adcl $0x0,%ecx
turnstile_block+0x1a5:  movl %eax,0x3bc(%edx)
turnstile_block+0x1ab:  movl %ecx,0x3c0(%edx)
turnstile_block+0x1b1:  pushl $0x7
turnstile_block+0x1b3:  pushl %esi
turnstile_block+0x1b4:  call -0x93dae
turnstile_block+0x1b9:  addl $0x8,%esp
turnstile_block+0x1bc:  cmpl $0x6,(%ebx)
turnstile_block+0x1bf:  jne +0x1a
turnstile_block+0x1c1:  movl -0x20(%ebp),%eax
turnstile_block+0x1c4:  movb $0x1,0x27b(%eax)
turnstile_block+0x1cb:  movb $0x0,0x27a(%eax)
turnstile_block+0x1d2:  movl $0x1,0x40(%esi)

One of the small but important features in Solaris 10 is that microstate accounting is turned on by default. This means we record timestamps when changing system state, rather than relying on clock-based process accounting. Unfortunately, I can't yet post the source code to new_mstate() (another benefit OpenSolaris will bring - no secrets). But I can say that there is a do/while loop where we spin waiting for the present to catch up to the past.
The ia32 architecture was not designed with large MP systems in mind. In particular, the chips have a high resolution timestamp counter (tsc) with the particularly annoying property that different CPUs do not have to be in sync. This means that a thread can read the tsc on one CPU, get bounced to another, and read a tsc value that appears to be in the past. This is very rare in practice, and we should never go through this new_mstate() loop more than once or twice. On these machines, we were looping for a very long time.

We read this tsc value from gethrtime_unscaled(). Slogging through the assembly for new_mstate(), we see that the result is stored as a 64-bit value in -0x4(%ebp) and -0x1c(%ebp):

new_mstate+0xe3:  call -0x742d9
new_mstate+0xe8:  movl %eax,-0x4(%ebp)
new_mstate+0xeb:  movl %edx,-0x1c(%ebp)

Looking at our current stack, we are able to reconstruct the value:

> 0x914d4a84-4/X
0x914d4a80:     d53367ab
> 0x914d4a84-1c/X
0x914d4a68:     2182

This gives us the value 0x2182d53367ab. But our microstate accounting data tells us something entirely different:

> 914d4de0::print kthread_t t_lwp->lwp_mstate
{
    t_lwp->lwp_mstate.ms_prev = 0x3
    t_lwp->lwp_mstate.ms_start = 0x1b3c14bea4be
    t_lwp->lwp_mstate.ms_term = 0
    t_lwp->lwp_mstate.ms_state_start = 0x3678297e1b0a
    t_lwp->lwp_mstate.ms_acct = [ 0, 0x1b3c14bf554d, 0, 0, 0, 0, 0, 0, 0x1052, 0x10ad ]
}

Now we see the problem. We're trying to catch up to 0x3678297e1b0a clock ticks, but we're only at 0x2182d53367ab. This means we're going to stay in this loop for the next 23,043,913,724,767 clock ticks! I wouldn't count on this routine returning any time soon, even on a 2.2GHz Opteron. Using a complicated MDB pipeline, we can print the starting microstate time for every thread in the system and sort the output to find the highest values:

> ::walk thread | ::print kthread_t t_lwp | ::grep ".!=0" | ::print klwp_t \
    lwp_mstate.ms_state_start ! awk -Fx '{print $2}' | \
    perl -e 'while (<>) { printf "%s\n", hex $_ }' | sort -n | tail
30031410702452
30031411732325
30031412153016
30031412976466
30031485108972
30031485108972
36845795578503
59889720105738
59889720105738
59889720105738

The three highest values all belong to threads in the same process (which share an LWP), and are clearly way out of line. This is where the head scratching begins. We have a thread which, at one point in time, decided to put approximately double the expected value into ms_state_start, and has since gone back to normal, waiting for a past which will (effectively) never arrive.

So I pore through the source code for gethrtime_unscaled(), which is really tsc_gethrtimeunscaled(). After searching for half an hour, with several red herrings, I finally discover that, through some botched indirection, we were actually calling tsc_gethrtime() from gethrtime_unscaled() by mistake. One version returns the number of clock ticks (unscaled), while the other returns the number of nanoseconds (scaled, and more expensive to calculate). Immediately this revealed itself as the source of many other weird timing bugs we had been seeing with the amd64 kernel. Undoing this mistake brought much of our system back to normal, but I was worried that there was another bug lurking here. Even if we were getting nanoseconds instead of clock ticks, we still shouldn't end up spinning in the kernel - something else was fishy.

The problem was very obviously a nasty race condition. Further mdb analysis, DTrace analysis, or custom instrumentation would be futile, and wouldn't provide any more information than I already had. So I retreated to the source code, where things finally started to come together.

The implementation of tsc_gethrtime() is actually quite complicated. When CPUs are offlined/onlined, we have to do a little dance to calculate the amount of skew between the other CPUs. On top of this, we have to keep the CPUs from drifting too far from one another.
We have a cyclic that fires once a second to record the 'master' tsc value for a single CPU, which is used to create a running baseline. Subsequent calls to tsc_gethrtime() use the delta from this baseline to create a unified view of time. It turns out there had been a subtle bug in this code for a long time, and only by calling gethrtime() on every microstate transition were we able to expose it.

In order to factor in power management, we have a check in tsc_gethrtime() to see if the present is less than the past, in which case we ignore the delta from the last master tsc value. This doesn't seem quite right. Because the tsc registers can drift between CPUs, there is the remote possibility that we read the tsc on the master CPU 0 at nearly the same instant we read another tsc value on CPU 1. If the CPUs have drifted just the right amount, our new tsc value will appear to be in the past, even though it really isn't. The end result is that we nearly double the returned value, which is the exact behavior we were seeing.

This is one of those "one in a million" race conditions. You have to be on an MP x86 system where the high resolution timers have drifted ever so slightly: too much, and a different algorithm kicks in. Then, you have to call gethrtime() at almost exactly the same time as tsc_tick() is called from the cyclic subsystem (once a second). If you hit the window of a few clock cycles just right, you'll get an instantaneous bogus value, but the next call to gethrtime() will return to normal. This isn't fatal in most situations. The only reason we found it at all is because we were accidentally calling gethrtime() thousands of times per second, resulting in a fatal microstate accounting failure.

Hopefully this has given you some insight into the debugging process that we kernel developers (and our support engineers) go through every day, as well as demonstrating the usefulness of mdb and post mortem debugging.
After six long hours of on-and-off debugging, I was finally able to nail this bug. With a one-line change, I was able to fix a glaring (but difficult to identify) bug in the amd64 kernel, while simultaneously avoiding a nasty race condition that we had been hiding from for a long time.



Analysts on OpenSolaris

There's been a lot of press about Sun recently thanks to our Network Computing 04Q3 event. It's hard to miss some of the coverage out there. Jim pointed out this article over at eWeek, which has some good suggestions, but also some gross misconceptions. I thought I'd look at some of the quotes and respond, as a Solaris kernel developer, and explain what OpenSolaris really means and why we're doing it.

First, there is the obligatory comment about Linux vs. Solaris within Sun. John Loiacono explained some of the reasons why we're investing in Solaris rather than jumping on the Linux bandwagon. To which Eric Raymond responded:

The claim that not controlling Linux limits one's ability to innovate is a load of horse puckey ... In open source, you always have that ability, up to and including forking the code base, if you don't like the way things are being run.

Taken out of context, this seems like an entirely reasonable position. But when you put it in the context of OpenSolaris vs. Linux, it quickly becomes irrelevant. The main reason we can't just jump into Linux is that Linux doesn't align with our engineering principles, and no amount of patches will ever change that. In the Solaris kernel group, we have strong beliefs in reliability, observability, serviceability, resource management, and binary compatibility. Linus has shown time and time again that these just aren't part of his core principles, and in the end he is in sole control of Linux's future. Projects such as crash dumps, kernel debuggers, and tracing frameworks have been repeatedly rejected by Linus, often because they are perceived as vendor-added features. Not to mention the complete lack of commitment to binary compatibility (outside of the system call interface): kernel developers make it nearly impossible to maintain a driver outside the Linux source tree (nVidia being the rare exception), whereas the same apps (and drivers) that you wrote for Solaris 2.5.1 will continue to run on Solaris 10.
Large projects like Zones, DTrace, and Predictive Self Healing could never be integrated into Linux simply because they are too large and touch too many parts of the code. Kernel maintainers have rejected patches simply because of the amount of change (SMF, for example, modified over 1,000 files). That's not to say that Linux doesn't have many commendable principles, not the least of which is its commitment to open source. But there's just no way that we can shoehorn Solaris principles into the Linux kernel.

Of course, as Eric Raymond says, we could create a fork of the Linux kernel. But this idea lies somewhere between idealistic and completely ludicrous. First of all, there's the sheer engineering effort. Even after porting all the huge Solaris 10 (and 9, and 8 ...) features to a branch of the Linux kernel, we would enter into a perpetual game of catch-up with the main branch. We'd be spending all of our time merging patches and testing rather than innovating. With features such as guaranteed binary compatibility, it may not even be possible. Forget the fact that such a fork would probably never be accepted by the Linux community at large. The real problem with creating a fork of the Linux kernel is simply that the GPL doesn't align with our corporate principles. We want to have ISVs embedding Solaris in their set-top boxes without worrying about how to dance around the GPL while keeping their IP private. Even if you can tiptoe around the issue now by putting your code in a self-contained module, the Linux kernel developers could actively work against you in the future. Of course, we could still choose a GPL-compatible license for OpenSolaris, at which point I'll end up eating my words.

In the end, dumping Solaris into Linux makes no sense, either technically or philosophically. I have yet to hear a convincing argument for why ditching Solaris would be a good thing for Sun. And I can't begin to imagine a justification for forking the Linux kernel.
To be clear, we're not out to rule OpenSolaris with an iron fist. Because we own our intellectual property, we can make a licensing decision that reflects our corporate goals. And because we've put all the engineering effort behind that IP, we can instill similar beliefs into the community that we spawn. These beliefs may change over time: we would love to see an OpenSolaris community where we are merely a participant in a much larger game. But we'll be able to build a foundation with ideas that are important to us, and fundamentally different from those of the Linux community.

Getting back to the article, we have another quote, from Gary Barnett of Ovum:

If Sun releases only Solaris for SPARC with a peculiar open-source license that's not compatible with the GPL, it's not going to be a big deal. All you'll get is the right to help Sun improve its software ... If they produce a license that's not in the spirit of the open-source community, they won't do themselves any favors at all

It's hard to know where to begin with statements like these. First and foremost is the idea that we will release Solaris "only for SPARC". No matter how many times we say it, people just don't seem to get it. Both x86 and SPARC are built from the same source! There is a very small bit of Solaris that is platform specific, and it is scattered throughout the codebase such that it would be impossible to separate. Second, we've already stated that the license will be OSI compliant. I can't elaborate on specifics, or on whether it will be GPL compatible, but it really shouldn't matter as long as it has OSI approval. The GPL is not the be-all, end-all of open source licenses. There are bad aspects of the GPL, and not every project in the world should use it. If we do end up being GPL-incompatible, the only downside will be that you cannot use the source code in Linux or another GPL project. But why must everything exist to contribute to Linux?
I can't take Linux code and drop it into FreeBSD, so why can't the same be true of OpenSolaris? Not to mention the many benefits of being GPL-incompatible, like being able to mix OpenSolaris code with proprietary source code.

Most importantly, contributors to OpenSolaris won't just be helping "Sun improve its software." By nature of making it open source, it will no longer be "our software". All the good things that come with an OSI license (the right to use, fork, and modify code) will prevent us from ever owning OpenSolaris. If you contribute, you will be helping improve your software, which you can then use, modify, or repackage to your heart's content. You will be part of a community. Yes, it won't be the Linux community. But it will be a community you can choose to join, either as a developer or a user, as an alternative to Linux.

Sun is hoping that making source code available will cause a community as large, as diverse and as enthusiastic as that around Linux to gather around Solaris. Just offering source code is not enough to create such a community. Sun would need to do a great deal of work to make that happen.

Now here's a comment (by Dan Kusnetzky of IDC) that actually makes sense. He understands that we are out to create a community. Like us, he knows that just throwing source code over the wall is not enough. He has good suggestions.

And we're listening. We have been researching this project for a very long time. We have talked to numerous customers, ISVs, and open source leaders to try to define what makes a community successful. Clever people have already noticed that we have begun an (invite-only) pilot program; several open source advocates are already involved in helping us shape a vision for OpenSolaris. Creating a community doesn't happen overnight. We are gradually building something meaningful, being sure to take the right approach at each step. The community will evolve over time. And I'm looking forward to it.



Sold! To bidder number 42...

What am I up to these days?

Recently, I putback my last round of major bugfixes in my "traditional" area of expertise - procfs, libproc, mdb, etc. And since I'm not attached to any of the major S10 projects, I can actually pick and choose what to work on next. I have tons of projects I'd love to start, but I can't justify new projects so late in the release cycle. There are simply too many other things that need to be fixed first. The downside of choice is that eventually, word gets out that you have copious amounts of free time with only two months of development left in the release. I've basically been put up for auction, except the bidding price is always the same - the work itself becomes the reward.

I've been spending quality time with our bug database, as well as entertaining offers from potential suitors. I've been pinch hitting on the amd64 project, helping out with greenline (SMF), and most recently signed up to help design and implement a new subsystem for ZFS. I'm also getting to play with cool things like ZFS on amd64: we're getting it up and running on some very cool (and super-secret) hardware that I unfortunately can't talk about. I'm learning about parts of Solaris that I never thought I'd have a chance to work with. It's all very exciting, but I will definitely be happy when S10 ships so I can get back to some personal projects that have been evolving in the depths of my mind. Stay tuned...

So, do I hear a hundred? How about two hundred?



Linux Kernel Debugging with KDB

So it's been a while since my KMDB post, but I promised I would do some investigation into kernel debugging on the Linux side. Keep in mind that I have no Linux kernel experience. While I will try to be thorough in my research, there may be things I miss simply from lack of experience or a good test system. Feel free to comment on any errors or omissions.

We'll try to solve the same problem that I approached with KMDB in the last post: a deadlock involving reader-writer locks. Linux has a choice of two debuggers, kdb and kgdb (though User Mode Linux presents interesting possibilities). In this post I'll be taking a look at KDB.

Fire up KDB

Chances are you're not running a Linux kernel with KDB installed. Some distros (like Debian) make it easier to download and apply the patch, but none seems to include it by default (admittedly, I didn't do a very thorough search). This means you'll have to go download the patch, apply it, tweak some kernel config variables (CONFIG_KDB and CONFIG_FRAME_POINTER), recompile/reinstall your kernel, and reboot. Hopefully you've done all this beforehand, because as soon as you reboot you've lost your bug (possibly forever - race conditions are fickle creatures). Assuming you were running a kdb-enabled kernel when you hit this bug, you then run:

# echo "1" > /proc/sys/kernel/kdb

And then press the 'pause' key on your keyboard. Alternatively, you can hook up a serial console, but I'll opt for the easy way out.

Find our troubled thread

First, we need to find the pid of our offending process. The only way to do this is to use the 'ps' command to display all processes on the system, and then pick out (visually) which pid belongs to our 'ps' process. Once we have this information, we can use 'btp <pid>' to get a stack trace.

Get the address of the rwlock

This step is very similar to the one we took when using kmdb. The stack trace produced by 'btp' includes frame pointers like kmdb's $C.
Looking back over my kmdb post, it wasn't immediately clear where I got that magic starting number - it came from the frame pointer in the (verbose) stack trace. In any case, we use 'id <addr>' to disassemble the code around our call site. We then use 'mdr <addr+offset>' to examine the memory where the original value is saved. This gets much more interesting (painful) on amd64, where arguments are passed in registers and may not get pushed on the stack until several frames later.

Without a paddle?

At this point, the next step should be "Find who owns the reader lock." But I can't find any commands in the kdb manpages that would help us determine this. Without kmdb's ::kgrep, we're stuck searching for a needle in a haystack. Somewhere on this system, one or more threads have referenced this rwlock in the past. Our only course of action is to try 'bta', which will give us a stack trace of every single process on the system. With a deep understanding of the code, a great deal of persistence, and a little bit of luck, we may be able to pick out the offending stack just by sight. This quickly becomes impractical on large systems, not to mention difficult to verify and prone to error.

With KDB we can do some basic debugging tasks, but it still relies on giant "leaps of faith" to correlate two pieces of seemingly disjoint data (two threads involved in a deadlock, for example). As a point of comparison, KDB provides 40 different commands, while KMDB provides 771 (356 dcmds and 415 walkers on my current desktop). Next week I'll look at kgdb and see if it fills in any of these gaps.



Kernel Debugging with KMDB

I've talked a lot in the past about kernel debugging, usually comparing the Solaris tools against what's available in Linux and other operating systems. I thought I'd walk through a concrete example of kernel debugging, first with our debugger KMDB (which replaced the venerable kadb in build 61 of Solaris 10), and then with the available Linux kernel debuggers (KDB and KGDB). I'll also examine post-mortem analysis tools for Linux.

For today's lesson, I've picked a problem where a post-mortem or live kernel debugger is absolutely necessary. Those of you who have done multithreaded programming in the past have most likely had to deal with a deadlock here or there. In your garden variety deadlock, one thread grabs lock A and then lock B, while another thread grabs lock B and then lock A. If they both get their first lock but not their second, they'll both be blocked waiting for locks which will never be released. I've decided to spice it up a bit by looking at reader-writer locks. With an rwlock, you can have any number of readers, but only one writer. This makes problems much more difficult to debug. With a mutex, the current owner is stored as part of the data structure. But with an rwlock, the only indication you have is the number of active readers.

In our fantasy land, we have the following situation: commands are getting stuck in the kernel. Forget kill -9; if you're stuck waiting on a kernel synchronization primitive (without accepting interruption by signals), you're in trouble. The timeline for our threads looks something like this:

thread 1                        thread 2
rw_enter(rwlock, RW_READER)
<stuck somewhere>               rw_enter(rwlock, RW_WRITER)
                                <blocked>

Loading KMDB

Something's ill on our system. We know that our 'ps' command is hanging for some reason. So we enter KMDB:

# mdb -K
Loaded modules: [ ufs unix krtld nca lofs genunix ip usba specfs nfs random sctp ]
[0]>

KMDB is ready to be loaded on every system running Solaris 10 (build 61 or later).
You can also boot -k, which will load kmdb in the background. If you do this, you can send a break (via a tip line or L1/F1-A) to get into KMDB.

Find our troubled thread

In this example, we know that 'ps' is hanging. Finding the troubled stack is easy:

[0]> ::pgrep ps | ::walk thread | ::findstack
rw_enter_sleep+0x144()
as_pageunlock+0x48()
default_physio+0x340()
pread+0x244()
syscall_trap+0x88()

Get the address of the rwlock

This takes a little bit of work, because we don't have the necessary debugging information in the kernel to restore arguments from arbitrary positions within a function. In this example we're at rw_enter_sleep+0x144. Using ::findstack -v or $C, we can get the frame pointer for each frame. At this point, we have to examine the disassembly around each call site to find out where our first parameter came from. We've been kicking around some ideas to do this automatically, but I'll owe Matt another nickel if I mention it out loud... At the moment, you need to do something similar to the following (on SPARC - x86 is similar):

[0]> as_pageunlock+48::dis
as_pageunlock+0x20:  mov 3, %i4
as_pageunlock+0x24:  call -0x2318
as_pageunlock+0x28:  restore
as_pageunlock+0x2c:  add %l5, 0x30, %l0
as_pageunlock+0x30:  sethi %hi(0x1069400), %i5
as_pageunlock+0x34:  sethi %hi(0x1069400), %i1
as_pageunlock+0x38:  ldx [%i5 + 0x1e8], %l6
as_pageunlock+0x3c:  mov %l0, %o0
as_pageunlock+0x40:  ldx [%i1 + 0x1e0], %i0
as_pageunlock+0x44:  add %i2, %i3, %i3
as_pageunlock+0x48:  call -0x143754
as_pageunlock+0x4c:  mov 1, %o1
as_pageunlock+0x50:  mov %l5, %o0
as_pageunlock+0x54:  mov %i2, %o1
as_pageunlock+0x58:  call -0x2cdc
as_pageunlock+0x5c:  clr %o2
as_pageunlock+0x60:  add %i3, %i0, %l7
as_pageunlock+0x64:  and %i2, %l6, %l3
as_pageunlock+0x68:  ldx [%o0 + 0x38], %i2
as_pageunlock+0x6c:  and %l7, %l6, %l4
as_pageunlock+0x70:  mov %l3, %o1

[0]> 000002a101344ee1+7ff::print struct frame
{
    fr_local = [ 0x3000eda5358, 0, 0x1, 0xfd1ff, 0x30018c3dec8, 0xb, 0x8, 0x1 ]
    fr_arg = [ 0x3df8515a500, 0xb, 0x2, 0x300068202d8, 0x3000eda5358, 0x1 ]
    fr_savfp = 0x2a101344f91
    fr_savpc = 0x1186134
    fr_argd = [ 0, 0x810, 0x102, 0, 0x3001503a1a0, 0x2a101345930 ]
    fr_argx = [ 0x300021541d8 ]
}

[0]> 0x3000eda5358::rwlock
            ADDR      OWNER/COUNT FLAGS           WAITERS
     3000eda5358        READERS=1  B011  30018878f20 (R)
                                    ||   30018c86000 (R)
                 WRITE_WANTED -------+|  30018c86300 (R)
                  HAS_WAITERS --------+  30018c86600 (R)
                                         30018c86900 (R)
                                         30018c86f00 (R)
                                         30018c87200 (R)

Find who owns the reader lock

Here's another place where things get tough. Since we don't record readers as part of the rwlock_t (only the count), we can't just look at memory and see who owns this lock. We could dump the stack of every thread and try to guess who's the owner, but on a large server this quickly becomes impractical. This is where you need ::kgrep, which will find references to a given pointer within the kernel. From there, we pipe it to ::whatis (or more recently, ::whattype), which can tell us in which stacks the value is referenced. This dramatically cuts down our list of choices. After some guess-and-check (but significantly less than looking at every thread in the system), we find the real culprit of the crime:

[0]> ::kgrep 30006820000 | ::whatis
2a100527730 is in thread 30013e04920's stack
2a100527770 is in thread 30013e04920's stack
2a100527780 is in thread 30013e04920's stack
2a100ecf730 is in thread 3000f962300's stack
2a100ecf770 is in thread 3000f962300's stack
2a100ecf780 is in thread 3000f962300's stack
2a101171730 is in thread 3001506db60's stack
2a101171770 is in thread 3001506db60's stack
[...]
[0]> 30013e05220::findstack
stack pointer for thread 30013e05220: 2a1011a09a1
[ 000002a1011a09a1 sema_p+0x140() ]
  000002a1011a0a51 biowait+0x60()
  000002a1011a0b01 ufs_getpage_miss+0x2f0()
  000002a1011a0c01 ufs_getpage+0x6a8()
  000002a1011a0d61 fop_getpage+0x48()
  000002a1011a0e31 segvn_fault+0x834()
  000002a1011a0fe1 as_fault+0x4a0()
  000002a1011a10f1 pagefault+0xac()
  000002a1011a11b1 trap+0xc14()
  000002a1011a12f1 utl0+0x4c()
[0]>

Looking at this now, it makes me want a new MDB command, ::whatthread, which would find all threads with stacks containing a given pointer. With MDB's modular design and stable API, it would be easy to connect ::kgrep and ::whatis into a self-contained command. But that's an RFE for another day.

Figure out what our culprit is doing

From here, you're pretty much on your own. In the case outlined above, we're stuck waiting for I/O while we hold a read lock on the address space. The gory details are beyond a single blog post, but what's been covered is more than enough for a crash-course in kernel debugging. The best part about this bug is that you could never figure it out with custom instrumentation or 'classic' debugging techniques. This is a race condition of the worst kind - all the printf()s in the world wouldn't have helped you here. And if you need to reboot in order to load your custom kernel, you'll probably never see the problem again.

Post mortem debugging and Linux

On Solaris, the post-mortem case is exactly the same as the live KMDB case. Simply reboot -d (or initiate a sync from KMDB or the OK prompt if the system is really hung) and then run mdb on the resulting crash dump. You can also examine the kernel while the system is still running with mdb -k, but some commands (such as ::kgrep) will fail because memory is actively changing.

In upcoming posts, I'll wade through the documentation on Linux kernel debuggers, trying to figure out how they would solve this same problem.
Note that this could be "solved" by wading through the stack traces of every thread on the system, and picking out the troublemaker by eye. This requires intimate knowledge of the source code (all of it), is prone to error, and is extremely tedious. It's painful on a simple desktop, and quickly becomes impossible for large servers with thousands of threads. Also, keep in mind that this is just a demonstration of kernel debugging - the worst bugs are much more complicated.



And now for something completely different

In a departure from all my previous blog posts, I thought I'd try my hand at a personal entry. Yesterday, the Olympic Track and Field competitions began, with the U.S. taking Silver in the Men's shot put. What was supposed to be a sweep ended up in disaster: Cantwell never qualified at the trials, Godina fouled his first two and didn't make it to the finals, and Nelson fouled all but his first throw. If you're wondering how I know all of this, it's because I'm a track nut. I've been running track and field since sophomore year of high school, and by this point know almost all the men's world records, and a fair number of the women's.

Back in high school, I ran everything; most notably the 110 hurdles, 300 hurdles, and triple jump. In college I was a walk-on, and focused solely on the triple jump (with a few 4x400 legs sprinkled here and there). The Olympic triple jump preliminaries start tomorrow. Despite the craziness at the U.S. Olympic trials, I'd put my money on Christian Olsson, with Jadel Gregario and Kenta Bell rounding out the top three.

Back in college, I was a fairly mediocre Division-I athlete, managing to jump 14.61 (47'11.5") at the Ivy League championships my sophomore year. In contrast, Olsson's best is 17.83 (58'6") and the world record is a whopping 18.29 (60'0"). When you have a moment, try measuring out 20 yards and imagine traversing the distance in 3 steps.

For your amusement, I'll leave you with some track pictures, courtesy of Dan Grossman, a great friend of Brown Track and Field. Some of these are certainly less flattering than others, thanks to the speedsuit and my unique facial expressions:

Harvard 2001
Sean Thomas and myself
Possibly one of my two good jumps sophomore year
Warming up at Heps 2001
Me not getting my feet out in front of me
More jumping at Yale
Indoor Heps 2002 (with bleached hair)
More indoor Heps
Ridiculously cold meet at Harvard
Heps championships @ Navy 2002
A very bad jump at Heps

By this point, I hope you've enjoyed my humiliation.
Next post I'll get back to some real issues, including kernel debugging and the joys of KMDB.



So now Linux is innovative?

I just ran across this interview with Linus. If you've read my blog, you know I've talked before about OS innovation, particularly with regards to Linux and Solaris. So I found this particular Linus quote very interesting:

There's innovation in Linux. There are some really good technical features that I'm proud of. There are capabilities in Linux that aren't in other operating systems. A lot of them are about performance. They're internal ways of doing things in a very efficient manner. In a kernel, you're trying to hide the hard work from the application, rather than exposing the complexity.

This seems to fly in the face of his previous statements, where he claimed that there's no innovation left at the OS level. Some people have commented on my previous blog entry that Linux cannot innovate simply because of its size: with so many hardware platforms to support, it becomes impossible to integrate large projects. What you're left with is "technical features" to improve performance, which (in general) are nothing more than algorithm refinement and clever tricks1.

I also don't buy the "supporting lots of platforms means no chance for innovation" stance. Most of the Solaris 10 innovations (Zones, Greenline, Least Privileges) are completely hardware independent. Those projects which are hardware dependent develop a platform-neutral infrastructure, and provide modules only for those platforms which they support (FMA being a good example of this). Testing is hard. It's clear that Linux testing often amounts to little more than "it compiles". Large projects will always break things; even with our stringent testing requirements in Solaris, there are always plenty of "follow on" bugfixes to any project. If Linux (or Linus) is afraid of radical change, then there will be no chance for innovation2.

As we look towards OpenSolaris, I'm intrigued by this comment by Linus:

I am a dictator, but it's the right kind of dictatorship. I can't really do anything that screws people over. The benevolence is built in. I can't be nasty. If my baser instincts took hold, they wouldn't trust me, and they wouldn't work with me anymore. I'm not so much a leader, I'm more of a shepherd. Now all the kernel developers will read that and say, "He's comparing us to sheep." It's more like herding cats.

Stephen has previously questioned the importance of a single leader for an open source project. In the course of development for OpenSolaris, we've looked at many open source models, ranging from the "benevolent dictatorship" to the community model (and perhaps briefly, the pornocracy). I, for one, question Linus' scalability. Too many projects like crash dumps, kernel debugging, and dynamic tracing have been rejected by Linus largely because Linux debugging is "as good as it needs to be." Crash dumps and kernel tracing are "vendor features," because real kernel developers should understand the code well enough to debug a problem from a stack trace and register dump. I love the Solaris debugging tools, but I know for a fact that there are huge gaps in our capabilities, some of which I hope to fill in upcoming releases. I would never think of Solaris debugging as perfect, or beyond reproach.

So it's time for Space Ghost's "something.... to think about....": Is it beneficial to have one man, or one corporation, hold the final say in the development of an open source project?

1 ZFS is an example of truly innovative performance improvements. Once you chuck the volume manager, all sorts of avenues open up that nobody's ever thought about before.

2 To be clear, I'm not ardently anti-Linux. But I do believe that technology should speak for itself, regardless of philosophical or "religious" issues. Obviously, I'm part of the Solaris kernel group, and I believe that Solaris 10 has significant innovations that no other OS has. But I also have faith in open source and the open development model.
If you know of Linux features that are truly innovative (and part of the core operating system), please leave a comment and I'll gladly acknowledge them.



Solaris Paleontology

In the footnote a few days ago, I commented on the fact that the history of Solaris debugging could roughly be divided into three 'eras'. As someone interested in UNIX history, I decided to dig through the Solaris archives and put together a chronology of Solaris debuggability and observability tools. For fun, I divided it into eras to parallel Earth's history. And I swear I'm not out to make anyone feel like a dinosaur (or a prokaryote, for that matter).

I've only been around for one of these "dawn of a new era" arrivals, DTrace1. When one of these revolutionary tools arrives, it's amazing to see how quickly engineers avoid their own past. Try asking Bryan to debug a performance problem on Solaris 9, and you'll probably get some choice phrases politely explaining that while he appreciates the importance of your problem, he would rather throw himself down a slide of broken glass and into a vat of rubbing alcohol. Being the neophyte that I am, I've only ventured into the 'Paleozoic era' on one occasion. After an MDB session on a Solaris 8 crashdump (paraphrased slightly):

$ mdb 0
> ::print
mdb: invalid command '::print': unknown dcmd name
> ::help print
mdb: unknown command: print
> ::please print
mdb: invalid command '::please': unknown dcmd name
> ::ihateyou
$

I quickly ran away screaming, never to return. I think I ended up hiding in a corner of my office for two hours, cradling my DTrace answerbook and whispering "there's no place like home" over and over. I'm still a spoiled brat, but at least I have respect and admiration for those Solaris veterans who crawled through debugging hell so that I could live a comfortable life2. It's also made me feel sorry for the Linux (and Windows) developers out there. Not in the Nelson Muntz "Ha ha! You don't have DTrace!" sense. More like "Poor little guy. It's not his fault his species never evolved opposable thumbs."
There are a lot of brilliant Linux developers out there, stuck in a movement that doesn't embrace debugging or observability as fundamental goals. But this post is supposed to be about history, not Linux. So without further ado, my brief history of Solaris (soon to be available in refrigerator magnet form):

Pre-1989, HADEAN: SunOS 4.X; adb, ptrace, crash
1990, ARCHAEAN begins: SVr4 merge, /proc, truss(1)
1991: vtrace, vmstat(1M), iostat(1M)
1992: Solaris 2.0
1993: mpstat, Solaris 2.2
1994: kernel slab allocator, TNF, basic ptools, Solaris 2.4
1995: (nothing of note)
1996: Solaris 2.5.1; PROTEROZOIC begins: next generation /proc, userland watchpoints
1997: lockstat(1M), pkill and pgrep, libproc
1998: savecore on by default, Solaris 7
1999: libproc for corefiles, coreadm(1M), prstat(1M), lockstat kernel profiling; PALEOZOIC begins: MDB(1), ::findleaks
2000: Solaris 8, EOL of crash(1M)
2001: live process control for MDB, EOL of adb(1), pargs and preap; MESOZOIC begins: kernel CTF data, trapstat(1M)
2002: Solaris 9, libumem(3LIB) and umem_debug(3MALLOC), ::typegraph for mdb(1)
2003: userland CTF, coreadm(1M) content control; CENOZOIC begins: DTrace(1M), intrstat(1M)
2004: DTrace pid provider for x86, pfiles with pathnames, DTrace sched and proc providers, CTF for core libraries, DTrace I/O provider, KMDB(1), DTrace MIB and fpuinfo providers, per-thread ptools

These are my choices based on SCCS histories and putback logs. Obviously, I've failed to include some things. Leave a comment or email if you think something's not getting the recognition it deserves (keeping in mind this is a blog post, not a book).

1 I actually started exactly one day before DTrace integrated. But I had some experience (albeit limited) as an intern the previous year.

2 In all seriousness, it's not that I don't ever have to debug anything, or that the problems we have today are somehow orders of magnitude simpler than those in the past. What these tools provide is a dramatic reduction in time to root-cause. You still need the same inquisitive and logical mind to debug hard problems; it's just that good tools let you form questions and get answers faster than you could before. Really good tools (like DTrace) let you ask the previously unanswerable questions. You may have been able to debug the problem before, but you would have ended up running around in circles trying to get data that's now immediately available thanks to DTrace.



Missing the big picture

I just saw this article the other day, and was amazed that someone could so clearly miss the point of OpenSolaris and Sun's business strategy. In the article, the (apparently anonymous) author tries to argue that Jonathan and Adam are on two opposing sides of an internal corporate struggle between the executives and the engineers over business strategy. He begins with a gross misinterpretation of Jonathan's blogs:

As far as I can dissect Schwartz's argument, it's that IBM made a big mistake by building features on top of open source, because that removes any possible hardware lockin, and lets customers move to less expensive hardware that they can buy from any old vendor.

The point of Jonathan's argument (whether or not you believe it is up to you) is not about hardware at all. The point is that because IBM relies completely on Linux (and doesn't have a distribution of its own), it cannot offer the same integrated solution that others can. If you want to run Linux on IBM hardware, you'll most likely be running RedHat. But if RedHat provides an integrated application server, why would you turn around and buy WebSphere from IBM? Jonathan also discusses software lockin when moving between Linux distributions, but this also has nothing to do with hardware. Yet the author continues to focus on "hardware lockin," twisting Adam's blog entry into this lovely conclusion:

If Solaris is open-sourced, then Sun is about to undermine its own hardware lockin.

For historical reasons, Sun has a reputation of big iron and proprietary UNIX, and with that comes hardware lockin. But if you look at Solaris and the rest of our software stack, you'll quickly see this is no longer true. Whether or not you believe in Sun's hardware prices (think x86 and Opteron), you can't ignore the fact that Solaris runs on commodity x86 hardware. And even if you don't want Solaris, Java Desktop System runs on Linux, and Java Enterprise System runs on Windows and HP-UX. To top it off, they all run on open standards; you can move your apps to non-Sun systems (as long as they are standards compliant) without worry. So where's the hardware lockin?

Sun's message is that we sell systems. Maybe we're not quite there yet, but eventually you'll be able to get an integrated software stack (JES or JDS) on whatever hardware platform and operating system you choose. If you want to run Solaris, great. If you want to get hardware from us, great. If not, we'll make damn sure our software all works together as best as possible.

The author does, however, get one thing right: OpenSolaris will enable ports to new platforms. Ask a kernel engineer or ask a VP; they will both tell you this is a good thing. If our customers want to run on POWER, and OpenSolaris enables this port to happen, then we can bring the weight of our entire software portfolio onto the platform, and do it better than anyone else. Go buy your hardware from IBM, but we'll give you a complete software solution with Java Enterprise System for less than IBM. As Adam points out, OpenSolaris can only help Sun. We love choice, and choice is good.



Debugging the Debugger

I've been missing almost a week, mostly because of my involvement with the amd64 bringup effort. A while ago, I was recruited to get the ptools and mdb up and running in 64-bit mode. This certainly made me appreciate some of the old war stories - all the Solaris veterans have their favorite bug that they debugged using only hex dumps, a pocket knife, and a ball of string. Over time, you start taking the Solaris debugging tools for granted: try going back to Solaris 9 after spending a year with DTrace1. I was in for quite a shock when I learned that my spoiled lifestyle wasn't going to cut it in the jungles of amd64.

It's no secret that we've had the amd64 kernel up and running for a while now. Thankfully, I was not part of the initial bringup effort. Back when I joined, the kernel was already booting multiuser, and I never had to lay my finger on a simulator or diagnose a double fault. 64-bit applications would load and run (thanks in part to a certain linker alien2), but debugging them was basically impossible: no truss, no mdb, no pstack. So where do you begin?

Thankfully, we've had a 64-bit OS for years, and most of the infrastructure was already working. All our tools worked with 64-bit ELF files out of the box, for example. But a lot of things were still broken. I ended up along roughly the following path:

pstack on corefiles: So pstack segfaulted the first time I ran it. At this point I could run elfdump on the corefile, but not much else. The first task was getting pstack to run on corefiles, so I at least knew where to begin inserting my printf() statements. Walking a stack on amd64 can be a tricky thing - so I began with a simple version that works 99% of the time.

mdb for corefiles: The next step was to get MDB chewing on these corefiles. A stacktrace is all well and good, but we need to be able to examine registers and memory. This turned out to be quite a bit of work; mdb is quite a heavy consumer of libproc, and uses some little-used interfaces in libc (in particular, getcontext(2) and makecontext(3c) were annoying). But with a lot of printfs, a few fixes and a few hacks, we had post mortem debugging.

truss: Sadly, I can't take credit for this one. This turned out to be just a bug in fork(2), and once that was fixed, truss worked flawlessly.

mdb for live processes: This was not too difficult thanks to the magic of libproc, which allows us to manipulate live processes and corefiles through the same interface. A few minor tweaks were needed here and there, and some of the finer bugs have yet to be fixed, but it's basically working. Most of the ISA-specific actions (such as setting breakpoints) are the same on ia32 and amd64.

agent LWP and pfiles: Finally, I had to get Psyscall (the libproc internal function that executes a system call in the context of a target process) working. This was particularly annoying, mostly because the code was poorly structured - rather than having separate ISA-specific actions in different files, we had tons of #ifdefs scattered throughout the code. A large part of this was just ripping apart the code and restructuring it in a way that made porting easier. Someday when someone ports Solaris to run on Adam's laptop, they'll appreciate it.

In a testament to the portability of Solaris, there were no large infrastructure changes outside of Psyscall. Basically, I just fixed one small bug after another. So all the debugging tools are now up and running, and with Bryan and Matt helping, we have DTrace and KMDB as well. So now I can go back to a pampered life in my Hollywood Hills mansion, surrounded by DTrace, MDB, and a few of my closest ptools.

1 Solaris debugging can be roughly divided into three eras: pre-mdb (Paleozoic), pre-DTrace (Mesozoic), and modern day (Cenozoic). The arrival of CTF data could be seen as the end of the Triassic period and the beginning of the Jurassic, while KMDB may begin the Pleistocene (a.k.a. modern) era. Sounds like an interesting science project...

2 There were many others involved in getting the kernel this far. But Mike's the only one with a blog, so he gets all the credit.



More on OS innovation

The other day on vacation, I ran across a Slashdot article on UNIX branding and GNU/Linux. The original article was mildly interesting, to the point where I actually bothered to read the comments. Now, long ago I learned that 99% of Slashdot comments are worthless. Very rarely do you find thoughtful and objective comments; even browsing at +5 can be hazardous to your health. Even so, I managed to find this comment, which contained in part some perspective relating to my previous post on Linux innovation:

I have been saying that for several years now. UNIX is all but dead. The only commercial UNIX likely to still be arround in ten years time as an ongoing product is OS/X. Solaris will have long since joined IRIX, Digital UNIX and VMS as O/S you can still buy and occasionaly see a minor upgrade for it. There is a basic set of core functions that O/S do and this has not changed in principle for over a decade. Log based file systems, threads that work etc are now standard, but none of this was new ten years ago. The interesting stuff all takes place either above or below the O/S layer. .NET, J2EE etc are where interesting stuff is happening.

Clearly, this person is not the sharpest tool in the shed when it comes to operating systems. But it begs the question: how widespread is this point of view? We love innovation, and it shows in Solaris 10. We have yet to demo Solaris 10 to a customer without them being completely blown away by at least one of the many features. DTrace, Zones, FMA, SMF, and ZFS are but a few reasons why Solaris won't have "joined IRIX, Digital UNIX, and VMS" in a few years.

Part of this is simply that people have yet to experience real OS innovation such as that found in Solaris 10. But more likely this is just a fundamental disconnect between OS developers and end users. If I managed to get my mom and dad running Java Desktop System on Solaris 10, they would never know what DTrace, Zones, or ZFS is, simply because it's not visible to the end user. But this doesn't mean that it isn't worthwhile innovation: as with all layered software, our innovations directly influence our immediate consumers. Solaris 10 is wildly popular among our customers, who are mostly admins and developers, with some "power users". Even though these people are a quantitatively small portion of our user base, they are arguably the most important. OS innovation directly influences the quality of life and efficiency of developers and admins, which has a cascading effect on the rest of the software stack.

This cascading influence tends to be ignored in arguments over the commoditization of the OS. If you stand at any given layer, you can make a reasonable argument that the software two layers beneath you has become a commodity. JVM developers will argue that hardware is a commodity, while J2EE developers will argue that the OS is a commodity. Even if you're out surfing the web and use a web service developed on J2EE, you're implicitly relying on innovation that has its roots in the OS. Of course, the further you go from the OS, the less prominent the influence is, but it's still there.

So think twice before declaring the OS irrelevant. Even if you don't use features directly provided by the OS, your quality of life has been improved by having them available to those that do use them.



Solaris at OSCON

As you may have noticed from Adam's blog, our time at OSCON was a rousing success. Unfortunately, I don't have enough time to write up a real post, since I'm on vacation for the next few days. Adam summed things up pretty well; the two points I'd reiterate are:

We are eager to learn how to do open Solaris right. Sun has a lot of experience with open source projects, with varying degrees of success. Our meeting with open source leaders was extremely informative; I myself never realized how difficult it is to build a developer community that really works. We're not just throwing source over the wall as a PR stunt or to get free labor; we're doing it (among other reasons) to build a thriving community centered around Solaris. And we need you to help us get it right.

Solaris 10 technology sells itself. Before our BOF, most people we met were skeptical of Solaris. Because we're a proprietary UNIX, we've gained a reputation of being an old dinosaur: Linux is fast and new and evolving, Solaris is slow and old and stagnant. This couldn't be further from the truth, and it doesn't take a marketing campaign to convince the world otherwise. Once people see DTrace, Zones, Solaris Management Framework, Predictive Self Healing, ZFS, and all the other great features in Solaris 10, there's really no question that Solaris is alive and well. Whether you are an administrator or a developer, there will be something in Solaris that will blow you away. If you haven't seen Solaris 10 in action, get your Solaris Express today and spread the word.



Lessons in broken interfaces

In build 60 (Beta 5 or SX 7/04), I fixed a long standing Solaris bug: mounted filesystems could not contain spaces. We would happily mount the filesystem, but then all consumers of /etc/mnttab would fail. This resulted in sad situations like:

# df -h
Filesystem             size   used  avail capacity  Mounted on
/dev/dsk/c0d0s0         36G    13G    22G    38%    /
/devices                 0K     0K     0K     0%    /devices
/dev/dsk/c0d0p0:boot    11M   2.3M   8.4M    22%    /boot
/proc                    0K     0K     0K     0%    /proc
mnttab                   0K     0K     0K     0%    /etc/mnttab
fd                       0K     0K     0K     0%    /dev/fd
swap                  1002M    24K  1002M     1%    /var/run
swap                  1003M   1.3M  1002M     1%    /tmp
# mount -F lofs /export/space\ dir /mnt/space\ mnt
/export/space dir /mnt/space mnt lofs dev=1980000 1090718041
# df -h
df: a line in /etc/mnttab has too many fields
#

Luckily you could unmount the filesystem, but it was quite annoying to say the least. The resulting fix was really an exploration into bad interface design.

/etc/mnttab

This file has been around since the early days of Unix (at least as far back as SVR3). Each line is a whitespace-delimited set of fields, including special device, mount point, filesystem type, mount options, and mount time (see mnttab(4) for more information). Historically, this was a plain text file. This meant that the user programs mount(1M) and umount(1M) were responsible for making sure its contents were kept up to date. This could be very problematic: imagine what would happen if the program died partway through adding an entry, or root accidentally removed an entry without actually unmounting it. Once the contents were corrupted, the admin usually had to resort to rebooting, rather than trying to guess what the proper contents were. Not to mention it makes mounting filesystems from within the kernel unnecessarily complicated.

In Solaris 8, we solved part of the problem by creating the mntfs pseudo filesystem. From this point onward, /etc/mnttab was no longer a regular text file, but a mounted filesystem. The contents are generated on-the-fly from the kernel data structures.
This means that the contents are always in sync with the kernel1, and that the user can't accidentally change the contents. However, we still had the problem that the mount points could not contain spaces, because space was a delimiter with special meaning.

getmntent() and friends

On top of this broken interface, a C API was developed that had even worse problems. Consider getmntent(3c):

int getmntent(FILE *fp, struct mnttab *mp);

There are several problems with this interface:

The user is responsible for opening and closing the file. There is only one mount state for the kernel; why should the user have to know that /etc/mnttab is the place where the entries are stored?

The first parameter is a FILE *. If you're developing a system interface, you should not enforce using the C stdio library. Every other system API takes a normal file descriptor instead.

The memory is allocated by the function on demand. This causes all sorts of problems, including making multithreaded use difficult, and preventing the user from controlling the size of the buffer used to read in the data.

There is no relationship between the memory and the open file. Because of this, a lazy programmer can close the file after the last call to getmntent() while still using the memory, so it must be kept around indefinitely.

By now, it should be obvious that this was an ill-conceived API built on top of a broken interface. Off the top of my head, if I were to re-design these interfaces I would come up with something more like:

mnttab_t *mnttab_init(void);
int mnttab_get(mnttab_t *mnttab, struct mntent *ent, void *scratch, size_t scratchlen);
void mnttab_fini(mnttab_t *mnttab);

The solution

Once /etc/mnttab became a filesystem, we could add ioctl(2) calls to do whatever we wanted. Once we're in the kernel, we know exactly how long each field of the structure is. We create a set of NULL-terminated strings directly in user space, and simply return pointers to them.
This was more complicated than it sounds for the reasons outlined above. We also had to maintain the ability to read the file directly. With this fix, all C consumers "just work". Scripted programs will still choke on a mnttab entry with spaces, but these are by far the minority.

Note that the files /etc/vfstab and /etc/dfs/sharetab still suffer from this problem. There has been some discussion about how to resolve these issues, with the new Service Management Facility being touted as a possible solution. And ZFS (Sun's next generation filesystem) is avoiding /etc/vfstab altogether.

1 There is always the possibility that the mounted filesystems change between the time the file is opened and the data is read.



Is Linux innovative?

In a departure from recent musings on the inner workings of Solaris, I thought I'd examine one of the issues that Bryan has touched on in his blog. Bryan has been looking at some of the larger issues regarding OS innovation, commoditization, and academic research. I thought I'd take a direct approach by examining our nearest competitor, Linux. Bryan probably said it best:

We believe that the operating system is a nexus of innovation.

I don't have a lot of experience with the Linux community, but my impression is that the OS is perceived as a commodity. As a result, Linux is just another OS, albeit one with open source and a large community to back it up. I see a lot of comments like "Linux basically does everything Solaris does" and "Solaris has a lot more features, but Linux is catching up." Very rarely do I see mention of features that blow Solaris (or other operating systems) out of the water. Linus himself has said:

A lot of the exciting work ends up being all user space crap. I mean, exciting in the sense that I wouldn't car [sic], but if you look at the big picture, that's actually where most of the effort and most of the innovation goes.

So Linus seems to agree with my intuition, but I'm in unfamiliar territory here. So, I pose the question:

Is the Linux operating system a source of innovation?

This is a specific question: I'm interested only in software innovation relating to the OS. Issues such as open source, ISV support, and hardware compatibility are irrelevant, as is software which is not part of the kernel or doesn't depend on its facilities. I consider software such as the Solaris ptools as falling under the purview of the operating system, because they work hand-in-hand with the /proc filesystem, a kernel facility. Software such as GNOME, KDE, X, GNU tools, etc., is all independent of the OS and not germane to this discussion. I'm also less interested in purely academic work; one of the pitfalls of academic projects is that they rarely see the light of day in a real-world commercial setting. Of course, most innovative work must begin as research before it can be viable in the industry, but certainly proven technologies make better examples.

I can name dozens of Solaris innovations, but only a handful of Linux ones. This could simply be because I know so much about Solaris and so little about Linux; I freely acknowledge that I'm no Linux expert. So are there great Linux OS innovations out there that I'm just not aware of?



Watchpoints features in Solaris 10

In my last post I described how watchpoints work in Solaris, or rather how they're supposed to work. The reality is that there have been some small problems that have prevented a large number of watchpoints from being practical for complicated programs. I've made some changes in Solaris 10 so that they work in all situations, which made it onto Adam's Top 11-20 Features in Solaris 10.

How watchpoints are used

Typically, watchpoints are used in one of two ways. First, they are used for debugging userland applications. If you know that memory is getting corrupted, or know that a variable is being modified from an unknown location, you can set a watchpoint through a debugger and be notified when the variable changes. In this case, we only have to keep track of a handful of watchpoints. But they are also used for memory allocator redzones, to prevent buffer overflows and memory corruption. For every allocation, you put a watched region on either end, so that if the program tries to access unknown territory, a SIGTRAP signal is sent so the program can be debugged. In this case, we have to deal with thousands of watchpoints (two for every allocation), and we fault on virtually every heap access1.

Watchpoints in strange places

Watchpoints have worked for the most part since they were put into Solaris. Whenever a watchpoint is tripped, we end up in the kernel, where we have to look at the instruction we faulted on and take appropriate action. There were some instructions that we didn't quite decode properly when watchpoints were present. On SPARC, the cas and casx instructions (used heavily in recent C++ libraries) could cause a SEGV if they tried to access a watched page. On x86, instructions that accessed the stack (pushl and movl, for example) would cause a similar segfault if there was a watchpoint on a stack page.

Multithreaded programs

There has been a particularly nasty watchpoint problem for a while when dealing with lots of watchpoints in multithreaded programs. When one thread hits a watchpoint, we have to stop all the other threads. But in the process of stopping, those threads may themselves trigger watchpoints while we are trying to stop the original watchpoint thread. We end up spinning in the kernel, where the only solution is to reboot the system.

Scalability

In the past, watchpoints were kept in a linked list for each process. This means that every time a program added a watchpoint or accessed a watched page, it would spend a linear amount of time trying to find the watchpoint. This is fine when you only have a handful of watchpoints, but can be a real problem when you have thousands of them. These linked lists have since been replaced with AVL trees. Individual watchpoints may be slow, but 10,000 watchpoints now have nearly the same impact as 10 watchpoints. This can result in as much as a 100x improvement for large numbers of watchpoints.

All of the above problems have been fixed in Solaris 10. The end result is that tools like watchmalloc(3malloc) and dbx's memory checking features are actually practical on large programs.

1 Remember that we have to fault on every access to a page that contains a watchpoint, even if it's not the address we're actually interested in.



Watchpoints 101

As Adam noted in the Solaris Top 11-20, watchpoints are now much more useful in Solaris 10. Before I go into specific details regarding the Solaris 10 improvements, I thought I'd give a little technical background on how watchpoints actually work. This will be my second highly technical entry in as many days; in my next post I promise to tie this into some real-world applications and noticeable improvements in S10.

The idea of watchpoints has been around for a long time. The basic idea is to allow a debugger to set a watchpoint on a region of memory within a process. When that region of memory is accessed, the debugger is notified and can take appropriate action. This typically serves two purposes. First, it's useful for interactive debuggers when determining when a region of memory gets modified. Second, it can be used as a protection mechanism to catch buffer overflows (more on this later).

As with most modern operating systems, Solaris implements a virtual memory system. A complete explanation of how virtual memory works is beyond the scope of a single blog post. The simplest way to explain it is that each process refers to memory by a virtual address, which corresponds to a physical piece of memory. Each piece of memory is called a page, which can be either mapped (resident in RAM) or unmapped (possibly stored on disk). The operating system controls when and how pages get mapped in or out of memory. If a program tries to access memory that is unmapped, the OS maps in the necessary pages as needed. Once pages are mapped, accesses are handled directly in hardware until the OS decides to unmap the memory1. The benefits are many: each process sees a unified flat memory space, processes cannot access each other's memory, and unused pages can be stored on disk until needed.

To implement watchpoints, we need a way for the operating system to intercept accesses to a specific virtual page within a process.
If we leave pages mapped, then accesses are handled in hardware and the OS has no say in the matter. So we keep pages with watchpoints unmapped until they are actually accessed. When the process tries to read, write, or execute from the watched page, the OS gets notified via a trap2. At this point, we temporarily map in the page and single step over the instruction that triggered the trap. If the instruction touches a watched area (note that there can be more than one watched area within a page), we notify the debugger through a SIGTRAP signal. Otherwise, the instruction executes normally and the process continues.

Things become a little more complicated in a multithreaded program. If we map in a page for a single thread, then all other threads in the process can access that memory without OS intervention. If another thread accesses the memory while we're stepping over the instruction, we can miss triggering a watchpoint. To avoid this, we have to stop every thread in the process while we step over the faulting instruction. This can be very expensive; we're looking into more efficient methods. I won't spend too much time discussing how the debugger communicates with the OS when setting and reacting to watchpoints; most of that information can be found in the proc(4) manpage.

In my next post I'll examine some of the specific enhancements made to watchpoints in Solaris 10.

1 This is obviously a very simplistic view of virtual memory. Curious readers should try a good OS textbook or two for more detailed information.

2 Traps are quite an interesting subject by themselves. On Solaris SPARC, you can see what traps are occurring with the very cool trapstat(1M) utility that Bryan wrote.



Inside pfiles with pathnames

I'm finally back from vacation, and I'm here to help out with Adam's Top 11-20 Solaris Features. I'll be going into some details regarding one of the features I integrated into Solaris 10, pfiles with pathnames (which was edged out by libumem for the #11 spot by a lean at the finish line). This will be a technical discussion; for a good overview of why it's so useful, see my previous entry.

There were several motivations for this project:

Provide path information for MDB and DTrace.
Make pathname information available for pfiles(1).
Improve the performance of getcwd(3c).

First of all, we needed to record the information in the kernel somewhere. In Solaris, we have what's known as the Virtual File System (VFS) layer. This is an abstract interface, where each filesystem fills in the implementation details so that no other consumer has to know them. Each file is represented by a vnode, which can be thought of as a superclass if you're familiar with inheritance. The end result is that we can open a UFS file in the same way we open a /proc file, and the only one who knows the difference is the underlying filesystem. We can also change things at the VFS layer without having to worry about each individual filesystem.

To address concerns over performance and the difficulty of bookkeeping, it was necessary to adjust the constraints of the problem appropriately. It is extremely difficult, if not impossible, to ensure that the path is always correct (consider hard links, unlinked files, and directory restructuring). To make the problem easier, we make no claim that the path is currently correct, only that it was correct at one time. Whenever we translate from a path to a vnode (known as a lookup) for the first time, we store the path information within the vnode. The performance hit is negligible (a memory allocation and a few string copies), and it only occurs when first looking up the vnode.
We must be prepared for situations where no pathname is available, as some files have no meaningful path (sockets, for example).

With the magic of CTF, MDB and DTrace need no modification. Crash dumps now have pathnames for every open file, and with a little translator magic we end up with a stable DTrace interface like the io provider. We also use this to improve getcwd performance. Normally, we would have to look up "..", iterate over each entry until we find the matching vnode, record the entry name, lather, rinse, repeat. Now, we take a first stab at it by doing a forward lookup of the cached pathname, and if it resolves to the same vnode, we simply return the pathname. getcwd has very stringent correctness requirements, so we fall back to the old method when our shortcut fails.

The only remaining difficulty was exporting this information to userland for programs like pfiles to use. For those of you familiar with /proc, this is exactly the type of problem it was designed to solve. We added symbolic links in /proc/<pid>/path for the current working directory, the root directory, each open file descriptor, and each object mapped in the address space. This allows you to run ls -l in the directory and see the pathname for each file. More importantly, the modifications to pfiles become trivial. The only tricky part is security. Because a vnode records only one name, and hard links or permission changes can intervene, it's possible for the user to be unable to access the path as it was originally saved. To avoid this, we do the equivalent of a resolvepath(2) in the kernel, and reject any paths that cannot be accessed or do not map to the same vnode. The end result is that we may lose this information in some exceptional circumstances (the directory layout of a filesystem is relatively static), but as Bart is fond of reminding us: performance is a goal, correctness is a constraint.



Real life obfuscated code

In a departure from my usual Solaris propaganda, I thought I'd try a little bit of history. This entry is aimed at all of you C programmers out there who enjoy the novelty of Obfuscated C. If you think you're a real C hacker and haven't heard of the obfuscated C contest, then you need to spend a few hours browsing their archives of past winners1.

If you've been reading manpages on your UNIX system, you've probably been using some form of troff2. This is an early typesetting language processor, dating back to pre-UNIX days. You can find some history here. The nroff and troff commands are essentially the same; they are built largely from the same source and differ only in their options and output formats.

The original troff was written by Joe F. Ossanna in assembly language for the PDP-11 in the early 70s. Along came this whizzy portable language known as C, so Ossanna rewrote his formatting program. However, it was less of a rewrite and more of a direct translation of the assembly code. The result is a truly incomprehensible tangle of C code, almost completely uncommented. To top it off, Ossanna was tragically killed in a car accident in 1977. Rumour has it that attempts were made to enhance troff before Brian Kernighan caved in and rewrote it from scratch as ditroff.

If you're curious just how incomprehensible 7,000 lines of uncommented C code can be, you can find a later version of it from The Unix Tree, an invaluable resource for the nostalgic among us. To begin with, the files are named n1.c, n2.c, etc. To quote from 'n6.c':

setch(){register i,*j,k;extern int chtab[];if((i = getrq()) == 0)return(0);for(j=chtab;*j != i;j++)if(*(j++) == 0)return(0);k = *(++j) | chbits;return(k);}find(i,j)int i,j[];{register k;if(((k = i-'0') >= 1) && (k
