Tuesday Apr 05, 2011

New Post

I am just doing a quick post to my blog.

Wednesday Jan 27, 2010

Cross of Lorraine

Wednesday Sep 16, 2009

7000 Series Takeover and Failback

The two fundamental operations most people associate with clusters are takeover (assuming control of resources from a failed peer) and failback (relinquishing control of those resources to that peer after it has resumed operation). It's very important to understand what these operations mean in the context of the Sun Storage 7310 and 7410 NAS appliances. A number of important changes have been made in the recently-released 2009.Q3 software release that affect how these operations work - all of them, we obviously believe, for the better - but the administrative interfaces are unchanged. The product documentation (PDF not yet updated for Q3 as of this writing) has also been greatly enhanced to better describe the clustering model and administrative operations that apply to these products, and I would strongly recommend that you avail yourself of that resource; you'll want to be familiar with it to understand some of the terminology I'll use here. But many of our customers and partners have asked us questions about takeover performance, and in order to address those questions I will need to go into greater detail about the implementation of these operations. So let's take a look under the hood. It's important to understand that nearly everything discussed here is an implementation detail that may change without notice in future software releases, and applies very specifically to the new 2009.Q3 software.

A First-Order Look

Takeover and failback consist of operations performed on each resource eligible for the operation. The selection of eligible resources is described in the product documentation and depends on the resource type and, if applicable, which cluster peer has been assigned ownership of it. For the sake of simplicity, we will define a fairly typical cluster configuration in which there is a single pool consisting of the disks and log devices in 4 J4400 JBODs, a pair of network interfaces by which clients will access that storage via NFS, and a network interface private to each head for administration only. We will refer to the heads as simply A and B. The pool and service interfaces are assigned to head A.

In this configuration, the following resources of interest will exist on both heads:

  • Eight resources of the form ak:/diskset/(uuid)
  • ak:/zfs/pool-0
  • ak:/nas/pool-0
  • ak:/net/nge0
  • ak:/net/nge1
  • ak:/shadow/pool-0
  • ak:/smb/nge0
  • ak:/smb/nge1

In addition, head A will have ak:/net/nge2 and head B will have ak:/net/nge3, representing the private administrative network interfaces. There will also be a very large number of other resources representing components with upgradeable firmware, service configuration, users and roles, and other configuration state replicated between heads. Because these resources are replicas and do not normally have any activity associated with them at takeover or failback time, we will not consider them further. The resources of interest must be acted upon in dependency order; for example, we must open the ZFS pool and mount and share all of its shares before we can safely bring any network interfaces up that clients expect to use to access those shares. Otherwise clients could attempt to access the shares before they are available, receiving a response that would result in stale filehandle errors. The above listing of resources reflects the dependency order for import; as you would expect, the opposite order must be observed when exporting.
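The ordering rule above can be sketched as follows. This is purely illustrative, not appliance code; the diskset UUIDs are made up, and the resource names follow the example configuration:

```python
from typing import Callable, List

# Dependency order for import, per the list above: disks first, then
# the pool and its symbiotes, then networking, so clients never see an
# interface before the shares behind it exist.
IMPORT_ORDER: List[str] = [
    *[f"ak:/diskset/uuid-{i}" for i in range(8)],  # hypothetical UUIDs
    "ak:/zfs/pool-0",
    "ak:/nas/pool-0",
    "ak:/net/nge0",
    "ak:/net/nge1",
    "ak:/shadow/pool-0",
    "ak:/smb/nge0",
    "ak:/smb/nge1",
]

def import_all(import_fn: Callable[[str], None]) -> None:
    """Invoke each resource class's import function in dependency order."""
    for r in IMPORT_ORDER:
        import_fn(r)

def export_all(export_fn: Callable[[str], None]) -> None:
    """Export observes the opposite (reverse dependency) order."""
    for r in reversed(IMPORT_ORDER):
        export_fn(r)
```

The point is simply that the same list drives both directions: import walks it forward, export walks it backward.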

It is worth noting that most of these resources do not appear in the management UIs. This is deliberate: the nas, shadow, and smb resource classes are symbiotes; that is, they always have the same resource identifier and ownership as their respective masters but have distinct locations in the dependency chain and distinct actions required for import or export. This allows us finer-grained dependency control and makes the implementation simpler and more modular by separating subsystems from one another. This becomes very important when discussing takeover and failback performance: the symbiote resources must be exported and/or imported, and these operations take time.


When head A fails (let us assume it has been powered off accidentally by our data center personnel), head B will initiate takeover once it detects that no heartbeats have arrived from head A. This timeout period is currently 500ms, and takeover is initiated as soon as the timeout elapses. At the beginning of takeover, an arbitration process is performed that protects user data in the case in which all cluster I/O has failed but both heads are still functioning. This arbitration process consists of attempting to take the zone lock (defined by SAS-2) on each SAS expander present in the storage fabric. These locks are acquired and dropped in a defined order by any head attempting to enter the OWNER state. The locks are held with a fixed timeout period, so that if the holder of the locks does not continually reacquire them they will eventually be dropped. A thread on the OWNER head does this at an interval significantly less than the timeout period. Therefore, if a head attempting to enter the OWNER state fails to acquire the locks, it will wait until the timeout interval expires and try again. If it again fails, it will reboot, allowing its peer (which must still be functioning) to take control of shared resources. This process prevents both heads from attempting to perform simultaneous write access to shared storage, which would destroy or corrupt data. The timeout interval is set to 5 seconds in current products, meaning that this step can take up to 5 seconds to complete plus the time to contact all expanders (typically around 1-2s even in the largest configurations). Since the zone locks are not held when in the CLUSTERED state, this will normally take less than 2 seconds overall. Only in the cases where arbitration is actually necessary - such as when all three cluster I/O links are disconnected between two functioning heads - or when taking over directly from the STRIPPED state can the additional 5 second penalty apply.

After acquiring the zone locks, the surviving head will evaluate each resource in the resource map in dependency order. If the resource is not already imported, its class's import function will be invoked. Since head B does not own any of the singleton resources listed above, none of them nor any of their symbiotes will be imported. We will therefore invoke the diskset class's import function for each of the disksets, then the zfs class's import function for pool-0, then the nas class's import function for pool-0, then the net class's import function for nge0, and so on until we have attempted to import all of these resources. Note that if a failure occurs we will simply mark the resource faulted and proceed: our peer is down so we must make a best effort. When head A resumes functioning, it will rejoin the cluster, meaning that the current list and state of all resources will be transferred from head B over the intracluster I/O subsystem. Head A will not, however, take control of any of the singleton resources or their symbiotes; it will import only its own private resource ak:/net/nge2 as it transitions into the STRIPPED state following rejoin. This behaviour prevents ping-ponging and allows the administrator to verify that the restored head has had any hardware issues addressed before returning it to service.


Now that head A has rejoined, a failback can be initiated from head B. During failback, head B will walk the list of resources in reverse dependency order, invoking the resource's class's export function for each resource that is not owned by head B. If any of these functions fails, head B will generate an alert and reboot itself. This is done to ensure that the cluster is in a consistent, well-defined state: it would not be safe for head A to import a resource that is still under the control of head B, nor would it be possible for head A to enter a defined cluster state without importing all of the resources assigned to it. Likewise, if head B attempted to re-import the resource that could not be exported, that operation or some other re-import required by it could (and likely would) fail as well, making matters worse. Therefore B's reboot will trigger a takeover by A and consistency is maintained. Assuming a successful export, head B will now perform an intracluster RPC to head A instructing it to begin importing its resources. In response, head A will walk the list of resources in dependency order, invoking each resource's class's import function for each resource assigned to it (but not any assigned to head B). If any of these functions fails, head A will generate an alert and reboot itself, again triggering takeover from head B and maintaining consistency.
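The failback sequence reduces to a short state machine. Again a sketch with invented function names, not the management software itself:

```python
# Sketch of failback as described above. Any failure ends in an alert
# and a reboot of the failing head, which triggers a takeover by its
# peer and keeps cluster state well-defined.

def failback(peer_resources, export_fn, rpc_peer_import, alert, reboot):
    # Head B exports the peer's resources in reverse dependency order.
    for r in reversed(peer_resources):
        if not export_fn(r):
            alert(f"export of {r} failed")
            reboot()          # peer will take over everything
            return False
    # Intracluster RPC: tell the peer to import its resources in
    # dependency order. The peer applies the same alert-and-reboot
    # rule if any of its imports fail.
    return rpc_peer_import(peer_resources)
```

Either head rebooting is the escape hatch: a reboot always hands everything to the survivor, which is a defined state.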

A Closer Look

Since failure detection and zone lock acquisition together take at most a few seconds, it is clear that we will need to understand the performance characteristics of each import and export function in order to understand overall takeover and failback performance. What exactly does each of them do?


A diskset is exactly what its name implies: a collection of disks managed together. A few simple rules govern disksets: disks in a diskset are always part of the same storage pool, or no storage pool; disks in a diskset are always located in the same physical enclosure such as a J4400; and disks in a diskset are always imported or exported together. The mapping of disksets onto the slots or bays in a storage enclosure is defined by metadata delivered as part of the appliance software and may vary by product or enclosure type. Administrators do not manage disksets directly; they are handled automatically by the software when storage pools are created and destroyed.
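The diskset rules can be restated as invariants. The disk representation below (dictionaries with "pool" and "enclosure" keys) is invented purely for illustration:

```python
# Sketch: the diskset invariants stated above. Every disk in a diskset
# belongs to the same pool (or uniformly to no pool, pool=None) and
# lives in the same physical enclosure.
def valid_diskset(disks):
    pools = {d["pool"] for d in disks}           # same pool, or no pool
    enclosures = {d["enclosure"] for d in disks} # same physical shelf
    return len(pools) == 1 and len(enclosures) == 1
```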

In the abstract, disksets would be merely an engineering convenience, containers used to track the allocation of disks. Unfortunately, the need to support ATA disks necessitates a far more complex implementation. Because the ATA protocol does not support communication between multiple initiators and a single target, the SAS standard defines the notion of an affiliation, a mapping between a single initiator and an ATA target. Only the initiator that owns the affiliation can communicate with the target; any attempt by another initiator to do so will fail. Ownership of the affiliation is tracked by the SAS expander in which the STP bridge port associated with the ATA target is located, and an affiliation is claimed automatically the first time an initiator performs an I/O operation on a given target. Note that I/O operations in this context are not limited to those that affect the media: it is not possible to obtain even basic information about the device without claiming the affiliation. The process of obtaining that information and using it to create a device node used by ZFS and other software to interface with the disk is known as enumeration, a process that is normally performed by default by Solaris and other operating systems on every disk visible to the system's HBAs. However, as we can see, attaching two initiators to the same expander and performing automatic enumeration will result in chaos if there are ATA disks behind that expander: each system will claim some subset of the affiliations during enumeration but hang for an extended time attempting to enumerate those disks whose affiliations were claimed by its peer. The net result would be extremely long boot times and some random subset of disks visible to each system. Clearly this is untenable.

Disksets are a solution to this problem. By disabling automatic enumeration of ATA disks, we can control when the enumeration process is performed, limiting it to circumstances in which it is known to be safe: storage configuration and those times when we know our peer is not attempting to access the disks; e.g., during takeover and failback. Therefore the diskset import function must, for each disk, cause the operating system to enumerate that disk via each possible path. The export routine, likewise, must cause the operating system to "forget about" the disk and relinquish its affiliations for each initiator.

In previous software releases, diskset import time usually dominated the takeover and failback processes. While the 12 disks in each diskset were enumerated in parallel, fundamental problems in the kernel and an inability to process disks from multiple disksets in parallel limited the parallelism that could be exploited. Each diskset typically took 15 to 30 seconds to import, and, worse, could take much longer in certain error paths, especially if, during takeover, the expander had not yet torn down the defunct initiator's affiliations. The current software release improves the situation considerably: all disksets can be imported in a single invocation of the diskset import function (known as "vector import"), and up to 96 disks can be enumerated in parallel, up from 12. In addition, improvements in error handling and timeouts have greatly reduced the worst-case import time when disk or affiliation errors occur. Overall, configurations such as our example above will typically see reductions in diskset import time on the order of 4-6x, with an accompanying large decrease in variance. That is, we might reasonably expect all 8 disksets in our example configuration to be imported in 30s. Because most of the overall benefits come from increased parallelism, smaller configurations will see somewhat smaller improvements. Diskset export is not, and has never been, a significant contributor to failback times; undoing the enumeration process is typically measured in milliseconds for each disk. This means that the relationship between takeover and failback times depends mainly on which contributing factors dominate each activity; i.e., the configuration and uses of the system.
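The shape of the "vector import" change can be sketched with ordinary thread-pool parallelism. The 96-disk limit comes from the text; everything else here is a stand-in, not the kernel's actual enumeration machinery:

```python
# Sketch: rather than importing one 12-disk diskset at a time, flatten
# all disksets into a single invocation and enumerate up to 96 disks in
# parallel. enumerate_disk stands in for the real per-disk work.
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_DISKS = 96   # up from 12, per the text

def vector_import(disksets, enumerate_disk):
    disks = [d for ds in disksets for d in ds]   # flatten all disksets
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_DISKS) as pool:
        return list(pool.map(enumerate_disk, disks))
```

With 8 disksets of 12 disks, all 96 disks are in flight at once instead of 12, which is where most of the 4-6x improvement comes from.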


Importing a ZFS pool resource simply means opening the pool, reading the labels from each disk, and creating the attachment points for any zvols (used to provide block storage) contained in the pool. Reading of labels takes constant time as it is performed in parallel, but the second portion of this activity requires walking all datasets in the pool, which is done sequentially. The time taken here is therefore proportional to the sum of the number of projects, filesystem shares, and LU shares the pool contains. It is, however, usually much less than the mounting and sharing activities, which we will investigate next.


The NAS symbiote of the ZFS pool is responsible for mounting and sharing all of the ZFS datasets, including zvols used as backing stores for block devices. This activity therefore takes time proportional to the number of shared filesystems (NFS, CIFS, FTP, HTTP, SFTP, FTPS) and block devices (iSCSI). Tests have shown that NFS shares contribute between 5ms and 15ms each to this process but, because the meaning of "sharing" depends on the protocol, it is difficult to provide an overall estimate of the constants associated with this activity. Likewise, export requires unsharing and then unmounting the filesystems and LUs, which is also linear and requires variable time that is protocol-dependent. Tests have shown that each NFS share contributes a similar time increment to the export process as it does to the import process.
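Applying the quoted per-share figure gives a quick back-of-envelope range. This is arithmetic only; other protocols have different per-share constants that are not quantified here:

```python
# Back-of-envelope estimate using the 5-15ms-per-NFS-share figure
# quoted above. Returns a (low, high) range in seconds.
def nfs_share_time_range(num_shares, low_ms=5, high_ms=15):
    return (num_shares * low_ms / 1000.0, num_shares * high_ms / 1000.0)
```

For example, 3000 NFS shares would contribute roughly 15 to 45 seconds to import, and a similar amount to export.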


The net class's resources represent the state of a network interface, which will already have been plumbed and configured on both heads. When this resource is imported, the subsystem informs the kernel that the addresses on the interface should be brought up. This activity is performed sequentially for each address and the time taken is therefore linear in the number of addresses configured for the interface. However, the time taken for each is minuscule and it is unusual to assign more than a few addresses to an interface. The entire operation normally completes in a second or two. Exporting is directly analogous and takes a comparable length of time.


This resource class, new in the 2009.Q3 software, manages shadow migration destinations associated with the pool. Because we cannot necessarily mount the shadow migration sources until the network interfaces are imported, this symbiote of the pool resource is imported after all net resources. It is responsible for activating each shadow mount, which will cause the source filesystems to be mounted. This occurs sequentially, and is therefore linear in the number of shadow sources. Of course, shadow sources that are local will take very little time to mount while NFS client mounts can take a significant amount of time. Exporting is the complement, and is normally very fast in all cases.


Each net resource has an smb symbiote, responsible for notifying the CIFS subsystem that an additional network interface should be used to provide service to CIFS clients. This operation usually takes less than a second, so its contribution to takeover and failback time is effectively negligible.

Putting It Together

As we've seen, there are many moving pieces involved in takeover and failback. Each resource class has its own set of operations for import and export; some take effectively constant time while others depend on the number of shares and projects or the number of disks. Even where a clear dependency in a particular variable can be characterised, the actual time taken to perform each individual suboperation may not be known or even constant; for example, sharing a filesystem can take a different amount of time depending on the protocol used and even the parameters associated with that particular share. For all these reasons, I strongly encourage anyone who is especially sensitive to takeover or failback time to perform some tests based on their own real-world configurations. This will become even more important as overall performance improves: for example, the recent improvements to diskset import time make the number and type of shares much more relevant to total takeover time. Many configurations may achieve 4x or better overall takeover time improvement as a result of that work, but a configuration with, for example, 3000 shares on a pool consisting of a single diskset may see little or no change. As with any benchmarking activity, there is no substitute for testing your own configuration, but I hope the above description of the process and rough guidelines will be helpful in establishing expectations going into that testing process so that anomalous behaviour can be identified and tracked down.
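As a starting point for such testing, the figures quoted earlier in this post can be combined into a very rough model. Every constant below is either a typical value quoted above or an outright assumption; it is a way to set expectations before measuring, not a substitute for measurement:

```python
# Rough, illustrative model of takeover time. Constants are typical
# values from this post: 500ms heartbeat timeout, ~2s arbitration in
# the common case, and 5-15ms per NFS share (10ms midpoint assumed).
def takeover_time_estimate(diskset_import_s, nfs_shares,
                           per_share_s=0.010,
                           heartbeat_s=0.5,
                           arbitration_s=2.0):
    return (heartbeat_s + arbitration_s + diskset_import_s
            + nfs_shares * per_share_s)
```

For the example configuration (8 disksets importing in about 30s, say 100 NFS shares) this predicts takeover on the order of half a minute, dominated by diskset import; with thousands of shares the share term dominates instead.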

In a future post I'll talk about a few of the remaining opportunities for improvement. Until then, ALL HAIL CLUSTRON!

Monday Nov 10, 2008

Low-Availability Clusters

Greetings, puny humans! I am Sun part number 371-3024, a Sun Fishworks Cluster Controller 100, but the world knows me as CLUSTRON. Today you'll be giving me all your gold in tribute as I tell you about the clustering strategy implemented in Fishworks appliances and my integral place in the Sun Storage 7410C.

All clustering software comes with a devastating intrinsic drawback: its own existence.

As anyone who has worked in the industry can tell you, the only bug-free software is the software that isn't written. So when we talk about using two servers - or appliances - to provide higher availability through redundancy, one ought to be immediately suspicious. Managing multiple system images and coordinating their actions is a notoriously difficult problem. And when the state shared between them consists of the business-critical data you're using the appliances to store, you ought to be downright skeptical. After all, while simple logic dictates that two systems ought to offer better availability than one, there's the small matter of the software required to take that from a simplistic statement of the obvious to a working implementation fulfilling at least some of that promise. It's not just software in the usual sense, either; hardware - like me - is also in play, and most modern hardware contains software of its own, usually called firmware. Firmware is really just software for which the system designer has no source code, no observability tools, and no hope. Generally speaking, more software - wherever it runs - means more bugs, more time and energy devoted to management, and more opportunity for operator error; all of these factors act to reduce availability, eating away at the gains offered by the second head. Anyone who tells you otherwise is lying. Liars make CLUSTRON angry.

The typical clustered unified storage server consists of a pair of underpowered servers, each populated with some HBAs, some NICs, a small, expensive DRAM buffer with a giant battery, and an InfiniBand (IB) HCA. Oh, and some software. Lots of software, as it turns out, because the way these implementations provide synchronous write semantics to clients is by mirroring the contents of their battery-backed DRAM buffers to one another in real time across those IB links. When a server fails, its partner has access both to the disk storage (usually via FC) and the in-flight transactions stored in its own copy of NVRAM, so it can pick up where its dead partner left off. The onus is often on the administrator, however, to keep configuration state in sync; while it changes infrequently, it usually needs to be identical in order for clients to observe correct behaviour when one of the two servers has failed. And all this comes at a hefty price in cost - NVRAM and IB HCAs take up precious I/O slots (reducing total capacity and performance) and are not particularly cheap. But there is also a complexity cost: a quick glance at the Solaris IB stack turns up about 65,000 lines of source code, and of course that doesn't include an NVRAM driver or the code needed to coordinate mirroring NVRAM over IB. None of the software in such an implementation is reused elsewhere in the storage stack, so it has to be developed and tested independently, and the IB HCA is likely to contain a fat chunk of that nasty undebuggable firmware of which you'd like to have as little as possible in your core systems. Worst of all, because that interconnect link is in the data path and doubles as the cluster "heartbeat" channel, under extreme load it may be possible to lose heartbeats and incorrectly conclude that your partner is dead. That can lead to a takeover at the worst possible time: under extreme load (most general-purpose clustering software suffers from this deficiency as well). Overall, it's almost as if the engineers who designed these systems kept adding complexity, cost, and opportunity for error until they finally ran out of ideas.

The Fishworks approach to clustering is somewhat different. At the bottom of the stack lies the most important difference: me, your CLUSTRON overlord. Instead of IB in the data path, I offer three redundant inter-head communication links for use only by management software. We'll come back to this in a bit. The data that would otherwise be written to NVRAM and mirrored over IB is instead written once to each intent log device as if it were an ordinary storage device. These devices combine flash for persistence with supercapacitor-backed DRAM for performance. Since they live next to the disks in your JBODs, they can - just like NVRAM contents - be accessed by an appliance when it takes over for a failed partner. But this entire path is much simpler; notice that we are reusing the basic I/O path that is already used - and tested - for writing to ordinary disks. And since there's nothing to mirror, we don't need any software on the appliances to drive IB devices or coordinate NVRAM mirroring. Each appliance simply writes its intent log records to the device(s) associated with a given storage pool and replays them when later taking control of that pool, either on boot or during a cluster takeover or failback activity.
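The intent-log idea can be sketched very simply. This is a hypothetical toy structure, not ZFS's actual ZIL implementation; the point is only that the log lives with the pool, so whichever head next imports the pool can replay it:

```python
# Toy sketch of the intent-log approach: records are written to a log
# device that lives in the storage pool itself, so whichever head
# imports the pool next (boot, takeover, or failback) can replay them.

class IntentLog:
    def __init__(self):
        self.records = []          # stands in for the log device

    def append(self, record):
        # Synchronous write semantics: the client's write is
        # acknowledged only once the record is persistent here.
        self.records.append(record)

    def replay(self, apply_fn):
        # Run at pool import time to recover in-flight transactions.
        for rec in self.records:
            apply_fn(rec)
        self.records.clear()
```

Because the log travels with the pool, there is nothing to mirror between heads and no cluster-specific write path to maintain.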

But what is my role in this? I provide basic connectivity for two purposes:

  • Configuration sync - if you make a change to a service property (say, you add a DNS server) on one appliance, this change is transparently propagated to its partner. If that partner is down, it will pick up the change when it next boots and rejoins the cluster.
  • Heartbeats - this is how a clustered Fishworks appliance decides to take control of cluster resources. No heartbeats? It must be dead. It wouldn't become a soulless machine to mourn its passing, so I'd better just poke the userland management software to initiate a takeover.
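The heartbeat decision itself is nothing exotic. A sketch with an invented interface (the real monitor notes heartbeat receipt in high-level interrupt context, as described below; the 500ms figure is quoted elsewhere in this archive):

```python
# Sketch of a heartbeat monitor: if no heartbeat has arrived within
# the timeout, the takeover path should be initiated. The interface
# names here are made up for illustration.
HEARTBEAT_TIMEOUT = 0.5   # seconds

class HeartbeatMonitor:
    def __init__(self, clock):
        self.clock = clock        # e.g. time.monotonic
        self.last = clock()

    def heartbeat(self):
        # Called on each heartbeat received from the peer.
        self.last = self.clock()

    def should_take_over(self):
        return self.clock() - self.last >= HEARTBEAT_TIMEOUT
```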

On the face of it, that seems unremarkable. One could presumably multiplex these functions onto a traditional IB-based implementation. But recall that a key goal in any clustering implementation must be reducing the complexity of the software and thereby limiting the number of bugs that can affect core functionality. I designed myself to do exactly that. Instead of a complex, featureful, high-performance I/O path, I provide some seriously old-school technology, namely 2 plain old serial links - the kind to which you might once have attached a modem to dial into the WOPR. My third link offers somewhat better performance but again uses only existing software drivers; it is an Intel gigabit Ethernet device. All three links provide redundant heartbeat paths (at all times) and all three can be used to carry management traffic, though management traffic is preferentially routed over the fastest available link to provide a better interactive management experience.

The advantages of this design are several:

  • Serial devices typically take interrupts at high priority. By noting the receipt of heartbeat messages in high-level interrupt context, I can ensure that I remain aware of my partner's health no matter how much load my appliance is under.
  • Likewise, I can employ a high-level cyclic on the transmit side to ensure that outgoing heartbeat messages keep flowing to my partner no matter how heavily loaded my appliance may be.
  • Serial communication is dead-simple, time-tested, and battle-proven. Fewer than 3400 lines of code are required to provide all my serial functionality, including controlling my LEDs. That's around 5% of what we might expect an IB-based solution to require. And while the Ethernet driver is considerably larger, it once again does double-duty: it's the same driver used with the NICs that attach your appliance to the data centre networks.

As you can see, the Fishworks team kept hammering away at a few key design objectives throughout; perhaps the most important of these was a desire to minimise the amount and complexity of new software to be written. This is not to say there is not complexity in the clustering subsystem; there certainly is, and I'll discuss some of those areas in a later edict. But the foundation of the clustering design is as simple as it can be. Clustering is not right for every application or every shop: even with these design principles firmly in place, clusters are much more complex to manage and monitor than standalone appliances, entail significantly higher hardware costs (though as always in the Fishworks universe, there is no added software licensing fee), and however little code may be specific to clustering it certainly is not zero. That means there will be failures that occur in clusters which would not have occurred in a standalone configuration - in other words, that clustering can always reduce availability as well as enhance it. The Fishworks clustering design makes a commendable effort to make this unhappy outcome less likely than in traditional shared-storage clusters. In my next edict I'll discuss the exact circumstances in which I can help provide greater availability than a standalone appliance, and some of the cases not yet covered that the engineers are looking to include.


Wednesday Mar 12, 2008

OGB endorsements

A few people have asked how I'm voting in this year's elections. Here, then, are my endorsements:


  • John Sonnenschein
  • Brian Gupta
  • Joerg Schilling
  • Ben Rockwood
  • Stephen Lau
  • Al Hopper


  1. No
  2. No

Sunday Mar 11, 2007

C-Teams and the ARC as Community Groups

Rich Lowe asked an excellent question about OpenSolaris government in response to one of Casper Dik's answers to the DTrace Community. Here's the question, and my answer. As always, the other candidates' responses are available in the above-referenced thread.

And what would be your (the general "you", not just Casper) plans to help make the ARC and especially the C-team more practically part of OpenSolaris process, rather than a part of Sun process we're exposed to from one side, but not, so far, fully involved with?

Of the two, the ARC is much more difficult to rationalise; I'll explain why below. As for the C-teams and the more general problem of consolidation management, I'll let this text from my position paper[1] do the answering:

One of the OGB's most important tasks will be to rationalise the Community Group structure into one which will allow meaningful self-government. The centerpiece of my plan for doing this is construction of Consolidation-Sponsoring Community Groups (CSCGs). Each of these groups will be given control over an existing consolidation. This structure is not unlike that which exists today in the misnamed Nevada Community, representing ON. But that Community does not govern openly, and other consolidations are entirely missing structure under which they can be governed legitimately. Since the Constitution provides for the Community Group as the unit of independent government, each consolidation requires one to oversee its progress. The CSCGs will be responsible both for controlling the content of their codebases and for providing guidance and leadership to project teams desirous of integration. They will be required to adopt a set of rules (harmonised but not necessarily identical across all CSCGs) for integration and apply these rules fairly.

The challenge associated with the ARC (or ARCs) is that it maps poorly onto the Community Group structure. It makes little sense to me that an Architecture Community Group would sit alongside, say, an Observability Community Group. Observability incorporates a number of subsystems in the OS which in turn need to be properly integrated into each project. So would Reliability, or Virtualization. Architecture is not another such feature set but rather the way in which all those features, along with the new ones offered by the project, fit together and expose themselves to other consumers. That is, Architecture is both a superset of and yet entirely disjoint from all other CGs' areas of interest. The practical effect is substantial overlap: we would expect each CG to offer project teams advice concerning how best to integrate their work with existing features (and, for projects directly related to the CG's area of expertise, what features it should offer to others). In some ways, however, this directly conflicts with the mandate of an Architecture CG, which is to provide architecture guidance to all project teams. In the current system, an observability expert cannot override the ARC's decisions with respect to a proposed observability project. Yet under the Constitution, the Observability CG is supposed to be self-governing. The defining question is what exactly the latter CG is expected to govern, and by what mechanism - the very question the Constitution so conspicuously fails to answer.

It's easy enough in my CSCG model to simply require that all CSCGs adopt rules requiring architectural review by a particular CG just as they should require other CGs with expertise in relevant areas to review and perhaps approve each project prior to integration. Indeed, this is not unlike the system that exists today. The CSCGs do indeed have complete control over their areas of responsibility, namely, the existing consolidations. But this leaves all other CGs less equal, their endorsements subject to veto and without any code of their own to govern. A logical conclusion one could reach on this line of thinking is that CSCGs and perhaps the ARC should be the *only* CGs. The reality on the ground thus maps poorly to the Constitution we've been given, suggesting that the Framers either did not consider this matter in sufficient detail or intended much more radical changes in either the structure of consolidations, global review processes, or both. Mr. Fielding in fact hinted at just such an intent[2]:

We don't need to enshrine one committee's view of how C-Teams operate in an organization-wide constitution because C-Teams simply aren't relevant to *every* activity at OpenSolaris, and the vast majority of comments we have received so far clearly indicate that the existing consolidation boundaries are arbitrary AND dysfunctional. Personally, I am hoping that the communities feel empowered to change the things that are obviously causing them harm right now, and let the consensus process ensure that the traditions are adequately promoted and maintained over time.

Presumably Mr. Fielding and perhaps others have some grand detailed view of how all these things should be made to fit together in the rather obvious presence of existing bodies of code with no associated governing units and vice versa. Unfortunately, they've not seen fit to share that view nor to stand for election themselves. If consensus does not emerge within a few months as to an appropriate way to map the (possibly modified) existing practical devices of government onto the new constitutional structures, I'll probably favour amending the constitution rather than spinning our wheels forever trying to shoehorn OpenSolaris into a framework that may well be inappropriate to our broader goals.

At some point I'd like to hear Mr. Plocher and others more intimately involved with the operation of the ARC Community express their views on how that Community could be made to fit into the new Constitutional world of governing CGs. Their testimony will be needed before the OGB as it considers how best to restructure the Communities into meaningfully self-governing units.

  1. http://blogs.sun.com/wesolows/entry/ogb_election
  2. http://www.opensolaris.org/jive/message.jspa?messageID=99494#99494

Friday Mar 09, 2007

DTrace Community OGB Questionnaire

Leaders of the DTrace Community had a number of questions for the OGB candidates. Here's a copy of the questions and my answers. You can also see the other candidates' responses in the DTrace mailing list archives.

  • DTrace is one of only a small handful of OpenSolaris technologies that has actually been incorporated into other operating systems. Thus, your position on dual-licensing is very important to us; what is your position on dual-licensing in general?

    As I noted in my position paper: (a) the OGB does not control licensing, and (b) to the extent that the OGB would be consulted on the matter, I'm opposed to dual-licensing.

    The well-known opportunity it offers for license-based forks is a significant drawback that would have to be more than balanced by compelling benefits. No one has yet articulated such benefits, and I have found no evidence myself that they exist. The advantages presented by proponents of such a licensing scheme appear to be predicated on the idea that the second license would be GPLv3 (which is not yet complete), and that its use would dramatically increase the size of our community by drawing in the FSF as a partner in our technical work. Those are two large 'ifs' for a 'maybe' we're poorly positioned to handle.

  • Do you agree with the conclusions and decrees of CAB/OGB Position Paper # 20070207?

    Generally, yes. See above.

  • The OGB is responsible for the representation of OpenSolaris to third parties. If a third party were to inquire about incorporating DTrace into a GPL'd Program, what would be your response or position?

    I would note that my lay reading of the GPL would preclude that party from distributing the resulting product without violating the terms of that license. I would also advise that party to seek legal counsel, as with any licensing concern. That's as far as I'd go, however; the OGB does not hold the copyright to DTrace and is not in a position to warn or litigate against infringers.

  • DTrace is currently a Community Group, but some could argue that it would make more sense for DTrace to be a Project in (say) the Observability Community Group. In your mind, what is (or should be) the difference between a Community Group and a Project -- and where should DTrace fall?

    These two questions are not necessarily well-linked. The difference between a Project and a Community Group is straightforward. A Project owns one or more gates and does direct technical work with the intent to add or improve a specific aspect of the software they contain. A Community Group is the unit of independent government as defined by the Constitution; it is responsible for directing and guiding Project teams and others doing work that affects a broadly-defined set of interests.

    Others have suggested that a Project by definition has a finite life span (presumably terminating upon integration into a consolidation). I disagree with this definition - a Project (like DTrace) which provides a large and useful set of functionality will never be fully complete unless and until it is replaced wholesale. So long as the Project's work remains in use, it is important that some collaborative unit exist to provide a home for those using and improving it.

    DTrace is unquestionably a Project. Whether it deserves a Community Group of its own[0] depends on the granularity at which we wish to distinguish among Community Groups and the amount of overlap among them. That is, if Observability is held to be a Community Group distinguished from others at the correct granularity, DTrace should not be a separate CG, as its function would be a strict subset of another valid CG's. Instead, the DTrace leadership would be expected to participate in the Observability Group's activities, offering guidance and advocacy for consumers of its work. As part of that transition, mutually acceptable agreements regarding contributorship grants and leadership structure would need to be in place regarding the merged community (much like any corporate merger). Alternatively, however, I could envision a finer-grained set of Community Groups with some overlap; DTrace might fit alongside, for example, a Debuggers Community Group in such a scheme. My personal preference is for a smaller number of larger Community Groups, some of them controlling the long-term maintenance of consolidations and others providing guidance to project teams (and the consolidation owners) based on their particular areas of technical expertise. I believe this would promote a vision of our software as an integrated whole. Just as importantly, even under such a system, any large and ambitious Project would fall inside the scope of several Community Groups' areas of interest. Expecting project teams to interact with dozens of Groups' leaders would seem to introduce excessive and unnecessary complexity.

    [0] The existence of DTrace as a Project ought not preclude the existence of other Projects which seek to enhance it.

  • The Draft Constitution says next-to-nothing about where the authority lies to make or accept changes to OpenSolaris -- only that Projects operate at the behest of Community Groups, and that Community Groups can be "terminated" by the OGB. In your opinion, where does or should this authority lie? And do you believe that the Constitution should or should not make this explicit? Finally, under what grounds do you believe that a Community Group should be "terminated"?

    As I noted in my position paper, I believe the authority for code acceptance should reside with the Community Groups responsible for the targeted consolidation. Those CGs would be expected to delegate some or all of that authority in turn to specific individuals forming the C-Team for a particular release. While some minor changes will be needed to this strategy to accommodate open development, the basic process has worked well for a long time, and I see little reason to alter it radically.

    As I've noted in several messages, I would prefer that the Constitution had made at least some of this more explicit. The absence of this specification leaves in place a set of illegitimate Sun entities exercising effective control over matters the Charter clearly leaves to the OGB, and offers no transition plan, timetable, or framework in which to take over these functions. This will present an additional challenge to the first elected OGB.

    Community Groups formed under a coherent and comprehensive strategy such as the one I hope the OGB will provide should generally be terminated only for inactivity or some other clearly self-induced act of dissolution (such as a voluntary merger with another Community Group, approved by the OGB). Unfortunately, we also have a large number of existing Communities which do not fit well within any strategy one could retroactively imagine, and the OGB will be obligated to rationalise this situation. The process of doing so will likely involve terminating a number of these Communities and/or merging them with other Communities to form strategically valuable Community Groups. In the process, it is not unrealistic to suppose that some Communities may be terminated without the consent of their leaders. The OGB should seek to offer reasonable accommodation to the leaders of such Communities and work with them to find acceptable solutions that fit the strategic plan. My hope and expectation is that events of this type would occur very rarely after the initial realignment.

  • The Draft Constitution says that Community Groups (and in particular, the Community Groups' Facilitators) are responsible for "communicating the Community Group's status to the OGB"; what does this mean to you?

    My understanding was that the Working Group introduced the position of Facilitator for the purpose of maintaining a single first-line point of contact for each Community Group. The OGB should expect each Community Group to provide its membership list as required by the Constitution on a regular basis, and to propose desired changes in structure or termination (if any). Beyond that, I believe this requirement has little meaning to the OpenSolaris community; it seems to make more sense in the context of an Apache-like organisation in which many completely disjoint software engineering efforts are undertaken simultaneously by likewise disjoint groups overseen by the Foundation. Since the OGB is not responsible for technical decisions, it makes little sense to expect Community Groups to provide detailed information about the work they oversee in the absence of a specific conflict or other matter requiring the OGB's attention. In short, it makes no sense to sample data which you cannot usefully consume.

  • According to the Draft Constitution, "nominations to the office of Facilitator shall be made by the Core Contributors of the Community Group, but the OGB shall not be limited in their appointment to those nominated." Under what conditions do you believe that the selection of a Facilitator would or could fall outside of the nominations made by a Community Group's Core Contributors?

    The only example I can imagine is one in which the designated Facilitator has a proven history of unreliability or deception. It seems unlikely that such an individual would be nominated by a responsible Community Group, so in practice I doubt this clause will ever be exercised.

  • According to the Draft Constitution, "non-public discussion related to the Community Group, such as in-person meetings or private communication, shall not be considered part of the Community Group activities unless or until a record of such discussion is made available via the normal meeting mechanism." In your opinion, in the context of a Community Group like DTrace -- where a majority of the Core Contributors spend eight to ten hours together every work day -- what does this mean? Specifically, what does it mean to be (or not to be) "considered part of the Community Group activities"? And in your opinion, what role does the OGB have in auditing a Community Group's activities?

    I choose to interpret this as a sunshine provision, requiring that important decisions be undertaken in public, with the opportunity to participate extended to all those whose input might be considered useful. Since the Constitution provides no definition of "Community Group activities" other than voting, by implication this works in the same way as similar provisions in municipal charters.

    In the context of the DTrace Community Group, I take it to mean that matters which require a Community Group to vote must be presented on a public list with reasonable opportunity for comment before such a vote is taken.

    Outside of bootstrapping activities around organising and rationalising Community Groups, I see little proactive role for the OGB in auditing CG activities. The OGB should generally handle only conflicts which cannot be resolved within one or more CGs, and then only when requested by a party to the conflict. The Constitution does preclude the OGB from interfering with a CG's internal governance.

  • Historically, binary compatibility has been very important to Solaris, having been viewed as a constraint on the evolution of technology. However, some believe that OpenSolaris should not have such constraints, and should be free to disregard binary compatibility. What is your opinion?

    Those people are wrong. Binary compatibility is a great strength, one which can in nearly all cases be preserved without retarding progress. To the extent that binary compatibility requires deeper thought on the part of engineers, it also directly enhances the quality of new work. Solaris customers praise and appreciate this engineering philosophy and the results it offers them; we should offer the same benefits to customers of other distributions as well by maintaining compatibility and architectural consistency within all recognised consolidations. Naturally, consumers of OpenSolaris are free to incorporate the technology into their own products in whatever manner they choose, including the introduction of changes that violate these constraints. Such activities are outside the scope of the OGB to regulate.

  • If a third-party were to use and modify DTrace in a non-CDDL'd system, whose responsibility is it to assure that those modifications are made public? To put it bluntly: is enforcing the CDDL an OGB issue?

    The answer to the first question is "No one." Neither use nor modification triggers the requirement that modifications be distributed in source form (and additions, in particular, need not be distributed at all). Only distribution triggers this requirement, and it extends only to those to whom binaries are provided. If such a party did distribute binaries containing DTrace, it is that party's responsibility to ensure its own compliance with the license terms.

    Enforcement of the CDDL is not an OGB consideration. The OGB does not hold any copyrights and has not issued any licenses. If the OGB is notified of a license violation, it should (as a group of good citizens) pass the information along to the copyright holder, if his/her/its identity is known. For much of the code in OpenSolaris including DTrace, that copyright holder is Sun Microsystems, Inc. Further action is at the discretion of the copyright holder.

    It may well be within the scope of the OGB's activities to help educate contributors about the terms of the CDDL, but such a campaign would require the OGB to obtain legal counsel.

  • Do you have an opinion on the patentability of software? In particular, what is the role of the OGB -- if any -- if Sun were to initiate legal proceedings to protect a part of its software patent portfolio that is represented in OpenSolaris?

    The OGB does not own software patents (or any other property), and I have no position on the patentability of software in general. Sun has the right to enforce its property rights under the laws of the countries in which it operates, and the OGB has no authority to interfere with that enforcement. Since community members who adhere to the terms of the licenses offered for OpenSolaris have limited (but adequate for all uses permitted under the CDDL) licenses to patents represented within that body of code, there is no reason for the OGB to worry about this. If such an event were to occur, the OGB might profitably offer a simple statement to this effect, clarifying the facts of the situation and denying incorrect rumours. Whether such an action would be necessary or appropriate would depend on the specific circumstances.

  • When you give public presentations, do you run OpenSolaris on your laptop? Have you ever given a public demonstration of OpenSolaris technology?

    Yes, I use OpenSolaris exclusively with the exception of interoperability testing. Yes, I have demonstrated new technology in Solaris 10 (now in OpenSolaris as well) at OSCON in 2004 and 2005, and the early OpenSolaris build system technology at SVOSUG in 2005.

  • And an extra credit question: Have you ever used DTrace? When did you most recently use it, and why? The answers "just now" and "to answer this question" will be awarded no points. ;)

    Yes, I've used DTrace. I most recently used it earlier this week while diagnosing the behaviour of two machines in an HA cluster. I've also written a (never-integrated) System V IPC provider for OpenSolaris and introduced USDT probes to enhance the observability of several aspects of daemon behaviour.

Thursday Mar 08, 2007

OGB Election

OpenSolaris Governing Board elections begin next week. In addition, a single question will be presented to the voters: Shall the proposed Constitution be ratified? Please take the time to read this important document and learn about the issues being debated by the candidates. As a candidate for an OGB seat, I can help you right here and now with the latter task; I'd appreciate five minutes of your time to learn where I stand on some of these issues. I welcome questions; you can send mail to all candidates to ask your questions. I'll be posting here my answers to any questions I receive in this fashion.

  • The Constitution


    I've pointed out a number of issues with the Constitution (see the 'constitutional limitations' thread) and continue to believe that the proposal as written positions us poorly to achieve independence from Sun, accomplish useful technical work, and provide leadership. Nevertheless, the alternative (last paragraph) is unlikely to be any better, thanks to some unfortunate decisions made by Sun. Therefore I support ratification and urge you to vote in favour.

  • Community Structure


    One of the biggest gaps in the Constitution is how the existing codebases are to be managed, controlled, and led. Indeed, the document does not even acknowledge their existence, despite the fact that they are the primary purpose for, and value in, OpenSolaris's existence. One of the OGB's most important tasks will be to rationalise the Community Group structure into one which will allow meaningful self-government. The centerpiece of my plan for doing this is the construction of Consolidation-Sponsoring Community Groups (CSCGs). Each of these groups will be given control over an existing consolidation. This structure is not unlike that which exists today in the misnamed Nevada Community, representing ON. But that Community does not govern openly, and other consolidations entirely lack a structure under which they can be governed legitimately. Since the Constitution provides for the Community Group as the unit of independent government, each consolidation requires one to oversee its progress. The CSCGs will be responsible both for controlling the content of their codebases and for providing guidance and leadership to project teams seeking integration. They will be required to adopt a set of rules (harmonised but not necessarily identical across all CSCGs) for integration, and to apply these rules fairly.

  • Projects

    MINOR CHANGES are needed here.

    The bar for project creation is very low today: if two Members believe a Project ought to exist, it does. This benefits everyone by allowing virtually unrestricted exploration of new spaces and approaches, but it also encourages duplication of effort and investment in projects which are not positioned to succeed. I would like to see this approach altered: instead of directing project creation requests to a giant unmoderated mailing list (see more on this below), I would prefer to see them directed to one or more Community Groups, including (when relevant) the CSCG to which the project is targeted for integration. During a one-week initial review period, members of those Community Groups would be expected to provide feedback on the proposed project, informing its backers of related or conflicting ongoing work, the need for inclusion of additional or alternate Community Groups in the review, and risks and opportunities the project would offer. Just as importantly, this is an opportunity for Community Groups to inform the project's backers of the actions and choices the project team would need to make in order to secure those Groups' endorsements. It is expected that, by the time a project seeks integration into a consolidation, it will have secured the endorsements of all relevant Community Groups; this process will give the project team a leg up on understanding what will be required to do so, and help them make contacts and forge working relationships within those Groups. At the end of the initial review period, the project team will be required to indicate to the OGB's project-creation delegate whether, in light of the feedback received, it wishes to proceed. This decision cannot be vetoed, but a project which fails to secure the endorsement of relevant Groups will have much more work to do later if integration is desired.

    It is worth noting that integration need not be a project team's goal: some projects may be worthwhile on their own, may eventually lead to the formation of new consolidations, or may be intended solely as exploratory efforts that may yield innovative work later used elsewhere. We must not discourage these teams nor should we send them elsewhere to do their work. At the same time, we should provide a framework in which project teams desiring integration can learn early what will be required and work continuously throughout the life of the project with the technical leaders of relevant Community Groups.

  • Dual- or re-licensing

    I am OPPOSED to either of these steps at this time.

    It's important to note that the OGB does not control the licenses offered for OpenSolaris source because it does not hold the copyrights. Only Sun can offer additional or alternate licenses. Therefore, this position is relevant only to the extent that Sun seeks the OGB's guidance on the matter. The arguments for and against changes to the licensing regime have been discussed at length; I will not repeat them here. I have two main observations: First, licensing changes appear to be a solution in search of a problem. No proponent of such changes has articulated clearly the problem(s) which such a change would solve. Given the risks and costs, I would expect a clear and convincing case to be made that license changes are necessary; that threshold has not been met. Second, the main benefits posited by advocates of licensing change center around an increase in the size and stature of our community. Unfortunately, we are ill-positioned for growth; our institutions and infrastructure are in dismal shape. Any large influx of contributors would lead to more complaints and flames but little additional useful work. If we desire to grow, we must first position ourselves to leverage fully our existing contributor base. Until then, a focus on growth makes no sense. Similarly, I have little concern for our 'stature' in the broader Free Software community. If the FSF or a similar organisation would like us to change our licensing to better suit their interests, or to form a partnership to deliver interesting and useful products, we should remain open to such offers if they would benefit all parties. Since no such offer has been made openly, there is little reason to consider hypothetical partnerships as a key benefit of a licensing change.

  • Infrastructure

    The OGB and the Tools Community must exert leadership; BLIND RELIANCE ON SUN IS NOT THE ANSWER

    The OGB must formulate a plan with dates and milestones for opening defect tracking to community participation, establishing review, approval, and archival mechanisms for change submissions, and increasing the transparency and utility of the ARC process. The OGB must also establish rules that Community Groups will be expected to follow regarding acceptance and integration of opaquely-managed projects (namely, that non-grandfathered projects of this type must not be permitted to integrate until, at minimum, a sufficient period for public review has elapsed). Since Sun currently has a variety of tools for managing these processes, it would of course be nice if they would make those tools available to us. However, Sun's resources for doing so are limited, and in some cases the tools are poorly suited to use outside a LAN. The most important such example is the Bugtraq2/Bugster defect tracking system. Lack of access to this system is a major roadblock to open development, and Sun has not offered a plan to address this problem. The OGB must seek a firm commitment from Sun to open access to this system in an acceptable way, and must hold Sun to agreed-upon milestones in that plan. If Sun declines to offer an acceptable plan for doing so, or fails to uphold its agreement, the OGB must assist the Community Groups, notably but not exclusively Tools, in designing and constructing suitable replacements. I would like the OGB to issue a Call for Proposals for solving the defect tracking problem with a deadline of May 20. Sun is especially invited to submit a proposal. The OGB would then evaluate the proposals, giving special weight to one which would allow access to the existing body of information in Bugtraq2, and establish and monitor progress toward the chosen proposal. Other infrastructure problems (code review and archival, ACLs and Wiki-like features, RTI handling, etc.) should be handled in a similar fashion.

    This general framework is proving itself effective in the SCM project and we should not hesitate to use it in the future rather than expecting Sun to "do something" "someday."

  • Communication

    The OGB MUST DO MORE to improve the signal-to-noise ratio, and to communicate its own activities more clearly

    Several members have complained, with good reason, about how information regarding the election was communicated. The OGB has at times communicated poorly with the other members. I would like to see the OGB use opensolaris-announce (a read-only list containing all members) more heavily to communicate information of universal interest. Correspondingly, opensolaris-discuss should never be used to convey any official information, nor to seek input or feedback from all members. Instead, the OGB should provide a set of mailing lists open to all in which topics related to governance can be discussed. When input or feedback is desired on a particular issue, the OGB should announce a Call For Discussion via opensolaris-announce, pointing interested members to the appropriate topical list. Naturally, the traffic on -announce should be kept low, but neither should we be afraid to use it when appropriate: it is a highly effective way to reach all members without requiring them to subscribe to a largely useless list with minimal signal and excessive noise. I will recommend that the OGB adopt a policy that its members not subscribe to -discuss, so as to force the board to communicate with all members on an equal footing. In short, subscription to such a high-volume, low-S/N list wastes time and resources that could be better spent working on real problems in more focused venues. The OGB should strongly encourage the use of appropriate topical, project-, or Community Group-sponsored lists for technical questions, proposals, and announcements. The general discussion list may well be reserved for flames, offtopic "water-cooler" conversation, and sophomoric hand-wringing over OpenSolaris's future. No one who does useful work should have to filter such tripe in order to keep up with important news.

  • Culture and Leadership

    A QUALITY-CENTRIC ENGINEERING CULTURE is one of our greatest assets; the OGB must encourage and strengthen that culture.

    The OGB is not intended to make technical decisions; these are to be made by Community Groups. Nevertheless, the OGB must position these Groups to enforce sound engineering philosophy, and provide them with the tools and support needed to do so. There is far too often a perception that the "movers and shakers," those who want to "cut red tape" and "just solve problems," are the community's true leaders. At times, this is indeed true. But engineering also requires a sober, cautious approach to new problems, especially those which are poorly understood. The existence of process and review is neither an accident nor red tape. Instead, these tools help us make the right decisions - decisions that will remain with us for many years. The OGB should urge, and where appropriate, force, its Community Groups to keep this in mind as they evaluate proposals and requests. Expressions of enthusiasm and a can-do spirit are welcome, but should not be confused with commitment or full agreement. It can take weeks or months of work to validate or discredit a particular approach to a problem. The Community Groups that withhold commitment to a particular approach until that time has passed will be the most successful.

  • The role of Sun

    Sun's engineers are IMPORTANT CONTRIBUTORS but Sun Microsystems, Inc. is JUST ANOTHER DISTRIBUTOR of our technology and enjoys NO SPECIAL STANDING.

    One of the largest challenges the OGB will face is encouraging the formation of decision-making bodies that operate openly and are independent of Sun, while still ensuring that the interests of Sun and other distributors are well-served. Far too much of our activity today takes place entirely within Sun in a largely opaque fashion. For example, the Solaris PAC, an entity mentioned nowhere in the Charter or Constitution, still believes it has the authority to set integration rules for each build. And, in part because no alternate framework exists for making these decisions, SPAC in fact does - improperly - exercise this authority. The OGB is responsible for taking over these functions with respect to OpenSolaris and providing a framework in which these actions can be taken openly. None of this should be taken to imply that the OGB exercises control over Solaris (Sun's distribution); like any other distributor, Sun remains free to ship whatever products it wishes without regard for the OGB or any other action of the OpenSolaris community. But to the extent that it wishes to undertake actions which conflict with openly-established policies, it must branch or fork in order to do so. If we make our decisions properly, with input from all stakeholders and with adequate transparency, then Sun's or another distributor's choice to do so will be both healthy and desirable. It may not always be possible to meet the needs of every possible member of our community, and sometimes Sun's corporate interests may be the ones we cannot serve. For now, however, our focus must be on building credible and authoritative institutions which are independent but not ignorant of Sun.

    I should note that I work for Sun, although not for the business unit responsible for Solaris. However, I am running as an independent individual, not as a representative of Sun or any other entity. I have in the past expressed skepticism about, and disagreement with, the positions of Sun (and Sun executives) on various issues of interest to our community, and I will continue to do so in the future when appropriate. The OGB is not beholden to Sun or anyone else, and its members are expected to act accordingly. Neither corporations nor corporate representatives are permitted to serve - by design. I ask for your confidence in my ability and determination to act independently for the common good.

Wednesday Feb 08, 2006

A louder voice for the fault manager

The Solaris reference implementation of the fault manager recently got a boost in its ability to report faults with the introduction of a two-part SNMP agent. This agent makes it easy to integrate the Solaris fault manager into existing SNMP-based monitoring infrastructure.


The fault manager has always been able to report faults to the system log and console(s), and to provide a wealth of status information via fmadm(1M) and fmdump(1M). But these reporting mechanisms leave much to be desired; syslog messages must be parsed, and a busy central log host can easily lose important messages in the noise. Worse still, a privileged user must log into the affected system and run administrative commands to obtain any needed information that isn't contained in the message.

SNMP is a natural choice for extending the reach of the fault manager's voice; it's widely used to facilitate centralised monitoring of events throughout and even across administrative domains. The basic model is simple and extensible; information can be pushed from any device to one or more network management stations (NMSs), or pulled by an administrator or automated utility from a particular device of interest. Managed devices - in this case, a Solaris system - signify events using traps (also called notifications in SNMPv2), which provide a limited amount of information to designated NMSs. They also provide access to a management information base (MIB) on demand. Generally, the MIB provides access to a much greater breadth and depth of information than is transmitted with a trap or notification. An NMS can be configured to retrieve additional data from the MIB upon receipt of a trap if desired.


The technology described here is available in Solaris Nevada builds 33 and later. OpenSolaris offers access to the sources. A prerequisite for building or using these applications is the installation of the SMA packages provided by the SFW consolidation; BFUing newer ON bits is not sufficient. If you have SWAN access, you can run /ws/onnv-gate/public/bin/update_sma to get the necessary packages; otherwise see the OpenSolaris download center for the packages.

A Note on NMS Configuration

If you use the Net-SNMP-based NMS software delivered in Solaris, as I do below, you will want to tell the client utilities to use the fault management MIB to encode and decode OIDs. The easiest way to do this is to add MIBS=+ALL to your environment. You can also make this permanent by creating (or adding to) /etc/sma/snmp/snmp.conf the line:

    mibs +ALL
See snmp.conf(4) for more information on MIB searching and importing. If you use a different NMS, consult your vendor's documentation to learn how to import a new MIB.

snmp-trapgen: an SNMP plugin for fmd(1M)

The trap or notification generator component is snmp-trapgen. This is a very simple fault manager plugin similar to that which logs fault information to the system log and console. Instead of writing formatted text to a log device, however, this plugin generates SNMPv1 traps and/or SNMPv2 notifications, one for each destination configured in the systemwide snmpd.conf(4). No additional configuration is required; if you have already configured a system to send traps to one or more NMSs, you don't need to do anything else to be notified upon fault diagnosis. If not, you'll want to add v1 or v2 trap destinations to /etc/sma/snmp/snmpd.conf. The hostnames or addresses you use will need to be configured to receive and act upon SNMP traps or notifications. If you don't have an NMS on your network, you can use the snmptrapd(1M) server included with Solaris.
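If you haven't configured any trap destinations yet, a minimal sketch (the hostname and community string here are placeholders, not values from this system) might look like this in /etc/sma/snmp/snmpd.conf:

```
# SNMPv1 traps and SNMPv2 notifications, respectively, to our NMS
trapsink   nms.example.com  public
trap2sink  nms.example.com  public
```

Each destination listed will receive one trap or notification per fault diagnosis.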

A fault diagnosis trap (sunFmProblemTrap) includes a limited subset of the information contained in the syslog message associated with the fault. Specifically, the diagnosis's UUID, diagnostic code, and reference URL are included. The object identifiers (OIDs) for these data are defined by the fault management MIB, SUN-FM-MIB, installed in /etc/sma/snmp/mibs/. The same information is delivered to both SNMPv1 and SNMPv2 trap sinks. At present, this is the only trap defined by the fault management MIB, but others may be generated in the future. Here's an example of an SNMPv2 notification as decoded by snmptrapd(1M):

2006-02-07 16:36:34 stomper [192.xx.xx.xx]:

        DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (2266748911) 262 days, 8:31:29.11
        SNMPv2-MIB::snmpTrapOID.0 = OID: SUN-FM-MIB::sunFmProblemTrap
        SUN-FM-MIB::sunFmProblemUUID."a58aa105-4fab-6e16-8557-ab7687113de7" = STRING: "a58aa105-4fab-6e16-8557-ab7687113de7"
        SUN-FM-MIB::sunFmProblemCode."a58aa105-4fab-6e16-8557-ab7687113de7" = STRING: SUN4U-8000-KA
        SUN-FM-MIB::sunFmProblemURL."a58aa105-4fab-6e16-8557-ab7687113de7" = STRING: http://sun.com/msg/SUN4U-8000-KA
The diagnostic code and URL can be used to find knowledge base articles describing the fault and suggested corrective action. The diagnosis UUID can be used to get further detail from fmdump(1M), or from the MIB, as seen in the next section.

libfmd_snmp: a MIB plugin for the System Management Agent (SMA)

Knowing that a fault has been diagnosed is important, but the amount of information delivered with the trap or notification may not be enough to provide an administrator with a complete understanding of the problem. The fault management MIB defines a wealth of detail, and this detail is made available via SMA by libfmd_snmp. In addition to fault diagnosis detail, this MIB also offers information about faulty components and the configuration of the fault manager itself, similar to that offered by fmadm(1M).

Enabling the plugin requires configuring the master SNMP agent on each server you wish to query. Adding the architecture-dependent line

    dlmod sunFM /usr/lib/fm/sparcv9/libfmd_snmp.so.1
to /etc/sma/snmp/snmpd.conf will cause the MIB plugin to be automatically loaded and initialised the next time the master agent is started, such as via /etc/init.d/init.sma. In the future, SMA will be managed via SMF; see 6349499[0].

No further configuration is necessary, although the usual snmpd.conf(4) directives will allow you to restrict access to the MIB, which may be important to you since some of the information it provides is ordinarily restricted to privileged users.
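As an illustrative sketch (the community string and source network are hypothetical), Net-SNMP's access directives can scope remote access down to the Sun enterprise subtree that contains the fault management MIB:

```
# read-only access from the management network, limited to .1.3.6.1.4.1.42
rocommunity  fmro  192.0.2.0/24  .1.3.6.1.4.1.42
```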

The fault management MIB provides 4 tables and a single scalar, in addition to the trap/notification described above. sunFmProblemTable and sunFmFaultEventTable are logically two pieces of the same table; they are separated only because MIBs do not support nested tables. The problem table contains the scalar information about each diagnosis, while the fault event table contains lists of the events associated with each diagnosis. Both tables are indexed by diagnosis UUID; the fault event table utilises a second scalar index to distinguish between multiple events associated with a diagnosis. In response to the trap above, you might want to know which Automated System Recovery Unit(s) (ASRU(s)) the fault manager believes may have caused the fault. This is just a fancy way of saying we want to know what broke to trigger the diagnosis. Because each ASRU is associated with a fault event, we'll first need to know how many fault events were associated with this diagnosis so that we can then look up each one's ASRU in the fault event table. To do this, we'll use snmpget(1M), delivered by Solaris in /usr/sfw/bin. Of course, you can use any NMS software.

    nms$ snmpget -c public -v 2c stomper \
        SUN-FM-MIB::sunFmProblemSuspectCount.\"a58aa105-4fab-6e16-8557-ab7687113de7\"
    SUN-FM-MIB::sunFmProblemSuspectCount."a58aa105-4fab-6e16-8557-ab7687113de7" = Gauge32: 1
This diagnosis has only one fault event associated with it. To look up the ASRU, we'll look in the fault event table entry indexed by the UUID and the fault index. Since fault events are indexed starting from 1, we'll need to do:
    nms$ snmpget -c public -v 2c stomper \
        SUN-FM-MIB::sunFmFaultEventASRU.\"a58aa105-4fab-6e16-8557-ab7687113de7\".1
    SUN-FM-MIB::sunFmFaultEventASRU."a58aa105-4fab-6e16-8557-ab7687113de7".1 = STRING: cpu:///cpuid=4/serial=23EBEC1505
Most NMSs offer scripting facilities that allow you to perform actions similar to these in response to a trap. Alternately, you could poll the data on a regular basis. Many implementations do both, using polling to offset the risk of losing traps, which, like all SNMP datagrams, are not reliably delivered. SNMPv3 informs, also known as acknowledged notifications, offer only a partial remedy to this problem, and are not supported by snmp-trapgen at this time.

A polling NMS may wish to poll the systemwide faulty component count, provided by the MIB as sunFmFaultCount. An increase in this gauge without a corresponding problem trap is a good indication that the trap has been lost. More detail about devices the fault manager believes to be in degraded or faulted states is available via the sunFmResourceTable; walking this table provides a ready - and remote - answer to the common question "What's broken on that machine?" For this, we use the snmpwalk(1M) utility:

    nms$ snmpwalk -c public -v 2c stomper sunFmResourceTable
    SUN-FM-MIB::sunFmResourceFMRI.1 = STRING: cpu:///cpuid=4/serial=23EBEC1505
    SUN-FM-MIB::sunFmResourceStatus.1 = INTEGER: degraded(3)
    SUN-FM-MIB::sunFmResourceDiagnosisUUID.1 = STRING:
Finally, the sunFmConfigTable offers remote access to the same information provided by fmadm(1M)'s config subcommand; like the other tables, it can be accessed using snmpget(1M), snmpwalk(1M), or any other SNMP-compatible NMS implementation. You can find the complete fault management MIB at the Fault Management community site, and in build 33 and later at /etc/sma/snmp/mibs/SUN-FM-MIB.mib.

[0] The bug should be visible, but it isn't. This is itself a bug, which the SFW team is working to fix.

Friday Jan 27, 2006

More on Drivers

For those who attended the SVOSUG meeting last night and are looking for boilerplate code similar to that Max presented, you can find it in the Device Driver Tutorial. This gentle introduction also includes a trivial but functional pseudo device implementation.

Monday Dec 05, 2005

GCC inline assembly, part 2

Long ago, I promised to write more about gcc inline assembly, in particular a few cases that are tricky to get right. Here, somewhat belatedly, are those cases. These examples are taken from libc, but the concepts apply to any inline assembly fragments you write for gcc. As I mentioned previously, these concerns apply only to gcc-style inlines; the Studio-style inline format doesn't require that you use this same level of caution.

gcc expects you to write assembly fragments (even in a "separate" inline function) as if they are logically a part of the caller. That is, the compiler will allocate registers or other appropriate storage locations to each of the input and output C variables. This requires that you instruct the compiler very carefully as to your use of each variable, and the variables' relationships to one another. The advantage is much better register allocation; the compiler is free to allocate whatever registers it wishes to your input and output variables in a manner that is transparent to you.

Instead, Studio requires that you code the fragment as if it were a leaf function, so the compiler does not do any register allocation for you. You are permitted to use the caller-saved registers any way you wish, and even to use the caller's stack as if you are in a leaf function. Arguments and return values are stored in their ABI-defined locations. Depending on the optimization level you use, this can be wasteful of registers (though the peephole optimizer can often clean up some of this waste) and can also make writing the fragment much more difficult. In exchange, however, you don't have to be nearly as careful to express the fragment's operation to the compiler.

Inputs, Outputs, and Clobbers (oh my!)

Each assembly fragment may have any or all of outputs, inputs, and clobbers. Each input and output maps a C variable or literal to a string suitable for use as an assembly operand. These operands can then be referenced as %0, %1, %2, etc. These are ordered beginning from 0 with the first output, followed by the inputs. Alternately, newer versions of gcc allow the use of symbolic names for each input and output. Clobbers are somewhat different; they express the set of registers and/or memory whose values are changed by the fragment but are not expressed in the outputs. Inputs which are also changed must be listed as outputs, not clobbers. Normally, the clobbers include explicit registers used by certain instructions, but may also include "cc" to indicate that the condition code registers are modified and/or "memory" to indicate that arbitrary memory addresses have had their contents altered.


Outputs and inputs are expressed as constraints, in a language specifying the type of operand that will contain the value of a variable. Common constraints include "r", indicating that a general register should be allocated, and "m" indicating that some type of memory location should be used. The complete list of constraints is found in the gcc documentation. These constraints may contain modifiers, which give gcc more information about how the operand will be used. The most common modifiers are "=", "+", and "&". The "=" modifier is used to indicate that the operand is output-only; it may appear only in the constraint for an output variable. Even if the constraint is applied to a variable containing an existing value in your program, there is no guarantee that it will contain that value when your assembly fragment is executed. If you need that, you must use the "+" modifier instead of "="; this tells the compiler that this operand is both an input and an output. Nevertheless, the variable with this constraint is provided only in the outputs section of the fragment's specification. An alternate way to express the same thing is provided in the documentation. Note that providing the same variable as both an input and an output does not guarantee you that the same location (register, address, etc.) will be used for both of them. Thus the following is generally incorrect:

static inline int
add(int var1, int var2)
{
	__asm__(
		"add	%2, %0, %0"
	: "=r" (var1)
	: "r" (var1), "r" (var2));

	return (var1);
}
The "&" modifier is used on an output operand whose value is overwritten before all the input operands are consumed. This exists to prevent gcc from using the same register for both the input and output operands. For example, for swap32() (see also the Studio inline function), we might think to write:
extern __inline__ uint32_t
swap32(volatile uint32_t *__memory, uint32_t __value)
{
	uint32_t __tmp1, __tmp2;
	__asm__ __volatile__(
		"ld [%3], %1\n\t"
		"1:\n\t"
		"mov %0, %2\n\t"
		"cas [%3], %1, %2\n\t"
		"cmp %1, %2\n\t"
		"bne,a,pn %%icc, 1b\n\t"
		"  mov %2, %1"
		: "+r" (__value), "=r" (__tmp1), "=r" (__tmp2)
		: "r" (__memory)
		: "cc");
	return (__tmp2);
}

But suppose gcc decided to allocate o0 to both __tmp1 and __memory. This is allowable, because the "=r" constraint implies that the corresponding register is set only after all input-only operands are no longer needed (input/output operands obviously don't have this problem). In the case above, the first load would clobber o0 and the cas would operate on an arbitrary location. Instead, we must write "=&r" for both __tmp1 and __tmp2; neither variable may safely be allocated the same register as the input operand.

Bugs caused by omitting the earlyclobber are painful to track down because they often appear and disappear from one compilation to the next as entirely unrelated code changes cause increases or decreases in register pressure.

This is not an academic concern. Consider this example program:


static __inline__ void
incr32(volatile uint32_t *__memory)
{
        uint32_t __tmp1, __tmp2;
        __asm__ __volatile__(
        "ld [%2], %0\n\t"
        "1:\n\t"
        "add %0, 1, %1\n\t"
        "cas [%2], %0, %1\n\t"
        "cmp %0, %1\n\t"
        "bne,a,pn %%icc, 1b\n\t"
        "  mov %1, %0"
        : "=r" (__tmp1), "=r" (__tmp2)
        : "r" (__memory)
        : "cc");
}

uint32_t
func(uint32_t x)
{
        uint32_t y = 4;
        uint32_t z = x + y;

        incr32(&y);

        z = x + y;

        return (z);
}
gcc compiles this (use -O2 -mcpu=v9 -mv8plus) into:
    func:                   9c 03 bf 88  add          %sp, -0x78, %sp
    func+0x4:               9a 10 20 04  mov          0x4, %o5
    func+0x8:               90 02 20 04  add          %o0, 0x4, %o0
    func+0xc:               da 23 a0 64  st           %o5, [%sp + 0x64]
    func+0x10:              82 03 a0 64  add          %sp, 0x64, %g1
    func+0x14:              c2 00 40 00  ld           [%g1], %g1	<===
    func+0x18:              9a 00 60 01  add          %g1, 0x1, %o5
    func+0x1c:              db e0 50 01  cas          [%g1] , %g1, %o5	<= SEGV
    func+0x20:              80 a0 40 0d  cmp          %g1, %o5
    func+0x24:              32 47 ff fd  bne,a,pn     %icc, func+0x18
    func+0x28:              82 10 00 0d  mov          %o5, %g1
    func+0x2c:              81 c3 e0 08  retl         
    func+0x30:              9c 23 bf 88  sub          %sp, -0x78, %sp

In this case, gcc has allocated g1 to both __tmp1 and __memory, and o5 to __tmp2. Note the highlighted instructions: the initial load destroys the value of g1, and the subsequent cas will attempt to operate on whatever address was stored at *__memory when the fragment began. In this example, that value will be 4 (g1 is assigned sp+0x64, which is simply the address of y). This program is compiled incorrectly due to improper constraints, and will cause a segmentation fault if the code in question is executed.

If instead we use "=&r" for both __tmp1 and __tmp2, gcc generates the following code:

    func:                   9c 03 bf 88  add          %sp, -0x78, %sp
    func+0x4:               9a 10 20 04  mov          0x4, %o5
    func+0x8:               90 02 20 04  add          %o0, 0x4, %o0
    func+0xc:               da 23 a0 64  st           %o5, [%sp + 0x64]
    func+0x10:              82 03 a0 64  add          %sp, 0x64, %g1
    func+0x14:              d8 00 40 00  ld           [%g1], %o4	<===
    func+0x18:              9a 03 20 01  add          %o4, 0x1, %o5
    func+0x1c:              db e0 50 0c  cas          [%g1] , %o4, %o5	<= OK
    func+0x20:              80 a3 00 0d  cmp          %o4, %o5
    func+0x24:              32 47 ff fd  bne,a,pn     %icc, func+0x18
    func+0x28:              98 10 00 0d  mov          %o5, %o4
    func+0x2c:              81 c3 e0 08  retl         
    func+0x30:              9c 23 bf 88  sub          %sp, -0x78, %sp

This code now assigns o4 to __tmp1, which eliminates the problem described above. This function, however, still does not do the right thing. Why not?


Compilers keep track of where each live variable in the program can be found; many variables can be found both at some memory location and in a register. Sometimes, the compiler chooses to use a register for a different variable, and stores the value back to its memory location (if it has changed) before doing so. Later, if this value is needed, the value must be loaded back into a register before being used. This is known as reloading. Other reasons reloading may be required include a variable's declaration as volatile and the case that concerns us here, a variable's modification via side effects.

In the example above, incr32() is actually operating on a memory address, not a register. So why did we assign __memory the "r" constraint instead of more correctly expressing the constraint as "+m" (*__memory)? It turns out that the "m" constraint allows a variety of possible addressing modes. On SPARC, this includes the register/offset mode (such as [%sp+0x64]). This is fine for instructions like ld and st, but the cas instruction is special: it allows no offset. No constraint exists to describe this condition; the "V" constraint is clearly similar but is not correct; a bare register ([%g1]) is an offsettable address, so "V" would actually exclude the case we want. Conversely, "o", the inverse constraint of "V", includes the register/offset addressing mode we specifically wish to exclude. So, the only way to express this constraint is "r". But this does nothing to capture the fact that although the pointer itself is not modified, the value at *__memory is altered by the assembly fragment. Is this a problem? Let's look at the assembly generated for func() a little more closely:

    func:                   9c 03 bf 88  add          %sp, -0x78, %sp
    func+0x4:               9a 10 20 04  mov          0x4, %o5
    func+0x8:               90 02 20 04  add          %o0, 0x4, %o0	<===
    func+0xc:               da 23 a0 64  st           %o5, [%sp + 0x64]
    func+0x10:              82 03 a0 64  add          %sp, 0x64, %g1
    func+0x14:              d8 00 40 00  ld           [%g1], %o4
    func+0x18:              9a 03 20 01  add          %o4, 0x1, %o5
    func+0x1c:              db e0 50 0c  cas          [%g1] , %o4, %o5
    func+0x20:              80 a3 00 0d  cmp          %o4, %o5
    func+0x24:              32 47 ff fd  bne,a,pn     %icc, func+0x18
    func+0x28:              98 10 00 0d  mov          %o5, %o4
    func+0x2c:              81 c3 e0 08  retl         			<===
    func+0x30:              9c 23 bf 88  sub          %sp, -0x78, %sp

We see that gcc has assigned z the o0 register, which is not surprising given that it's the return value. But after o0 is set to x + 4 at the beginning of the function, it's never set again. The line z = x + y has been discarded by the compiler! This is because it does not know that our inline assembly modified the value of y, so it did not reload the value and recalculate z.

There are two ways we can correct this problem: (a) add a "+m" output operand for *__memory, or (b) add "memory" to the list of clobbers. This is a special clobber that tells gcc not to trust the values in any registers it would otherwise believe to hold the current values of variables stored in memory. In short, this clobber tells gcc that all registers must be reloaded if the correct value of a variable is required. This is somewhat inefficient when we know which piece of memory has been touched, so (a) is preferable for better performance. Whichever solution we choose, gcc now compiles our code to:
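For concreteness, here is a sketch of what fix (a) does to incr32()'s operand specification. Note that adding a third output shifts the operand numbers, so the memory references in the template become [%3] (shown for illustration; only the operand list changes):

```c
	/* fix (a): tell gcc that *__memory is both read and written */
	: "=&r" (__tmp1), "=&r" (__tmp2), "+m" (*__memory)
	: "r" (__memory)
	: "cc");
```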

    func:                   9c 03 bf 88  add          %sp, -0x78, %sp
    func+0x4:               9a 10 20 04  mov          0x4, %o5
    func+0x8:               98 10 00 08  mov          %o0, %o4
    func+0xc:               da 23 a0 64  st           %o5, [%sp + 0x64]
    func+0x10:              82 03 a0 64  add          %sp, 0x64, %g1
    func+0x14:              d6 00 40 00  ld           [%g1], %o3
    func+0x18:              9a 02 e0 01  add          %o3, 0x1, %o5
    func+0x1c:              db e0 50 0b  cas          [%g1] , %o3, %o5
    func+0x20:              80 a2 c0 0d  cmp          %o3, %o5
    func+0x24:              32 47 ff fd  bne,a,pn     %icc, func+0x18
    func+0x28:              96 10 00 0d  mov          %o5, %o3
    func+0x2c:              d0 03 a0 64  ld           [%sp + 0x64], %o0	<===
    func+0x30:              90 03 00 08  add          %o4, %o0, %o0	<===
    func+0x34:              81 c3 e0 08  retl         
    func+0x38:              9c 23 bf 88  sub          %sp, -0x78, %sp

Note the reload, which will now return the correct result. There are actually two other ways to correct this, although the use of "+m" is the most correct. First, we could declare z to be volatile in func(). This would force gcc to reload its value from memory any time that value is required. Use of the volatile keyword is mainly useful when some external thread (or hardware) may change the value at any time; using it as a substitute for correct constraints will cause unnecessary reloading, degrading performance. Second, and perhaps best of all, the compiler could be modified to accept a SPARC-specific constraint for use with the cas instruction, one which requires the address of the operand to be stored in a general register.

You can find more inline assembly examples in libc (math functions), MD5 acceleration, and the kernel illustrating these concepts. Be sure to read and understand the documentation completely before writing your own inline assembly for gcc, and always test your understanding by constructing and compiling simple test programs like these.

Tuesday Aug 16, 2005

Broken allocators and paleolithic debugging strategies

Not so long ago I was looking through Solaris's shells for memory allocators - functions that perform tasks similar to malloc(3c). These functions often store the size of the allocated block at the beginning of each block; if that size is stored as a 4-byte value, the return value from the allocator may not be aligned on an 8-byte boundary. This is a major problem on SPARC, because it's not uncommon to allocate structs or unions containing types that require 8-byte alignment, especially long long. As it turns out, gcc correctly assumes that long long variables are aligned on 8-byte boundaries and uses the ldd and std instructions to access them. Our Studio compiler doesn't; it always issues two ld or st instructions. The result is that programs using this kind of allocator can crash when built with gcc but not with Studio, not a pleasant condition.

As part of my search, I found that, indeed, the Bourne and Korn shells have some alignment problems. Though these are bugs, we've decided that there's no reliable way to find all possible bugs of this type, so we worked around them in the compiler as well as fixed the ones we've found. This is, if nothing else, a good argument against compilers that "help" programmers by covering up this kind of error. But the best prize of all wasn't the kind of problem I was looking for, but rather this gem from the C shell:

        printf("i=%d: Out of memory\n", i);

This is the systems programming equivalent of finding a live wooly mammoth contentedly smoking a cigar in your recliner. Unfortunately, there's no way to trigger this behaviour, as it's protected by the "debug" preprocessor symbol, which we never set in a normal build. Nevertheless, thanks to OpenSolaris, you can see it for yourself.

We harp incessantly on the need to be able to debug production code, with no recompilation needed; there are a number of better ways to debug this particular condition. For example, you could use the DTrace pid provider to stop a csh process when nomem() is called, and even provide a backtrace. If that weren't enough, you could then use mdb(1) to debug the problem in greater detail, or gcore(1) to produce a core dump. But the best part, the real joy, if you'll pardon the pun, is the chdir call. Clearly the purpose was to drop core in a predictable location for later analysis by the author. I think you'll find that coreadm(1M), along with other corefile improvements, offers a far more flexible and powerful way to accomplish this - and it complements nicely the other debugging strategies I mentioned above.

Tuesday Aug 02, 2005

Premium Drinks

Tuesday and Wednesday nights (after the extravaganza on Tuesday and the OpenSolaris BOF on Wednesday) we'll be convening for potent beverages, good food, and unique and amusing company. I'll be at the Lloyd Center DoubleTree in downtown Portland, OR, room 1560. Expect other OpenSolaris personalities to be present. Laura tells me that souvenir shot glasses are among the after-party swag collection, so don't miss out.

Monday Aug 01, 2005

OpenSolaris at OSCON

Those of you in or near Portland, Oregon are encouraged to come and see us at OSCON this week. Most of the conference is at the Convention Center this year (use the helpfully-named Convention Center train stop). Sun will have a booth in the exhibit hall starting Wednesday, and we're giving a few talks as well. In particular, join Bryan and me for a free tutorial on building, installing, and developing with OpenSolaris using DTrace, mdb, and more. That will be held Tuesday at 1:30pm in room D140. Then on Wednesday, I'll be giving a short talk on the status of OpenSolaris at 2:35pm in Portland/255, and we'll have a BOF at 8:30pm. Thursday, don't miss Bryan's short talk on DTrace at 4:30pm.

Even if you can't make the conference, you're welcome to join me for a beer. Send me mail at wesolows at eng dot sun dot com if you're interested, or leave a message for me at the 5th Avenue Suites.

Tuesday Jun 14, 2005

The First OpenSolaris Project: GCC Support

OpenSolaris is (finally) available. I've been working on this every day I've been with Sun, though others have spent years on the effort, and it's an amazing milestone. Unlike most launches, though, this is the beginning of a new effort rather than the end of one. As much as we've done already, there's far more left to be done before OpenSolaris can fulfil all our promises and achieve all our goals.

One promise we have fulfilled today is our commitment to make OpenSolaris accessible to people without the money or desire to buy compilers. Since most of Solaris is normally built with the Sun Studio compilers, this meant we'd need either to provide the compilers on the same terms as Solaris (also required to build OpenSolaris sources), or modify the sources to build and work with the GNU C compiler, available with source and free of charge under the terms of the GNU GPL. For reasons more illustrative of bureaucracy and human nature than of technological difficulties, we were unsure almost until the moment of launch whether we would be able to provide the Studio compilers under acceptable terms; therefore, another engineer and I have spent the last two and a half months porting OpenSolaris to gcc.

At this point I had a nice writeup on inline assembly differences between the Studio and GNU compilers. But it relies on source code that isn't available yet - namely, the gcc-specific inline assembly files. So instead I'll talk about why it happened that way and why it's actually a good thing. I'll also talk about some straight-up bugs we found in the process of porting.

We received word that a final Studio license had been agreed upon on June 3 - just 11 days ago! The license is free-as-in-beer and, although somewhat vague, seems reasonable enough. Of course, I prefer using only Free Software and promoting it whenever possible (as we're going with OpenSolaris), so I'd really rather use gcc. Our plan of record was to make a merged workspace available as "official" OpenSolaris. There were three sets of changes that needed to be merged together in the last three days leading up to launch: the gcc changes, which edit about 2500 files (mostly to fix compiler warnings), a large wad of renames to support the separation of code we're releasing now from that which we're hoping to release later (thousands of renames), and the coup de grâce, the addition of the CDDL license block to over 24,000 files. In the end, this gigantic 3-way merge proved impractical: there were over 1700 conflicts to resolve. Most are trivial and can easily be automerged by TeamWare, our revision control system, but the sheer volume and shortened schedule would have made adequate testing impossible.

Instead of the three-way merge, then, we elected to take the minimum amount of change we could: the addition of the CDDL blocks and the separation of released from unreleasable source. That meant gcc support would not ship in the "official" sources - but it could still be made available to the developer community. This is important for several reasons - first, it illustrates an important principle: FCS quality all the time. That is, if it's not good enough for a customer, it's not good enough to be putback. Since there was no doubt in anyone's mind that the gcc work was not ready for either, that meant it also wasn't good enough to call OpenSolaris. Second, it offers us an opportunity to provide a glimpse into the way projects work. One of the most common questions we get is "so, if the gate always has to be golden, how does any major work ever get done?" Like most people, we do major work in "branches" off the trunk. TeamWare supports children of children and merging of independent workspaces with common ancestry, so that no complicated branching apparatus is needed as for CVS. What will be available on the gcc project page will be that project gate. You're invited to participate - there are over 300 mostly very small bugs to fix.

One of the most significant kinds of bug we found was programs writing into string constants, confirming Osborne's Law. These programs ordinarily work properly because the Studio compilers place the string constants in the .data section or some other writable data section. The flag -xstrconst changes this behaviour, placing the strings in .rodata or a similar read-only segment and thus also allowing them to be shared. This reduces runtime memory usage but comes at a cost: buggy programs that attempt to write to the constant strings will trigger a segmentation violation and normally die. gcc acts as if this flag were always on, and applies it to other const data types as well. The end result is greater enforcement of correctness at the cost of crashes.

Fortunately fixing these is very easy. For example, I fixed bug number 6281909 (you're supposed to be able to see bugs, too, but it doesn't seem to include the bugs of interest) by fixing the selector function not to assume it can write '=' and '\0' into its arguments. Note that the correct use of 'const' can help prevent this kind of problem.

The original article on inline assembly will appear when the source it references appears - and you can help make that happen sooner: check out the gcc project page.
