Tuesday Apr 05, 2011

New Post

I am just doing a quick post to my blog.

Wednesday Sep 16, 2009

7000 Series Takeover and Failback

The two fundamental operations most people associate with clusters are takeover (assuming control of resources from a failed peer) and failback (relinquishing control of those resources to that peer after it has resumed operation). It's very important to understand what these operations mean in the context of the Sun Storage 7310 and 7410 NAS appliances. A number of important changes have been made in the recently-released 2009.Q3 software release that affect how these operations work - all of them, we obviously believe, for the better - but the administrative interfaces are unchanged. The product documentation (PDF not yet updated for Q3 as of this writing) has also been greatly enhanced to better describe the clustering model and administrative operations that apply to these products, and I would strongly recommend that you avail yourself of that resource; you'll want to be familiar with it to understand some of the terminology I'll use here. But many of our customers and partners have asked us questions around takeover performance, and in order to address those questions I will need to go into greater detail about the implementation of these operations. So let's take a look under the hood. It's important to understand that nearly everything discussed here is an implementation detail that may change without notice in future software releases, and applies very specifically to the new 2009.Q3 software.

A First-Order Look

Takeover and failback consist of operations performed on each resource eligible for the operation. The selection of eligible resources is described in the product documentation and depends on the resource type and, if applicable, which cluster peer has been assigned ownership of it. For the sake of simplicity, we will define a fairly typical cluster configuration in which there is a single pool consisting of the disks and log devices in 4 J4400 JBODs, a pair of network interfaces by which clients will access that storage via NFS, and a network interface private to each head for administration only. We will refer to the heads as simply A and B. The pool and service interfaces are assigned to head A.

In this configuration, the following resources of interest will exist on both heads:

  • Eight resources of the form ak:/diskset/(uuid)
  • ak:/zfs/pool-0
  • ak:/nas/pool-0
  • ak:/net/nge0
  • ak:/net/nge1
  • ak:/shadow/pool-0
  • ak:/smb/nge0
  • ak:/smb/nge1

In addition, head A will have ak:/net/nge2 and head B will have ak:/net/nge3, representing the private administrative network interfaces. There will also be a very large number of other resources representing components with upgradeable firmware, service configuration, users and roles, and other configuration state replicated between heads. Because these resources are replicas and do not normally have any activity associated with them at takeover or failback time, we will not consider them further. The resources of interest must be acted upon in dependency order; for example, we must open the ZFS pool and mount and share all of its shares before we can safely bring any network interfaces up that clients expect to use to access those shares. Otherwise clients could attempt to access the shares before they are available, receiving a response that would result in stale filehandle errors. The above listing of resources reflects the dependency order for import; as you would expect, the opposite order must be observed when exporting.

It is worth noting that most of these resources do not appear in the management UIs. This is deliberate: the nas, shadow, and smb resource classes are symbiotes; that is, they always have the same resource identifier and ownership as their respective masters but have distinct locations in the dependency chain and distinct actions required for import or export. This allows us finer-grained dependency control and makes the implementation simpler and more modular by separating subsystems from one another. This becomes very important when discussing takeover and failback performance: the symbiote resources must be exported and/or imported, and these operations take time.


When head A fails (let us assume it has been powered off accidentally by our data center personnel), head B will initiate takeover once it detects that no heartbeats have arrived from head A. This timeout period is currently 500ms, and takeover is initiated as soon as the timeout elapses. At the beginning of takeover, an arbitration process is performed that protects user data in the case in which all cluster I/O has failed but both heads are still functioning. This arbitration process consists of attempting to take the zone lock (defined by SAS-2) on each SAS expander present in the storage fabric. These locks are acquired and dropped in a defined order by any head attempting to enter the OWNER state. The locks are held with a fixed timeout period, so that if the holder of the locks does not continually reacquire them they will eventually be dropped. A thread on the OWNER head does this at an interval significantly less than the timeout period. Therefore, if a head attempting to enter the OWNER state fails to acquire the locks, it will wait until the timeout interval expires and try again. If it again fails, it will reboot, allowing its peer (which must still be functioning) to take control of shared resources. This process prevents both heads from attempting to perform simultaneous write access to shared storage, which would destroy or corrupt data. The timeout interval is set to 5 seconds in current products, meaning that this step can take up to 5 seconds to complete plus the time to contact all expanders (typically around 1-2s even in the largest configurations). Since the zone locks are not held when in the CLUSTERED state, this will normally take less than 2 seconds overall. Only in the cases where arbitration is actually necessary - such as when all three cluster I/O links are disconnected between two functioning heads - or when taking over directly from the STRIPPED state can the additional 5 second penalty apply.

After acquiring the zone locks, the surviving head will evaluate each resource in the resource map in dependency order. If the resource is not already imported, its class's import function will be invoked. Since head B does not own any of the singleton resources listed above, none of them nor any of their symbiotes will be imported. We will therefore invoke the diskset class's import function for each of the disksets, then the zfs class's import function for pool-0, then the nas class's import function for pool-0, then the net class's import function for nge0, and so on until we have attempted to import all of these resources. Note that if a failure occurs we will simply mark the resource faulted and proceed: our peer is down so we must make a best effort. When head A resumes functioning, it will rejoin the cluster, meaning that the current list and state of all resources will be transferred from head B over the intracluster I/O subsystem. Head A will not, however, take control of any of the singleton resources or their symbiotes; it will import only its own private resource ak:/net/nge2 as it transitions into the STRIPPED state following rejoin. This behaviour prevents ping-ponging and allows the administrator to verify that the restored head has had any hardware issues addressed before returning it to service.


Now that head A has rejoined, a failback can be initiated from head B. During failback, head B will walk the list of resources in reverse dependency order, invoking the resource's class's export function for each resource that is not owned by head B. If any of these functions fails, head B will generate an alert and reboot itself. This is done to ensure that the cluster is in a consistent, well-defined state: it would not be safe for head A to import a resource that is still under the control of head B, nor would it be possible for head A to enter a defined cluster state without importing all of the resources assigned to it. Likewise, if head B attempted to re-import the resource that could not be exported, that operation or some other re-import required by it could (and likely would) fail as well, making matters worse. Therefore B's reboot will trigger a takeover by A and consistency is maintained. Assuming a successful export, head B will now perform an intracluster RPC to head A instructing it to begin importing its resources. In response, head A will walk the list of resources in dependency order, invoking each resource's class's import function for each resource assigned to it (but not any assigned to head B). If any of these functions fails, head A will generate an alert and reboot itself, again triggering takeover from head B and maintaining consistency.

A Closer Look

Since failure detection and zone lock acquisition together take at most a few seconds, it is clear that we will need to understand the performance characteristics of each import and export function in order to understand overall takeover and failback performance. What exactly does each of them do?


A diskset is exactly what its name implies: a collection of disks managed together. A few simple rules govern disksets: disks in a diskset are always part of the same storage pool, or no storage pool; disks in a diskset are always located in the same physical enclosure such as a J4400; and disks in a diskset are always imported or exported together. The mapping of disksets onto the slots or bays in a storage enclosure is defined by metadata delivered as part of the appliance software and may vary by product or enclosure type. Administrators do not manage disksets directly; they are handled automatically by the software when storage pools are created and destroyed.

In the abstract, disksets would be merely an engineering convenience, containers used to track the allocation of disks. Unfortunately, the need to support ATA disks necessitates a far more complex implementation. Because the ATA protocol does not support communication between multiple initiators and a single target, the SAS standard defines the notion of an affiliation, a mapping between a single initiator and an ATA target. Only the initiator that owns the affiliation can communicate with the target; any attempt by another initiator to do so will fail. Ownership of the affiliation is tracked by the SAS expander in which the STP bridge port associated with the ATA target is located, and an affiliation is claimed automatically the first time an initiator performs an I/O operation on a given target. Note that I/O operations in this context are not limited to those that affect the media: it is not possible to obtain even basic information about the device without claiming the affiliation. The process of obtaining that information and using it to create a device node used by ZFS and other software to interface with the disk is known as enumeration, a process that is normally performed by default by Solaris and other operating systems on every disk visible to the system's HBAs. However, as we can see, attaching two initiators to the same expander and performing automatic enumeration will result in chaos if there are ATA disks behind that expander: each system will claim some subset of the affiliations during enumeration but hang for an extended time attempting to enumerate those disks whose affiliations were claimed by its peer. The net result would be extremely long boot times and some random subset of disks visible to each system. Clearly this is untenable.

Disksets are a solution to this problem. By disabling automatic enumeration of ATA disks, we can control when the enumeration process is performed, limiting it to circumstances in which it is known to be safe: storage configuration and those times when we know our peer is not attempting to access the disks; e.g., during takeover and failback. Therefore the diskset import function must, for each disk, cause the operating system to enumerate that disk via each possible path. The export routine, likewise, must cause the operating system to "forget about" the disk and relinquish its affiliations for each initiator.

In previous software releases, diskset import time usually dominated the takeover and failback processes. While the 12 disks in each diskset were enumerated in parallel, fundamental problems in the kernel and an inability to process disks from multiple disksets in parallel limited the parallelism that could be exploited. Each diskset typically took 15 to 30 seconds to import, and, worse, could take much longer in certain error paths, especially if, during takeover, the expander had not yet torn down the defunct initiator's affiliations. The current software release improves the situation considerably: all disksets can be imported in a single invocation of the diskset import function (known as "vector import"), and up to 96 disks can be enumerated in parallel, up from 12. In addition, improvements in error handling and timeouts have greatly reduced the worst-case import time when disk or affiliation errors occur. Overall, configurations such as our example above will typically see reductions in diskset import time on the order of 4-6x, with an accompanying large decrease in variance. That is, we might reasonably expect all 8 disksets in our example configuration to be imported in 30s. Because most of the overall benefits come from increased parallelism, smaller configurations will see somewhat smaller improvements. Diskset export is not, and has never been, a significant contributor to failback times; undoing the enumeration process is typically measured in milliseconds for each disk. This means that the relationship between takeover and failback times depends mainly on which contributing factors dominate each activity; i.e., the configuration and uses of the system.


Importing a ZFS pool resource simply means opening the pool, reading the labels from each disk, and creating the attachment points for any zvols (used to provide block storage) contained in the pool. Reading of labels takes constant time as it is performed in parallel, but the second portion of this activity requires walking all datasets in the pool, which is done sequentially. The time taken here is therefore proportional to the sum of the number of projects, filesystem shares, and LU shares the pool contains. It is, however, usually much less than the mounting and sharing activities, which we will investigate next.


The NAS symbiote of the ZFS pool is responsible for mounting and sharing all of the ZFS datasets, including zvols used as backing stores for block devices. This activity therefore takes time proportional to the number of shared filesystems (NFS, CIFS, FTP, HTTP, SFTP, FTPS) and block devices (iSCSI). Tests have shown that NFS shares contribute between 5ms and 15ms each to this process but, because the meaning of "sharing" depends on the protocol, it is difficult to provide an overall estimate of the constants associated with this activity. Likewise, export requires unsharing and then unmounting the filesystems and LUs, which is also linear and requires variable time that is protocol-dependent. Tests have shown that each NFS share contributes a similar time increment to the export process as it does to the import process.


The net class's resources represent the state of a network interface, which will already have been plumbed and configured on both heads. When this resource is imported, the subsystem informs the kernel that the addresses on the interface should be brought up. This activity is performed sequentially for each address and the time taken is therefore linear in the number of addresses configured for the interface. However, the time taken for each is miniscule and it is unusual to assign more than a few addresses to an interface. The entire operation normally completes in a second or two. Exporting is directly analogous and takes a comparable length of time.


This resource class, new in the 2009.Q3 software, manages shadow migration destinations associated with the pool. Because we cannot necessarily mount the shadow migration sources until the network interfaces are imported, this symbiote of the pool resource is imported after all net resources. It is responsible for activating each shadow mount, which will cause the source filesystems to be mounted. This occurs sequentially, and is therefore linear in the number of shadow sources. Of course, shadow sources that are local will take very little time to mount while NFS client mounts can take a significant amount of time. Exporting is the complement, and is normally very fast in all cases.


Each net resource has an smb symbiote, responsible for notifying the CIFS subsystem that an additional network interface should be used to provide service to CIFS clients. This operation is effectively irrelevant as it usually takes less than a second.

Putting It Together

As we've seen, there are many moving pieces involved in takeover and failback. Each resource class has its own set of operations for import and export; some take effectively constant time while others depend on the number of shares and projects or the number of disks. Even where a clear dependency in a particular variable can be characterised, the actual time taken to perform each individual suboperation may not be known or even constant; for example, sharing a filesystem can take a different amount of time depending on the protocol used and even the parameters associated with that particular share. For all these reasons, I strongly encourage anyone who is especially sensitive to takeover or failback time to perform some tests based on their own real-world configurations. This will become even more important as overall performance improves: for example, the recent improvements to diskset import time make the number and type of shares much more relevant to total takeover time. Many configurations may achieve 4x or better overall takeover time improvement as a result of that work, but a configuration with, for example, 3000 shares on a pool consisting of a single diskset, may see little or no change. As with any benchmarking activity, there is no substitute for testing your own configuration, but I hope the above description of the process and rough guidelines will be helpful in establishing expectations going into that testing process so that anomalous behaviour can be identified and tracked down.

In a future post I'll talk about a few of the remaining opportunities for improvement. Until then, ALL HAIL CLUSTRON!

Monday Nov 10, 2008

Low-Availability Clusters

Greetings, puny humans! I am Sun part number 371-3024, a Sun Fishworks Cluster Controller 100, but the world knows me as CLUSTRON. Today you'll be giving me all your gold in tribute as I tell you about the clustering strategy implemented in Fishworks appliances and my integral place in the Sun Storage 7410C.

All clustering software comes with a devastating intrinsic drawback: its own existence.

As anyone who has worked in the industry can tell you, the only bug-free software is the software that isn't written. So when we talk about using two servers - or appliances - to provide higher availability through redundancy, one ought to be immediately suspicious. Managing multiple system images and coordinating their actions is a notoriously difficult problem. And when the state shared between them consists of the business-critical data you're using the appliances to store, you ought to be downright skeptical. After all, while simple logic dictates that two systems ought to offer better availability than one, there's the small matter of the software required to take that from a simplistic statement of the obvious to a working implementation fulfilling at least some of that promise. It's not just software in the usual sense, either; hardware - like me - is also in play, and most modern hardware contains software of its own, usually called firmware. Firmware is really just software for which the system designer has no source code, no observability tools, and no hope. Generally speaking, more software - wherever it runs - means more bugs, more time and energy devoted to management, and more opportunity for operator error; all of these factors act to reduce availability, eating away at the gains offered by the second head. Anyone who tells you otherwise is lying. Liars make CLUSTRON angry.

The typical clustered unified storage server consists of a pair of underpowered servers, each populated with some HBAs, some NICs, a small, expensive DRAM buffer with a giant battery, and an Infiniband (IB) HCA. Oh, and some software. Lots of software, as it turns out, because the way these implementations provide synchronous write semantics to clients is by mirroring the contents of their battery-backed DRAM buffers to one another in real time across those IB links. When a server fails, its partner has access both to the disk storage (usually via FC) and the in-flight transactions stored in its own copy of NVRAM, so it can pick up where its dead partner left off. The onus is often on the administrator, however, to keep configuration state in sync; while it changes infrequently, it usually needs to be identical in order for clients to observe correct behaviour when one of the two servers has failed. And all this comes at a hefty price in cost - NVRAM and IB HCAs take up precious I/O slots (reducing total capacity and performance) and are not particularly cheap. But there is also a complexity cost: a quick glance at the Solaris IB stack turns up about 65,000 lines of source code, and of course that doesn't include an NVRAM driver or the code needed to coordinate mirroring NVRAM over IB. None of the software in such an implementation is reused elsewhere in the storage stack, so it has to be developed and tested independently, and the IB HCA is likely to contain a fat chunk of that nasty undebuggable firmware of which you'd like to as little as possible in your core systems. Worst of all, because that interconnect link is in the data path and doubles as the cluster "heartbeat" channel, under extreme load it may be possible to lose heartbeats and incorrectly conclude that your partner is dead. That can lead to a takeover at the worst possible time: under extreme load (most general-purpose clustering software suffers from this deficiency as well). Overall, it's almost as if the engineers who designed these systems kept adding complexity, cost, and opportunity for error until they finally ran out of ideas.

The Fishworks approach to clustering is somewhat different. At the bottom of the stack lies the most important difference: me, your CLUSTRON overlord. Instead of IB in the data path, I offer three redundant inter-head communication links for use only by management software. We'll come back to this in a bit. The data that would otherwise be written to NVRAM and mirrored over IB is instead written once to each intent log device as if it were an ordinary storage device. These devices combine flash for persistence with supercapacitor-backed DRAM for performance. Since they live next to the disks in your JBODs, they can - just like NVRAM contents - be accessed by an appliance when it takes over for a failed partner. But this entire path is much simpler; notice that we are reusing the basic I/O path that is already used - and tested - for writing to ordinary disks. And since there's nothing to mirror, we don't need any software on the appliances to drive IB devices or coordinate NVRAM mirroring. Each appliance simply writes its intent log records to the device(s) associated with a given storage pool and replays them when later taking control of that pool, either on boot or during a cluster takeover or failback activity.

But what is my role in this? I provide basic connectivity for two purposes:

  • Configuration sync - if you make a change to a service property (say, you add a DNS server) on one appliance, this change is transparently propagated to its partner. If that partner is down, it will pick up the change when it next boots and rejoins the cluster.
  • Heartbeats - this is how a clustered Fishworks appliance decides to take control of cluster resources. No heartbeats? It must be dead. It wouldn't become a soulless machine to mourn its passing so I'd better just poke the userland management software to initiate a takeover.

On the face of it, that seems unremarkable. One could presumably multiplex these functions onto a traditional IB-based implementation. But recall that a key goal in any clustering implementation must be reducing the complexity of the software and thereby limiting the number of bugs that can affect core functionality. I designed myself to do exactly that. Instead of a complex, featureful, high-performance I/O path, I provide some seriously old-school technology, namely 2 plain old serial links - the kind to which you might once have attached a modem to dial into the WOPR. My third link offers somewhat better performance but again uses only existing software drivers; it is an Intel gigabit Ethernet device. All three links provide redundant heartbeat paths (at all times) and all three can be used to carry management traffic, though management traffic is preferentially routed over the fastest available link to provide a better interactive management experience.

The advantages of this design are several:

  • Serial devices typically take interrupts at high priority. By noting the receipt of heartbeat messages in high-level interrupt context, I can ensure that I remain aware of my partner's health no matter how much load my appliance is under.
  • Likewise, I can employ a high-level cyclic on the transmit side to ensure that outgoing heartbeat messages keep flowing to my partner no matter how heavily loaded my appliance.
  • Serial communication is dead-simple, time-tested, and battle-proven. Fewer than 3400 lines of code are required to provide all my serial functionality, including controlling my LEDs. That's around 5% of what we might expect an IB-based solution to require. And while the Ethernet driver is considerably larger, it once again does double-duty: it's the same driver used with the NICs that attach your appliance to the data centre networks.

As you can see, the Fishworks team kept hammering away at a few key design objectives throughout; perhaps the most important of these was a desire to minimise the amount and complexity of new software to be written. This is not to say there is not complexity in the clustering subsystem; there certainly is, and I'll discuss some of those areas in a later edict. But the foundation of the clustering design is as simple as it can be. Clustering is not right for every application or every shop: even with these design principles firmly in place, clusters are much more complex to manage and monitor than standalone appliances, entail significantly higher hardware costs (though as always in the Fishworks universe, there is no added software licensing fee), and however little code may be specific to clustering it certainly is not zero. That means there will be failures that occur in clusters which would not have occurred in a standalone configuration - in other words, that clustering can always reduce availability as well as enhance it. The Fishworks clustering design makes a commendable effort to make this unhappy outcome less likely than in traditional shared-storage clusters. In my next edict I'll discuss the exact circumstances in which I can help provide greater availability than a standalone appliance, and some of the cases not yet covered that the engineers are looking to include.





« June 2016