Low-Availability Clusters

Greetings, puny humans! I am Sun part number 371-3024, a Sun Fishworks Cluster Controller 100, but the world knows me as CLUSTRON. Today you'll be giving me all your gold in tribute as I tell you about the clustering strategy implemented in Fishworks appliances and my integral place in the Sun Storage 7410C.

All clustering software comes with a devastating intrinsic drawback: its own existence.

As anyone who has worked in the industry can tell you, the only bug-free software is the software that isn't written. So when we talk about using two servers - or appliances - to provide higher availability through redundancy, one ought to be immediately suspicious. Managing multiple system images and coordinating their actions is a notoriously difficult problem. And when the state shared between them consists of the business-critical data you're using the appliances to store, you ought to be downright skeptical. After all, while simple logic dictates that two systems ought to offer better availability than one, there's the small matter of the software required to take that from a simplistic statement of the obvious to a working implementation fulfilling at least some of that promise. It's not just software in the usual sense, either; hardware - like me - is also in play, and most modern hardware contains software of its own, usually called firmware. Firmware is really just software for which the system designer has no source code, no observability tools, and no hope. Generally speaking, more software - wherever it runs - means more bugs, more time and energy devoted to management, and more opportunity for operator error; all of these factors act to reduce availability, eating away at the gains offered by the second head. Anyone who tells you otherwise is lying. Liars make CLUSTRON angry.

The typical clustered unified storage server consists of a pair of underpowered servers, each populated with some HBAs, some NICs, a small, expensive DRAM buffer with a giant battery, and an Infiniband (IB) HCA. Oh, and some software. Lots of software, as it turns out, because the way these implementations provide synchronous write semantics to clients is by mirroring the contents of their battery-backed DRAM buffers to one another in real time across those IB links. When a server fails, its partner has access both to the disk storage (usually via FC) and the in-flight transactions stored in its own copy of NVRAM, so it can pick up where its dead partner left off. The onus is often on the administrator, however, to keep configuration state in sync; while it changes infrequently, it usually needs to be identical in order for clients to observe correct behaviour when one of the two servers has failed. And all this comes at a hefty price in cost - NVRAM and IB HCAs take up precious I/O slots (reducing total capacity and performance) and are not particularly cheap. But there is also a complexity cost: a quick glance at the Solaris IB stack turns up about 65,000 lines of source code, and of course that doesn't include an NVRAM driver or the code needed to coordinate mirroring NVRAM over IB. None of the software in such an implementation is reused elsewhere in the storage stack, so it has to be developed and tested independently, and the IB HCA is likely to contain a fat chunk of that nasty undebuggable firmware of which you'd like to as little as possible in your core systems. Worst of all, because that interconnect link is in the data path and doubles as the cluster "heartbeat" channel, under extreme load it may be possible to lose heartbeats and incorrectly conclude that your partner is dead. That can lead to a takeover at the worst possible time: under extreme load (most general-purpose clustering software suffers from this deficiency as well). Overall, it's almost as if the engineers who designed these systems kept adding complexity, cost, and opportunity for error until they finally ran out of ideas.

The Fishworks approach to clustering is somewhat different. At the bottom of the stack lies the most important difference: me, your CLUSTRON overlord. Instead of IB in the data path, I offer three redundant inter-head communication links for use only by management software. We'll come back to this in a bit. The data that would otherwise be written to NVRAM and mirrored over IB is instead written once to each intent log device as if it were an ordinary storage device. These devices combine flash for persistence with supercapacitor-backed DRAM for performance. Since they live next to the disks in your JBODs, they can - just like NVRAM contents - be accessed by an appliance when it takes over for a failed partner. But this entire path is much simpler; notice that we are reusing the basic I/O path that is already used - and tested - for writing to ordinary disks. And since there's nothing to mirror, we don't need any software on the appliances to drive IB devices or coordinate NVRAM mirroring. Each appliance simply writes its intent log records to the device(s) associated with a given storage pool and replays them when later taking control of that pool, either on boot or during a cluster takeover or failback activity.

But what is my role in this? I provide basic connectivity for two purposes:

  • Configuration sync - if you make a change to a service property (say, you add a DNS server) on one appliance, this change is transparently propagated to its partner. If that partner is down, it will pick up the change when it next boots and rejoins the cluster.
  • Heartbeats - this is how a clustered Fishworks appliance decides to take control of cluster resources. No heartbeats? It must be dead. It wouldn't become a soulless machine to mourn its passing so I'd better just poke the userland management software to initiate a takeover.

On the face of it, that seems unremarkable. One could presumably multiplex these functions onto a traditional IB-based implementation. But recall that a key goal in any clustering implementation must be reducing the complexity of the software and thereby limiting the number of bugs that can affect core functionality. I designed myself to do exactly that. Instead of a complex, featureful, high-performance I/O path, I provide some seriously old-school technology, namely 2 plain old serial links - the kind to which you might once have attached a modem to dial into the WOPR. My third link offers somewhat better performance but again uses only existing software drivers; it is an Intel gigabit Ethernet device. All three links provide redundant heartbeat paths (at all times) and all three can be used to carry management traffic, though management traffic is preferentially routed over the fastest available link to provide a better interactive management experience.

The advantages of this design are several:

  • Serial devices typically take interrupts at high priority. By noting the receipt of heartbeat messages in high-level interrupt context, I can ensure that I remain aware of my partner's health no matter how much load my appliance is under.
  • Likewise, I can employ a high-level cyclic on the transmit side to ensure that outgoing heartbeat messages keep flowing to my partner no matter how heavily loaded my appliance.
  • Serial communication is dead-simple, time-tested, and battle-proven. Fewer than 3400 lines of code are required to provide all my serial functionality, including controlling my LEDs. That's around 5% of what we might expect an IB-based solution to require. And while the Ethernet driver is considerably larger, it once again does double-duty: it's the same driver used with the NICs that attach your appliance to the data centre networks.

As you can see, the Fishworks team kept hammering away at a few key design objectives throughout; perhaps the most important of these was a desire to minimise the amount and complexity of new software to be written. This is not to say there is not complexity in the clustering subsystem; there certainly is, and I'll discuss some of those areas in a later edict. But the foundation of the clustering design is as simple as it can be. Clustering is not right for every application or every shop: even with these design principles firmly in place, clusters are much more complex to manage and monitor than standalone appliances, entail significantly higher hardware costs (though as always in the Fishworks universe, there is no added software licensing fee), and however little code may be specific to clustering it certainly is not zero. That means there will be failures that occur in clusters which would not have occurred in a standalone configuration - in other words, that clustering can always reduce availability as well as enhance it. The Fishworks clustering design makes a commendable effort to make this unhappy outcome less likely than in traditional shared-storage clusters. In my next edict I'll discuss the exact circumstances in which I can help provide greater availability than a standalone appliance, and some of the cases not yet covered that the engineers are looking to include.



Post a Comment:
  • HTML Syntax: NOT allowed



« April 2014