Monday Jul 30, 2007
Wednesday Feb 07, 2007
By relling on Feb 07, 2007
My colleagues Tim Read, Gia-Khan Nguyen, and Bob Bart have recently released an excellent white paper which cuts through the fog surrounding the complementary functions of Oracle Clusterware, RAC, and Solaris Cluster. The paper is titled Sun™ Cluster 3.2 Software: Making Oracle Database 10g R2 RAC even More “Unbreakable.”
People ask us why they should use Solaris Cluster when Oracle says Clusterware alone will do. Daily. Sometimes several times per day. The answer is not always obvious, and we view the products as complementary rather than mutually exclusive. This white paper shows why you should consider both products, which work in concert to provide very highly available services. I would add that Sun and Oracle work very closely together to make sure that the Solaris platform running Oracle RAC is the best Oracle platform in the world. There is an incredible team of experts working together to make this happen.
Kudos to Tim, Gia-Khan, and Bob for making this readily available for Sun and Oracle customers.
Thursday Jan 18, 2007
By relling on Jan 18, 2007
Just a quick note, to those who might get confused. It seems that marketing has decided to rename the technologies formerly known as "Sun Cluster" to "Solaris Cluster." Old habits die hard, so forgive me if I occasionally use the former name.
Branding is very important and, as you've probably seen over the years, not what I would consider to be Sun's greatest strength. But in the long run, I think this is a good change.
Thursday Dec 28, 2006
By relling on Dec 28, 2006
Sun Cluster version 3.2 has arrived just in time for the holidays. There are a number of features that we have been waiting for in this release. Here is my favorite list, in no particular order:
- ZFS support - ZFS is now supported as a fail-over file system (HAStoragePlus). This is an important milestone for the use of ZFS in mission critical systems.
- Sun Cluster Quorum Server - this is another feature I've wanted for many years. Prior to Sun Cluster 3.2, we used a shared storage device as a voting member for the quorum algorithm. The quorum algorithm ensures that only one cluster is operating with the storage at any time. For a two-node cluster, the shared storage was the tie-breaking vote. For three or more nodes, you can configure the nodes themselves to break the tie with or without storage votes. Over the years we've been stung many times by the implementations of SCSI reservations in storage devices. There are many uhmm... inconsistent implementations of SCSI reservations out there and if they didn't quite work as expected, it would cause a problem with the quorum voting. This is one reason why qualifying shared storage devices for Sun Cluster is so time consuming and frustrating. Now, with the quorum server, you can select some other server on the network to handle the quorum voting tie breakers. This provides additional implementation flexibility and can help eliminate the vagaries of array or disk software/firmware implementations of SCSI reservations.
- Fencing protocol flexibility - use SCSI-2 or SCSI-3 reservations. Prior to Sun Cluster 3.2, the default behavior was to use SCSI-2 (non-persistent) reservations for two-node clusters and SCSI-3 (persistent) reservations for clusters of more than two nodes. This was another source of frustration (as above). Not only are SCSI-2 reservations going away, but this also removes one more compatibility speed bump in cluster designs.
- Disk-path failure handling - a node can be configured to reboot if all of the paths to the shared storage have failed. For a correctly configured cluster, this represents a multiple-failure scenario policy. This is also a commonly requested multiple-failure scenario "test," so it should create more smiles than frowns.
- More Solaris Zones support - a number of data services have been modified to now work in zones. We introduced basic zone failover support in Sun Cluster 3.1. Now we have over a dozen data services which are ready to work inside zones. This allows you to compartmentalize services to a finer grain than ever before, while still providing highly available services. For example, you might run the Sun Java System Application Server in one zone on one node and the PostgreSQL, MySQL, or Oracle database in another zone on another node. If one node fails, then the zone moves to the surviving node. Since the services are in zones, you can manage the resource policies at the zone level. Very cool.
- Better integration with Oracle RAC 10g - continued improvements in integrating RAC into a fully functional, multi-tier, highly available platform. We are often asked why use Sun Cluster with RAC when "RAC doesn't need Sun Cluster." One answer is that most people deploy RAC as part of a larger system and Sun Cluster can manage all of the parts of the system in concert (Sun Cluster as the conductor and the various services as performers.)
- More flexibility in IP addresses - no more class B networks for private interconnects. Believe it or not, some of our customers get charged for each possible IP address that a server may directly address! No kidding! Prior to Sun Cluster 3.2, we reserved a private class B IP address range for each interconnect (up to six). This was an old design decision made when the vision was that there would be thousands of nodes in a Sun Cluster and we'd better be prepared to handle that case. In reality, closely coupled clusters with thousands of nodes aren't very practical. So we've changed this to allow more flexibility and a smaller address space. Note: these weren't routable IP address ranges anyway, but that argument didn't always make it past the network police.
- Improved upgrade procedures - now dual-partition and Live Upgrade procedures are supported. This eliminates a long standing requirements gotcha: Sun Cluster "supports" Solaris; Solaris has Live Upgrade; Sun Cluster didn't "support" Live Upgrade; huh?
- Improved installation and administration - many changes here which will make life easier for system administrators.
- Improved performance and reliability - faster, better failure detection and decision making and many more hours of stress testing have improved the Sun Cluster foundation and agents. Much of this work occurs back in the labs, largely unseen by the masses. We make detailed studies of how the services work under failure conditions and use those measurements to drive product improvement and test for regressions. This is part of our high quality process built into the DNA of Sun Cluster engineering.
Whew! And this is just my favorite list! I encourage you to check out the Sun Cluster 3.2 docs, especially the Release Notes. And, of course, you can download and try Sun Cluster 3.2 for yourself (and yes, it does work on AMD/Intel platforms, including laptops!)
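The quorum algorithm mentioned above boils down to a strict-majority rule among configured votes. Here is a minimal sketch of that rule (a toy model of my own, not Sun Cluster's actual implementation):

```python
def has_quorum(votes_held, total_votes):
    # A cluster partition may continue operating only if it holds
    # a strict majority of all configured quorum votes.
    return 2 * votes_held > total_votes

# Two-node cluster: one vote per node plus a one-vote tie-breaker
# (a shared disk or, in 3.2, a quorum server) makes three total votes.
# After a split, only the node that claims the tie-breaker continues.
print(has_quorum(2, 3))  # surviving node plus tie-breaker: True
print(has_quorum(1, 3))  # isolated node alone: False
```

Without the tie-breaker vote, a two-node split leaves each side with one of two votes and neither holds a majority, which is exactly why the tie-breaking vote matters.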
Tuesday Jan 11, 2005
By relling on Jan 11, 2005
Two new books on Sun Cluster technologies are now available from your favorite bookstore. Creating Highly Available Database Solutions: Oracle Real Application Clusters (RAC) and Sun Cluster 3.x Software is the book I mentioned in my blog last June. It was written by Kristien Hens and Michael Loebmann, two excellent Sun engineers based in Europe. There is a lot of good information here on Oracle services under Sun Cluster, including some of the nuances of RAC systems which can help or hinder you. I'll give it a definite thumbs up recommendation.
Also available is Sun Cluster 3 Programming: Integrating Applications into the SunPlex Environment by Joseph Bianco, Peter Lees, and Kevin Rabito. Although the target is developers, anyone supporting a Sun Cluster will appreciate understanding how good agents are written. I believe that agents are the real key to success with clustered services and this book takes you right inside the mind of some of the best engineers at Sun. More thumbs up for this one.
Sunday Dec 05, 2004
By relling on Dec 05, 2004
It was widely reported that HP is abandoning plans for TruCluster on HP-UX. I view this as a bitter-sweet event. On the one hand, it removes a Sun Cluster competitor (sweet!). But on the other hand, TruCluster and its predecessors have been instrumental in driving cluster development in the UNIX world. Given the moves by HP over the past few years as they distance themselves from UNIX and microprocessor development, I wonder who will be able to fill the gap that their competitive spirit and excellent research leaves open. Invent doesn't mean acquire in my dictionary.
Thursday Sep 23, 2004
By relling on Sep 23, 2004
OK, it was funny at first, but now it is becoming irritating. Once again I hear about problems caused by lack of physical separation of systems. Once again it is storage related. Repeat after me:
"SANs are Networks!"
"SANs are Networks!"
"SANs are Networks!"
If you want to have network security and physical separation, that is fine, but remember, "SANs are Networks!" Even if the SAN is inside the storage array, SANs are Networks! You will need to treat a SAN with the same care you treat your networks.
Tuesday Sep 21, 2004
By relling on Sep 21, 2004
Jim Sangster offers a peek into Sun Cluster futures today at the Sun Network Computing 04Q3 shin-dig. I'd like to post the URL here, but they tried to get too fancy and use Shockwave Flash for all of the screens. Anyway, you can enter the site and navigate to the online chat. Although it is too late to ask new questions on the chat, you can read many of the interesting questions and answers. You can also see the various personalities shine in their answers. I recommend taking a look.
Tuesday Sep 14, 2004
Thursday Aug 19, 2004
By relling on Aug 19, 2004
Chris Hubbell has started blogging about his work with the Configuration and Service Tracker (CST). CST is designed to track availability of single domains. We also have the ability to track and correlate availability for clusters. We have been tracking a number of clusters running Sun Cluster 2.2, 3.0, and 3.1. This allows us to measure how well our improvements to the product are reflected in the actual availability at customer sites. The good news is that the availability is improving. Every week we get a report of the extended outages and we have a team who investigates these to determine how we can prevent them in the future.
Tuesday Aug 17, 2004
By relling on Aug 17, 2004
Q: What can you do with a billion transistors?
A: a whole lot of things!
Intel has announced that they expect to see 1 billion transistor designs in the 2005 timeframe. I'm not talking about memories, which are already at the billion transistor size. I'm talking about logic devices at that density. Wow! Way back in grad school, I did some studies on wafer-scale integration where we were trying to produce multiprocessor designs which could use, essentially, more than one design on a wafer. The defect problems led us to develop all sorts of exotic interconnect strategies and to make all sorts of irritating compromises. We always dreamed of having 100 million transistors, because then all of our design dreams would come true. Now, that density is almost everywhere. Late-model Itaniums are around 400 million transistors. The new Power5 has around 270 million transistors. Niagara will have lots of transistors too, though I don't know the exact count -- some details will be made public at the Hot Chips conference next week.
Q: So, what does all of this have to do with clusters?
A: a whole lot!
Much of the work put into clusters is intended to solve the problem of using more transistors. Let me explain. Back in the mid-80s, many companies, including the startup I was at, were trying to use lots of processors in clusters to solve compute-intensive problems. For the company I was at, the focus was on solving fluid dynamics problems. These sorts of problems can be solved with parallel systems, but it is not easy. In such systems we were always frustrated by the granularity of the division of work. If you got the proper mix of work-per-processor, memory-per-processor, and interconnect use then you could get some impressive throughput. If not, then it might suck worse than a laboring uniprocessor. Debugging programs was very, very difficult and even keeping the hardware running was a big challenge.
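That trade-off between work per processor and coordination overhead can be sketched with a toy model (my own illustration, not anything from those mid-80s codes): compute time divides across processors while communication cost grows with processor count.

```python
def parallel_time(work, n_procs, comm_cost):
    # Toy model: ideal compute time divides across processors,
    # while coordination overhead grows with processor count.
    return work / n_procs + comm_cost * (n_procs - 1)

def speedup(work, n_procs, comm_cost):
    # Speedup relative to running the whole job on one processor.
    return parallel_time(work, 1, comm_cost) / parallel_time(work, n_procs, comm_cost)

# With the right mix, throughput is impressive...
print(speedup(10_000, 100, 1.0) > 10)        # True
# ...with the wrong mix, it sucks worse than a laboring uniprocessor.
print(speedup(10_000, 100_000, 1.0) < 1.0)   # True
```

Past the sweet spot, adding processors only adds coordination cost, which is exactly the granularity frustration described above.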
Over the years, Moore's law has proven to be quite reliable in predicting what we can do next. In the mid 1980's we barely had 32-bit microprocessors which required a handful of supporting chips to make a system while the memory size was about 4 MBytes and there were no on-processor caches. Today, we can easily have on-chip caches of 4 MBytes (roughly 200 million transistors with 6-transistor SRAM designs). In 2005, Intel implies, we could see 64 MBytes of on-processor cache. But why waste all of those transistors on memory? Why not use the space to reduce the distance?
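The 6-transistor SRAM arithmetic is easy to check (this is just cell-count math, not a claim about any particular chip's design):

```python
def sram_transistors(cache_bytes, transistors_per_cell=6):
    # One SRAM cell per bit; the classic cell uses six transistors.
    return cache_bytes * 8 * transistors_per_cell

# A 4 MByte on-chip cache is roughly 200 million transistors.
print(sram_transistors(4 * 2**20))  # 201326592
```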
Q: What does distance have to do with anything?
A: everything in a cluster is a function of distance!
It is exceptionally difficult to move information faster than the speed of light. I won't say impossible because of Clarke's First Law. When we talk about on-chip timing difficulties at high speed, we do get concerned about sending data across a chip. For example, assuming a 95% phase velocity, we can move data 1 cm across a chip in about 35 picoseconds. There are a whole bunch of engineering trade-offs which make this difficult to achieve in a meaningful way on a chip, but I won't bore you with the details. When we go off-chip, this time increases dramatically. As we go across a printed circuit board, distance increases by about an order of magnitude and the engineering trade-offs further reduce the meaningful time (distance is directly proportional to time). If you go off of the printed circuit board, add more orders of magnitude.
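The 35-picosecond figure follows directly from the speed of light; here is the arithmetic, with the 95% phase velocity taken as the stated assumption:

```python
C = 299_792_458.0  # speed of light in vacuum, m/s

def propagation_ps(distance_m, velocity_factor=0.95):
    # One-way signal propagation time, in picoseconds, through a
    # medium that carries signals at the given fraction of c.
    return distance_m / (velocity_factor * C) * 1e12

print(round(propagation_ps(0.01)))  # 35 -- about 35 ps to cross 1 cm
```

Scale the distance up by an order of magnitude for a board trace, or by four orders for a 100 m network run, and the time scales right along with it.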
What this means, practically, is that you really want your processors very close to each other. In the early 1990's when Sun introduced its first symmetric multiprocessor (SMP) system designs, there was a whole body of academic literature which said that they would never scale because we couldn't get the right mix of distance and processing power. I recall being at the Usenix conference where a distinguished panelist from a well-known Unix development company stated categorically that SMPs would never scale past 4 processors. Later this became 8... soon after it was 64... today it is over 100. The key is the distance. You will not be able to build a cluster system where the distance is on the order of 100 m (e.g., Ethernet) which will be faster than a similar cluster system with a distance of 1 m (e.g., a Starcat backplane). And there is no way you could possibly beat a system with a distance of 1 cm (e.g., Niagara). In a nutshell, this is why processor designs like Niagara are very, very cool. We have the number of transistors needed to shrink the distance between processors. We are still running a cluster of sorts, but the distance between the nodes is 1 cm versus 100 m. We could have only dreamed of this back in the 1980's but it is reality today.
Q: What about clustering for availability, isn't a Niagara a SPOF?
A: well, yes, actually, but that is fodder for another day...
Thursday Jul 29, 2004
By relling on Jul 29, 2004
You might see some Sun Cluster marketing materials which try to make a big deal about Sun Cluster being integrated into the Solaris kernel. Here's my take...
When we think about being "kernel-based" there are really two (or more) possible perspectives:
- running code in the kernel context
- using kernel-based private interfaces
Running code in the kernel context is a long standing topic of discussion. It is difficult to answer the question of whether running in the kernel context is a win or not, in and of itself, for many services. Going forward, increased virtualization, for better or worse, further strains this argument. After all, if you run Solaris 10 zones on a Sun Opteron-based workstation under VMWare, where exactly is the "kernel context"? And where is the ambiguity? And where is that confounded bridge?
I consider that the benefit of Sun Cluster's kernel-based claim is really in the realm of using kernel-based, and other, private interfaces. To understand better, please see the man page on attributes(5) and note the section on Interface Stability. Sun Cluster was developed inside Sun and in close coordination with the kernel development, device driver, file system, logical volume management, systems management, and hardware teams. Features which were needed by the Sun Cluster developers were negotiated with the other teams as part of an architectural review process. In some cases the features require interfaces which are not classified as stable. When this occurs we may make a contract between the source owner and the cluster team to identify the interface and its dependencies. This allows each team to move forward with some awareness of the effects of potential future code changes. This tight relationship allows us to build systems which have intimate knowledge of the failure modes of the components, how to detect such failures quickly, how to recover quickly, and how to protect data from harm when parts of the system aren't fully functional. If you look in the Sun Cluster Error Messages Guide, you might see some possible messages and wonder how we could detect that error via a public interface. We may be using private interfaces.
It will be interesting to see how this process changes, if it changes, as we open source Solaris. Clearly, public interfaces will be preserved and private interfaces will continue to exist...
By relling on Jul 29, 2004
Later kernels in Solaris 8 and Solaris 9 have implemented CPU Offlining and Memory Page Retirement. You can read more on these in the Sun BluePrint (R.I.P.) article Solaris Operating System Availability Features by Tom Chalfant.
A question often asked is, "do we see these features being used in the field to reduce outages?" I'm doing a small study on this, and the preliminary answer is "yes!" This is goodness...
Friday Jun 25, 2004
By relling on Jun 25, 2004
Wednesday Jun 16, 2004
By relling on Jun 16, 2004
There is a class of problems I call "/etc/system viruses" which occur when someone blindly copies someone else's /etc/system settings. This can really cause problems in addition to the propagation of inappropriate settings.
For example, there is at least one Sun Cluster customer who has
in /etc/system. This setting means the system will not automatically reboot after a panic.
There may be a good reason for this setting. Perhaps this is a test cluster. Perhaps they have a 24x7x365 managed datacenter and this is their policy.
I do not recommend this as standard operating procedure for most cases because human intervention will be required to boot if there is a panic. Fortunately, panics are not very common. In general, if you are trying to build a highly available system, then you want to reduce scenarios where human intervention is required.
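If you suspect an /etc/system virus, the first step is simply to see which settings are active so each one can be questioned rather than blindly copied. Here is a small sketch of such an audit (the one file-format detail I rely on is that comment lines in /etc/system begin with an asterisk; the sample tunable shown is just an illustration):

```python
def audit_etc_system(text):
    # Return the active 'set' tunables from /etc/system contents,
    # so each one can be reviewed rather than blindly propagated.
    settings = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("*"):
            continue  # skip blank lines and comments
        if line.startswith("set "):
            settings.append(line)
    return settings

sample = "* tuned for host A -- do not copy!\nset maxusers=512\n"
print(audit_etc_system(sample))  # ['set maxusers=512']
```

Every line the audit turns up should come with an answer to "why is this set on this machine?" If nobody can answer, it is probably a copied setting.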